On-Line Extraction Rule Analysis (線上擷取規則分析) | National Central University Electronic Theses and Dissertations System


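Graduate student:	Shih-Chien Kuo (郭釋謙)
Thesis title:	On-Line Extraction Rule Analysis (線上擷取規則分析)
Advisor:	Chia-Hui Chang (張嘉惠)
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Academic year of graduation:	91 (ROC calendar)
Language:	Chinese
Number of pages:	39
Keywords (Chinese):	資訊整合 (information integration), 資料檢索 (data retrieval), 資訊擷取 (information extraction)
Keywords (English):	Information Integration, Information Extraction, Information Retrieval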

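As the Internet has grown, more and more information is presented as HTML, with useful and useless content mixed together, so users often spend a great deal of time looking for data. An information extraction system addresses this by presenting its input in a structured form, which in turn makes it possible to integrate data and build richer search engines.
The most direct way to design an information extraction system is to hand-write a data-extraction wrapper for each Web site. Because a site's format can change at any time, however, the greatest challenge in designing an extraction system is generating the extraction programs quickly and automatically.
Wrapper induction methods have been proposed since 1997: the user labels example Web pages to tell the system what to extract, the system generates extraction rules, and the rules are then applied to extract information from the site. Approaches based on labeled example pages achieve good extraction rates, but producing the rules requires a very tedious labeling process, which is inconvenient for users. Reducing the labeling required of users is therefore a major challenge in system design. Current systems that need no user labeling, such as IEPAD, handle only pages containing multiple records and offer no solution for single-record pages. In view of this, this thesis proposes an effective method for building an automated information extraction system that lets users extract the data completely without tedious labeling and that handles both single-record and multiple-record pages.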

The vast amount of online information available has led to renewed interest in information extraction (IE) systems, which analyze input documents to produce a structured representation of selected information from them. However, the design of an IE system differs greatly according to its input, which ranges from unrestricted free text to semi-structured Web documents. This paper extends an automatic pattern discovery approach called IEPAD to the rapid generation of IE systems that can extract structured data from semi-structured Web documents. In this novel framework, extraction rules can be trained not only from a multiple-record Web page but also from multiple single-record Web pages (called singular pages). Above all, this framework requires none of the annotation labor that many machine-learning-based approaches require. Evaluation results show a high level of system performance.
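
To make the idea of discovering extraction patterns without labeled examples more concrete, the following is a minimal sketch, not the thesis's or IEPAD's actual algorithm. It assumes a multiple-record page whose records share one tag sequence, and it finds that sequence by brute-force search over token n-grams. The function names (tokenize, find_matches, best_repeat, extract) and the sample HTML are hypothetical and chosen only for illustration.

```python
import re

def tokenize(html):
    """Split HTML into tag tokens and TEXT placeholders, remembering the text."""
    tokens, texts = [], []
    for m in re.finditer(r"<(/?\w+)[^>]*>|([^<]+)", html):
        if m.group(1):                      # an opening or closing tag
            tokens.append("<" + m.group(1).lower() + ">")
            texts.append(None)
        elif m.group(2).strip():            # non-empty text between tags
            tokens.append("TEXT")
            texts.append(m.group(2).strip())
    return tokens, texts

def find_matches(tokens, pattern):
    """Return start indices of non-overlapping occurrences, scanning left to right."""
    starts, i, n = [], 0, len(pattern)
    while i + n <= len(tokens):
        if tuple(tokens[i:i + n]) == pattern:
            starts.append(i)
            i += n
        else:
            i += 1
    return starts

def best_repeat(tokens, min_len=3):
    """Pick the repeated n-gram covering the most tokens (occurrences * length)."""
    best, best_score = None, 0
    for n in range(min_len, len(tokens) // 2 + 1):
        candidates = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        for gram in candidates:
            count = len(find_matches(tokens, gram))
            if count >= 2 and count * n > best_score:
                best, best_score = gram, count * n
    return best

def extract(tokens, texts, pattern):
    """Apply the discovered pattern as an extraction rule: one record per match."""
    n = len(pattern)
    return [[t for t in texts[s:s + n] if t is not None]
            for s in find_matches(tokens, pattern)]

# Hypothetical multiple-record page used only for illustration.
html = ("<ul><li><b>Title A</b> 10 USD</li>"
        "<li><b>Title B</b> 12 USD</li>"
        "<li><b>Title C</b> 8 USD</li></ul>")
tokens, texts = tokenize(html)
pattern = best_repeat(tokens)
records = extract(tokens, texts, pattern)
print(pattern)
print(records)
```

The brute-force search is quadratic at best and is only meant for toy inputs; a real system would need an index structure for repeats and some tolerance for records that differ slightly from one another.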

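Chapter 1 Introduction
Chapter 2 Related Work
2.1 Information Extraction Systems That Require User Labeling
2.2 Information Extraction Systems That Require No Labeling
2.3 WYSIWYG Information Extraction Systems
Chapter 3 System Architecture
3.1 Example
3.2 Enclosing the Target Region
3.3 Generalization
3.4 Specifying Detailed Data
3.5 Multiple Enclosing
3.6 Extraction Rules
Chapter 4 Extractor
Chapter 5 Experimental Results and Discussion
5.1 Extracting Multiple-Record Pages
5.2 Extracting Singular Pages
Chapter 6 Conclusion and Future Work
References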

                                

