| Graduate student: | 劉榮修 Jung-Hsiu Liu |
|---|---|
| Thesis title: | 一種網頁資訊擷取程式之自動化產生技術研發 An automatic wrapper generation for web information extraction |
| Advisor: | 陳奕明 Yi-Ming Chen |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Management - Department of Information Management |
| Academic year of graduation: | 90 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 105 |
| Chinese keywords: | 網頁資訊擷取, 擷取程式, 自動化產生技術 |
| Foreign-language keywords: | wrapper, web information extraction, automatic generation |
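The Internet is an enormous repository of information, rich in data, and it has given rise to research in information retrieval, information extraction, information integration, and information mining. At present, web information is mostly extracted with wrappers, and in recent years a considerable body of research has addressed the design of wrapper generation. Based on this literature, this study classifies wrapper generation methods into four categories: automatic analysis and learning, sample-based inductive learning, manual rule construction, and assisted rule construction. Each approach has its own strengths and weaknesses. Taken together, the common drawbacks are a narrow range of applicability, the need to build samples as the basis for learning, and the need to construct extraction rules by hand. The purpose of this study is to address these drawbacks. It designs an interactive interface that generates extraction rules automatically, and it uses the tree structure of web page tags to represent where information is located in various page formats, which widens the range of page formats that can be handled. It also provides an intuitive interface that makes it easier and simpler for users to complete the extraction setup. Finally, the system is evaluated against other systems that provide interface assistance, to show that its design is more capable and more convenient to use, and it is compared with the WIEN system to verify its effectiveness and usability.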
The World Wide Web holds a huge amount of information, and web information extraction is an important issue for it. A review of prior research, however, reveals several drawbacks: a narrow applicable domain, the cost of learning from samples, and the need to handcraft rules. We therefore present an approach to generating wrappers for web information extraction. Our contributions are as follows: (1) we develop an interactive interface that generates extraction rules automatically without any samples; (2) the extraction rules are applicable to many kinds of web page formats. Finally, we test the system on a number of web sites to assess the applicability of our wrapper generation system.
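The idea of locating information by its position in the tag tree of a page can be illustrated with a minimal sketch. This is not the thesis's actual rule format or implementation: the path notation (`tag[n]` segments joined by `/`), the names `TagPathIndexer`, `index_page`, and `extract`, and the sample page are all assumptions made for illustration, and the interactive interface that would produce such rules is not modeled. The sketch also assumes well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathIndexer(HTMLParser):
    """Map each text node to the tag path leading to it from the root."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open elements, e.g. ["html[1]", "body[1]"]
        self.counters = [{}]  # per depth: how many children of each tag seen
        self.texts = {}       # tag path -> text found at that path

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        seen = self.counters[-1]
        seen[tag] = seen.get(tag, 0) + 1
        self.stack.append(f"{tag}[{seen[tag]}]")
        self.counters.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        # Assumption: well-formed input, so the end tag closes the innermost
        # open element; mismatched end tags are ignored.
        if self.stack[-1].split("[")[0] == tag:
            self.stack.pop()
            self.counters.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def index_page(html):
    """Return a dict from tag path to the text stored at that path."""
    parser = TagPathIndexer()
    parser.feed(html)
    parser.close()
    return parser.texts


def extract(html, rules):
    """Apply hypothetical extraction rules (field name -> tag path)."""
    index = index_page(html)
    return {name: index.get(path) for name, path in rules.items()}


# Hypothetical sample page and rules, not taken from the thesis.
PAGE = """<html><body><table>
<tr><td>Stock</td><td>Price</td></tr>
<tr><td>ACME</td><td>12.5</td></tr>
</table></body></html>"""

RULES = {
    "name": "html[1]/body[1]/table[1]/tr[2]/td[1]",
    "price": "html[1]/body[1]/table[1]/tr[2]/td[2]",
}

if __name__ == "__main__":
    print(extract(PAGE, RULES))
```

In this sketch a rule is just a path, so it can in principle point at a table cell, a list item, or a paragraph in the same way, which is one plausible reading of how a tag-tree representation could cover several page formats.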