| Graduate student: | 陳明權 Ming-chuan Chen |
|---|---|
| Thesis title: | Exploiting Dynamic Encoding and Multiple Pages for Record Boundary Detection and Data Extraction (應用動態編碼於多頁面網頁之記錄邊界偵測與資訊擷取) |
| Advisor: | 張嘉惠 Chia-hui Chang |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering |
| Year of publication: | 2014 |
| Academic year of graduation: | 102 |
| Language: | Chinese |
| Number of pages: | 45 |
| Keywords: | record boundary detection, dynamic encoding, information extraction |
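Record boundary detection is an important step in wrapper induction: the quality of the detected boundaries directly affects the subsequent alignment and the final accuracy. Most previous methods compute similarity between blocks within a single web page, which provides little information, and computing similarity over tree structures also raises the computational cost. In this thesis, we draw on multiple pages from the same website and analyze the parts that are common to the pages and the parts that differ, overcoming the lack of information in a single page. To limit the extra computation that multiple pages bring, the system mainly analyzes the leaf nodes of the DOM tree, which make up only about 30 percent of all nodes. Based on how leaf nodes are distributed across the pages, we propose dynamic encoding, which abstracts leaf nodes to highlight the regularity of records so that repeated-pattern mining yields better results. Finally, for record boundary detection, we propose the concept of landmarks: starting from the landmarks present in each record, we traverse the tree structure to infer the corresponding record boundaries. In the experiments and evaluation, we use well-known datasets to compare against several earlier systems, and the approach achieves good accuracy in all cases.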
Record boundary detection plays an important role in wrapper induction, and its quality directly affects the precision of alignment and extraction. Previous approaches usually focus on calculating the similarity between blocks or measuring tree similarity within a single page.
In this paper, we analyze multiple pages that are generated by the same website. By exploring the common and differing parts of the pages, we can overcome the weakness of single-page approaches. Because the computational load increases as more pages are processed, the proposed approach focuses only on the leaf nodes of the DOM tree, which account for about 30 percent of all nodes. We propose dynamic encoding, which abstracts leaf nodes and emphasizes the regularity of each data record. With dynamic encoding, we reduce the number of repeated patterns discovered. Finally, we propose the idea of a landmark, which is located in each data record, and detect record boundaries by segmenting the DOM tree. In the experiments, we evaluate the efficiency of our approach and compare its effectiveness with that of other systems.
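The idea of abstracting leaf nodes with evidence from several pages can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not spell out the encoding rule, so the rule used here (a text leaf that occurs with the same tag path and text in every page is treated as template text, anything else as data) is an assumption, and the names `LeafCollector`, `text_leaves`, `encode_pages`, `PAGE_A`, and `PAGE_B` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class LeafCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text leaf."""

    # Elements that never have a closing tag, so they must not be stacked.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.leaves = []  # (tag path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def text_leaves(html):
    """Return the text leaves of one page as (tag_path, text) pairs."""
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def encode_pages(pages):
    """Abstract each leaf into a code using evidence from all pages.

    Assumed rule (not taken from the thesis): a (path, text) pair that
    appears in every page is template text and keeps its text in the
    code; any other leaf is data and is encoded by its tag path alone.
    """
    per_page = [text_leaves(page) for page in pages]
    page_count = Counter()
    for leaves in per_page:
        page_count.update(set(leaves))  # count each pair once per page
    encoded = []
    for leaves in per_page:
        codes = []
        for path, text in leaves:
            if page_count[(path, text)] == len(pages):
                codes.append(f"T:{path}:{text}")
            else:
                codes.append(f"D:{path}")
        encoded.append(codes)
    return encoded


# Two toy pages assumed to come from the same site template.
PAGE_A = ("<ul><li><b>Price:</b><span>10</span></li>"
          "<li><b>Price:</b><span>12</span></li></ul>")
PAGE_B = "<ul><li><b>Price:</b><span>7</span></li></ul>"

if __name__ == "__main__":
    for codes in encode_pages([PAGE_A, PAGE_B]):
        print(codes)
```

In this toy input the encoded sequence for `PAGE_A` alternates between one template code and one data code, so the two records show up as a repeated pattern even though their prices differ. A real system would also need attributes, non-text leaves, and noisy templates, all of which this sketch ignores.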