Exploiting Dynamic Encoding and Multiple Pages for Record Boundary Detection and Data Extraction


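Graduate student:	Ming-chuan Chen (陳明權)
Thesis title:	Exploiting Dynamic Encoding and Multiple Pages for Record Boundary Detection and Data Extraction
Advisor:	Chia-hui Chang (張嘉惠)
Committee members:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Year of publication:	2014
Academic year of graduation:	102 (ROC calendar)
Language:	Chinese
Number of pages:	45
Keywords:	record boundary detection, dynamic encoding, data extraction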

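Record boundary detection is a key step in wrapper induction: the quality of the detected boundaries directly affects the subsequent alignment and the final accuracy. Previous methods mostly compute the similarity between blocks within a single web page, which offers limited information, and computing similarity over tree structures further raises the computational cost. In this thesis, we examine multiple pages from the same website and analyze the parts they share and the parts that differ, overcoming the lack of information in a single page. To limit the extra computation that multiple pages introduce, the system mainly analyzes the leaf nodes of the DOM tree, which account for only about 30 percent of all nodes. Based on how leaf nodes are distributed across the pages, we propose dynamic encoding, which abstracts leaf nodes to highlight the regularity of records so that repeated pattern mining yields better results. Finally, for detecting record boundaries, we propose the concept of landmarks: using the landmarks present in each record and traversing the tree structure, the system infers the corresponding record boundaries. In the experiments and evaluation, we use well-known datasets to compare against several earlier systems, and the approach achieves good accuracy on all of them.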

Record boundary detection plays an important role in wrapper induction, and its quality directly affects the precision of alignment and extraction. Previous approaches usually focus on calculating the similarity between blocks or on measuring tree similarity within a single page.
In this paper, we analyze multiple pages generated by the same website. By exploring the common and the differing parts of these pages, we can overcome the weakness of single-page approaches. Because the computational load increases with the number of pages, the proposed approach focuses only on the leaf nodes of the DOM tree, which make up about 30 percent of all nodes. We propose dynamic encoding, which abstracts leaf nodes and emphasizes the regularity of the data records. With dynamic encoding, we reduce the number of repeated patterns discovered. Finally, we propose the idea of landmarks, which are located in the data records, and detect record boundaries by segmenting the DOM tree. In the experiments, we evaluate the efficiency of our approach and compare its effectiveness with that of other systems.
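
To make the idea of abstracting leaf nodes across several pages more concrete, here is a minimal sketch in Python. It is not the thesis's algorithm: the encoding rule is not spelled out in this record, so the rule used here (a leaf whose tag path and text appear on every page keeps its own symbol, any other leaf is reduced to its tag path) is an assumption made purely for illustration. The names `LeafCollector`, `leaf_nodes`, and `encode_pages` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect (tag path, text) pairs for the text leaves of an HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.leaves = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; tolerate unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def leaf_nodes(html):
    """Return the text leaves of one page as (tag path, text) pairs."""
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def encode_pages(pages):
    """Encode each page as a sequence of abstract leaf symbols.

    Assumed rule (illustrative only): a leaf seen on every page is treated
    as template text ("T:" symbol, path plus text); any other leaf is
    treated as data and abstracted to its tag path alone ("D:" symbol).
    """
    all_leaves = [leaf_nodes(page) for page in pages]
    seen_on = defaultdict(set)  # leaf -> indices of pages that contain it
    for index, leaves in enumerate(all_leaves):
        for leaf in leaves:
            seen_on[leaf].add(index)

    encoded = []
    for leaves in all_leaves:
        sequence = []
        for path, text in leaves:
            if len(seen_on[(path, text)]) == len(pages):
                sequence.append(f"T:{path}:{text}")
            else:
                sequence.append(f"D:{path}")
        encoded.append(sequence)
    return encoded
```

Under this assumed rule, two result pages that share a label such as "Price" but list different values produce symbol sequences in which the label and the abstracted value repeat once per record. That repetition is the kind of regularity a repeated pattern miner could then look for.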

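Table of Contents
Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1.    Introduction
2.    Related Work
3.    Method
3.1    Preprocessing
3.2    Dynamic Encoding
3.3    Record Boundary Detection
3.3.1 Landmark Detection
3.3.2 Repeated Pattern Mining
3.3.3 Record Boundary Detection Algorithm
3.4    Record Boundary Refinement
3.4.1 Removing YCAs with Too Few Nodes
3.4.2 Removing YCAs with Unbalanced Node Counts
3.4.3 Keeping YCAs Mutually Independent
3.4.4 Recovering Missed Records
4.    Experiments
4.1    Efficiency Evaluation
4.2    Evaluation of Record Boundary Detection Results
4.3    Evaluation of the Improvement from Landmarks
5.    Conclusion and Future Work
References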



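List of Figures
Figure 1: Example of a list-style web page
Figure 2: System flowchart
Figure 3: Example web page
Figure 4: Source code of the example web page
Figure 5: DOM tree of the example web page
Figure 6: Illustration of node merging
Figure 7: Example web page for patterns
Figure 8: Set of instances corresponding to a pattern
Figure 9: YCA and FDA
Figure 10: Record boundary detection algorithm
Figure 11: Formula for node-count balance
Figure 12: Case where one YCA contains another
Figure 13: Case with the same YCA node and FDAs at the same level
Figure 14: Case with the same YCA node and FDAs at different levels
Figure 15: Virtual FDA
Figure 16: Example of record refinement
Figure 17: Example of a table-style web page
Figure 18: Example of an alignment result
Figure 19: Evaluation formulas
Figure 20: Execution time versus number of leaf nodes
Figure 21: Number of analyzed pages versus detection performance (TBDW)
Figure 22: Number of analyzed pages versus detection performance (ViNTs)
Figure 23: Comparison of flowcharts
Figure 24: Comparison with and without record boundary detection
Figure 25: Comparison with and without landmark detection
Figure 26: Evaluation of three strategies on websites without landmarks
Figure 27: Web page with structural problems caused by a missing </font> closing tag
Figure 28: Multiple record regions containing the same landmark
Figure 29: Web pages with nested structure and nested content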

                                                 
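List of Tables
Table 1: Occurrence vectors for the example web page
Table 2: Statistics on the proportion of leaf nodes
Table 3: Overall comparison of execution efficiency (TBDW)
Table 4: Overall comparison of execution efficiency (ViNTs)
Table 5: Evaluation of the system's record region detection results
Table 6: Evaluation of the system's record boundary detection results
Table 7: Comparison of the system with other methods
Table 8: Evaluation of the improvement from landmarks
Table 9: Number of landmarks and number of patterns
Table 10: Websites where landmarks make a clear difference (RPM and LD+RPM)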

