Annotation-Free Induction of Full Schema from Template Web Pages with Dynamic Encoding


Graduate student:	Oviliani Yenty Yuliana (陳燕琴)
Thesis title:	Annotation-Free Induction of Full Schema from Template Web Pages with Dynamic Encoding
Advisor:	Professor Chang, Chia-Hui (張嘉惠)
Oral defense committee:
Degree:	Doctor
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Year of publication:	2019
Graduation academic year:	107
Language:	English
Number of pages:	87
Chinese keywords:	深度Web數據提取、劃分對齊、動態編碼、全模式歸納、多個模板頁面
English keywords:	Deep web data extraction, Divide-conquer alignment, Dynamic encoding, Full-schema induction, Multiple template pages

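Automatically extracting data from template web pages is a fundamental task in data integration and analysis. Most studies concentrate on information extraction from list pages. The data alignment problem for singleton-item pages (which contain the detailed information of a single item) has received less attention and is more challenging. In the first work, we propose a novel divide-and-conquer alignment algorithm (DCA) that operates on the leaf nodes of the DOM trees of singleton pages. The idea is to detect mandatory templates through the longest increasing subsequence of leaf nodes in landmark equivalence classes, and to recursively apply the same process to each segment delimited by the mandatory templates. DCA aligns each segment efficiently and uses a two-pass procedure to handle multi-order attribute-value pairs effectively. The results show that DCA outperforms TEX and WEIR by 2% and 12%, respectively. The improvement is more pronounced in the full schema evaluation: on 26 websites from TEX and ExAlg, the F1 measure is 0.95 (DCA) versus 0.63 (TEX).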

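In the second work, we propose unsupervised full schema web data extraction, using Divide-and-Conquer Alignment with Dynamic Encoding (DCADE), from multiple list pages or singleton pages that share the same template. We define the Content Equivalence Class and the Typeset Equivalence Class based on leaf node content. We then combine HTML attributes (id and class) in the paths for encoding at various levels, so that the proposed algorithm can align leaf nodes by exploring similar patterns at each level, from specific to general. We ran experiments on 49 websites from TEX and ExAlg. The proposed DCADE reached an F1 measure of 0.962 for non-recordset data extraction (FD) and 0.936 for recordset data extraction (FS), outperforming other page-level web data extraction methods such as DCA (FD = 0.660), TEX (FD = 0.454 and FS = 0.549), RoadRunner (FD = 0.396 and FS = 0.330), and UWIDE (FD = 0.260 and FS = 0.081).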

Automatic data extraction from template pages is an essential task for data integration and analysis. Most research focuses on data extraction from list pages. The problem of data alignment for singleton pages, which contain detailed information about a single item, is less addressed and more challenging. In the first work, we propose a novel Divide-and-Conquer Alignment algorithm (DCA) that works on leaf nodes from the DOM trees of singleton pages. The idea is to detect mandatory templates via the longest increasing subsequence from the landmark equivalence class leaf nodes and recursively apply the same procedure to each segment divided by mandatory templates. DCA aligns each segment efficiently and handles multi-order attribute-value pairs effectively with a two-pass procedure. On selected items, DCA outperforms TEX and WEIR by 2% and 12%, respectively. The improvement is more obvious in terms of full schema evaluation, with an F1 measure of 0.95 (DCA) versus 0.63 (TEX) on 26 websites from TEX and ExAlg.
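
The landmark idea in the paragraph above can be illustrated with a small sketch. This is not the thesis's implementation: it assumes each page is already flattened into a list of leaf-node codes, treats any code that occurs exactly once in both pages as a landmark candidate, and handles only two pages. The function names `longest_increasing_subsequence`, `mandatory_landmarks`, and `split_segments` are hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return the indices (into seq) of one longest strictly increasing subsequence."""
    tails = []      # tails[k]: smallest tail value of an increasing run of length k + 1
    tails_idx = []  # position in seq of that tail value
    prev = [-1] * len(seq)  # back-pointers used to rebuild the subsequence
    for i, x in enumerate(seq):
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
            tails_idx.append(i)
        else:
            tails[k] = x
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def mandatory_landmarks(page_a, page_b):
    """Assumed simplification: codes that occur once in both pages, in consistent order."""
    once_a = {c for c in page_a if page_a.count(c) == 1}
    once_b = {c for c in page_b if page_b.count(c) == 1}
    shared = [c for c in page_a if c in once_a and c in once_b]  # page_a order
    positions_in_b = [page_b.index(c) for c in shared]
    keep = longest_increasing_subsequence(positions_in_b)
    return [shared[i] for i in keep]


def split_segments(page, landmarks):
    """Split a page into the segments that lie between consecutive landmarks."""
    landmark_set = set(landmarks)
    segments, current = [], []
    for code in page:
        if code in landmark_set:
            segments.append(current)
            current = []
        else:
            current.append(code)
    segments.append(current)
    return segments


page_a = ["Title", "x1", "Price", "x2", "Stock", "x3", "Author", "x4"]
page_b = ["Title", "y1", "Author", "y2", "Price", "y3", "Stock", "y4"]
landmarks = mandatory_landmarks(page_a, page_b)
print(landmarks)                          # ['Title', 'Price', 'Stock']
print(split_segments(page_b, landmarks))  # [[], ['y1', 'Author', 'y2'], ['y3'], ['y4']]
```

In this toy input, "Author" appears in both pages but out of order, so it is dropped from the landmark list and ends up inside a segment, where a recursive call could process it again.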

In the second work, we propose an unsupervised full schema web data extraction method, Divide-and-Conquer Alignment with Dynamic Encoding (DCADE), which works on either multiple list pages or singleton pages with the same template. We define the Content Equivalence Class and the Typeset Equivalence Class based on leaf node content. We then combine HTML attributes (id and class) in the paths for various levels of encoding, so that the proposed algorithm can align leaf nodes by exploring patterns at various levels from specific to general. We conducted experiments on 49 real-world websites used in TEX and ExAlg. The proposed DCADE achieved a 0.962 F1 measure for non-recordset data extraction (FD) and a 0.936 F1 measure for recordset data extraction (FS), outperforming other page-level web data extraction methods such as DCA (FD=0.660), TEX (FD=0.454 and FS=0.549), RoadRunner (FD=0.396 and FS=0.330), and UWIDE (FD=0.260 and FS=0.081).
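
The multi-level encoding described above can be sketched as follows. The three levels used here (tag with id and class, tag with class only, tag only) are an assumption for illustration; the abstract does not state the actual levels used by DCADE. The names `encode_path` and `first_matching_level` are hypothetical.

```python
def encode_path(path, level):
    """Encode a root-to-leaf path at a given generality level.

    path:  list of (tag, id, class) tuples, one per DOM step.
    level: 0 keeps tag, id and class (most specific); 1 keeps tag and class;
           2 keeps the tag only (most general). These levels are assumed.
    """
    parts = []
    for tag, node_id, node_class in path:
        part = tag
        if level <= 0 and node_id:
            part += "#" + node_id
        if level <= 1 and node_class:
            part += "." + node_class
        parts.append(part)
    return "/".join(parts)


def first_matching_level(path_a, path_b, levels=(0, 1, 2)):
    """Return the most specific level at which two leaf paths share a code, else None."""
    for level in levels:
        if encode_path(path_a, level) == encode_path(path_b, level):
            return level
    return None


leaf_a = [("html", "", ""), ("div", "item-101", "product"), ("span", "", "price")]
leaf_b = [("html", "", ""), ("div", "item-202", "product"), ("span", "", "price")]

print(encode_path(leaf_a, 0))                # html/div#item-101.product/span.price
print(encode_path(leaf_a, 1))                # html/div.product/span.price
print(first_matching_level(leaf_a, leaf_b))  # 1
```

Here the two leaves differ only in a per-item id, so they do not match at the most specific level but do match once the id is dropped, which is the kind of specific-to-general fallback the abstract describes.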

English Abstract
Chinese Abstract
Contents
List of Figures
List of Tables
Introduction
1 Background and Motivations
2 Extracting Attribute-Value Pairs from Singleton Pages
3 Extracting Recordsets from Singleton and List Pages
4 Organization of the Dissertation
Related Work
1 Input and Output of Extraction Systems
2 Template
3 Automation Degrees
4 Page-level data extraction Methods
4.1 ExAlg
4.2 RoadRunner
4.3 FiVaTech
4.4 TEX
4.5 WEIR
4.6 AFIS
4.7 UWIDE
DCA: Divide and Conquer Alignment with Fix Encoding
1 Background and Motivations
2 Problem Definition
3 Proposed Method
3.1 Data Preprocessing
3.2 DCA Algorithm Overview and Definitions
3.3 Mandatory Template (MT) Detection
3.4 Optional Template (OT) Detection
3.5 Removing False Positive MTs
4 Alignment Phase
4.1 Multi-Order AV-Pair Alignment
4.2 Merging Disjunctive/Similar Columns
5 Experiments
5.1 Performance Comparison
5.2 Sensitivity Analysis
DCADE: Divide and Conquer Alignment with Dynamic Encoding
1 Background and Motivations
2 Problem Definition
3 Proposed Method
3.1 Data Preprocessing and Encoding Scheme
3.2 Divide-and-Conquer Alignment
3.2.1 Mandatory Template Mining in TableL
3.2.2 Pattern Mining in Segments
3.2.3 Columns Re-arrangement and Table Splitting
3.3 Summary of Proposed Method
4 Experimental Results
4.1 Baselines
4.2 Performance Comparison
4.3 Sensitivity Analysis
Conclusion and Future Work
1 Conclusion
2 Future Work
Bibliography
Appendix

