
Graduate Student: Yu-Hao Wu (吳昱豪)
Thesis Title: Multi-Task Neural Sequence Labeling for Zero-shot Cross-Lingual Boilerplate Removal (應用多任務序列標記模型於零樣本跨語言網頁模板移除之研究)
Advisor: Chia-Hui Chang (張嘉惠)
Oral Defense Committee:
Degree: Master
Department: Department of Computer Science & Information Engineering, 資訊電機學院
Year of Publication: 2021
Graduation Academic Year: 109
Language: Chinese
Number of Pages: 44
Keywords (Chinese): 序列標記, 模板移除, 多任務學習, 資訊擷取
Keywords (English): Sequence Labeling, Boilerplate Removal, Multi-task Learning, Information Extraction
Access Counts: Views: 26; Downloads: 0
Chinese Abstract:

Modern web pages contain many kinds of information. Components that are shared by many pages of the same website, such as navigation bars, banners, link lists, and footer copyright notices, are usually of little interest to users, so removing this less relevant information is desirable. The mixing of main content with such page templates increases the difficulty of applications such as information retrieval. The task of extracting the main content of a web page, or of removing its unimportant parts, is called boilerplate removal; the common approach classifies page content into two classes, boilerplate and main content. Most earlier methods trained traditional machine-learning models on large numbers of hand-crafted domain-knowledge features, such as text-based, DOM-tree, or page-structure features, whereas recent deep-learning techniques use only HTML tags and text content as features. BoilerNet, for example, achieves impressive scores for both boilerplate and content on the CleanEval dataset. However, we observe that the technique BoilerNet uses can be applied to only a single language, which does not match the environment we actually face on the Web. In this thesis we explore the potential of tag embeddings and propose two auxiliary tasks within a multi-task learning framework that extend the current mainstream boilerplate-removal technique into a multilingual model capable of removing boilerplate from arbitrary web pages, without restriction to any domain or language. Our method achieves the highest score to date on CleanEval. For evaluation we adopt Macro F1, which better reflects practical performance, and we validate cross-lingual ability with four different zero-shot experiments. In every experiment we ran, the proposed model is the current state of the art.
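The formulation above treats a page as an ordered sequence of text blocks, each to be labeled boilerplate or main content. As a minimal sketch of how a page can be flattened into such a sequence, the following uses Python's standard-library HTML parser; the tag-path feature shown here is illustrative only, not the thesis's exact feature set:

```python
# Sketch: flatten an HTML page into a sequence of (tag_path, text) units
# suitable for sequence labeling. Illustrative only; BoilerNet and the
# thesis's model define their own tag/text features.
from html.parser import HTMLParser

class BlockExtractor(HTMLParser):
    """Collect a (tag_path, text) pair for every non-empty text node."""
    def __init__(self):
        super().__init__()
        self.stack = []   # current ancestor tag path
        self.blocks = []  # ordered sequence of (tag_path, text) units

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop up to and including the matching open tag
            # (tolerates mildly malformed HTML)
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(("/".join(self.stack), text))

page = """<html><body>
<nav><a href="/">Home</a></nav>
<div><p>Main article text.</p></div>
<footer>Copyright 2021</footer>
</body></html>"""

parser = BlockExtractor()
parser.feed(page)
for path, text in parser.blocks:
    print(path, "->", text)
```

A labeler then assigns one of {main content, boilerplate} to each unit in order; here the `nav` and `footer` units would ideally be labeled boilerplate and the `p` unit main content.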


English Abstract:

Web pages often include various kinds of information, so removing irrelevant parts such as navigation bars, banners, link lists, and footer copyright notices is desirable: these components, shared by many pages of a website, are usually of little interest to users. The mixing of main content with boilerplate increases the difficulty of information retrieval. The task of extracting the main content of a web page, or removing its irrelevant parts, is called "boilerplate removal"; the common solution is to classify each web component as either boilerplate or main content. Earlier research relied on numerous hand-crafted domain-knowledge features, such as text, DOM-tree, or page-structure features, trained with traditional machine-learning techniques. Recently, deep-learning methods have tackled the task using only tag and content information; BoilerNet, for example, achieves impressive scores for both noise and content on the CleanEval dataset. However, we observed that BoilerNet can only be applied to web pages in a single language, which differs from the environment faced in reality. In this paper, we propose a multi-task learning framework that extends the existing state-of-the-art boilerplate-removal model into a multilingual model able to handle arbitrary web pages without domain or language restrictions, and our method achieves the best score on the CleanEval dataset. We also adopt the Macro F1 metric to better reflect real performance on the boilerplate-removal task, and we use four different zero-shot experiments to validate the cross-lingual ability of our method. All experimental results show that the proposed multi-task learning method is the state of the art for this task.
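The Macro F1 metric mentioned above averages the per-class F1 scores with equal weight, so a model cannot score well simply by favoring the majority class (typically boilerplate). A minimal sketch, with illustrative class names and toy labels:

```python
# Sketch of Macro F1 for two-class boilerplate removal: compute F1 per
# class, then average with equal weight. Labels and data are illustrative.
def f1(gold, pred, cls):
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred, classes=("content", "boilerplate")):
    # Equal weight per class, regardless of class frequency.
    return sum(f1(gold, pred, c) for c in classes) / len(classes)

gold = ["content", "content", "boilerplate", "boilerplate", "boilerplate"]
pred = ["content", "boilerplate", "boilerplate", "boilerplate", "boilerplate"]
print(round(macro_f1(gold, pred), 4))  # averages content F1 (2/3) and boilerplate F1 (6/7)
```

Because boilerplate blocks usually outnumber content blocks, a micro-averaged or accuracy-style score would reward over-predicting boilerplate; the macro average exposes that weakness.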

Table of Contents:
Chinese Abstract i
English Abstract ii
Table of Contents iii
List of Figures iv
List of Tables v
List of Symbols vi
1 Introduction 1
  1.1 Problem Description 1
  1.2 Motivation 2
  1.3 Goals 3
  1.4 Contributions 3
2 Literature Review 4
  2.1 Information Extraction 4
    2.1.1 Table Extraction 4
    2.1.2 Search Result Record Extraction 6
    2.1.3 Boilerplate Removal 6
  2.2 Sequence Labeling 10
  2.3 Multi-Task Learning 11
3 Model Design 13
  3.1 Problem Definition 13
  3.2 Feature Definition 14
  3.3 Multi-Task Learning 16
  3.4 Model Architecture Design 17
4 Experiments 19
  4.1 CleanEval - Original Task 20
  4.2 Zero-shot Experiments 21
  4.3 Extended Experiments 23
5 Conclusion 30
References 31

    [1] Jurek Leonhardt, Avishek Anand, and Megha Khosla. Boilerplate removal using a neural sequence labeling model. In Companion Proceedings of the Web Conference 2020, WWW’20, page 226–229, New York, NY, USA, 2020. Association for Computing Machinery.

    [2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

    [3] Tim Berners-Lee, Mark Fischetti, and Michael L. Dertouzos. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. Harper San Francisco, 1st edition, 1999.

    [4] Michael J. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang. WebTables: exploring the power of tables on the web. Proc. VLDB Endow., 1:538–549, 2008.

    [5] A. Pivk, P. Cimiano, York Sure-Vetter, M. Gams, Vladislav Rajkovic, and R. Studer. Transforming arbitrary tables into logical form with TARTAR. Data Knowl. Eng., 60:567–595, 2007.

    [6] Michael J. Cafarella, A. Halevy, Y. Zhang, D. Wang, and E. Wu. Uncovering the relational web. In WebDB, 2008.

    [7] S. Balakrishnan, A. Halevy, B. Harb, Hongrae Lee, J. Madhavan, Afshin Rostamizadeh, W. Shen, K. Wilder, F. Wu, and Cong Yu. Applying webtables in practice. In CIDR, 2015.

    [8] Julian Eberius, Katrin Braunschweig, Markus Hentsch, M. Thiele, A. Ahmadov, and W. Lehner. Building the dresden web table corpus: A classification approach. 2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), pages 41–50, 2015.

    [9] E. Crestan and P. Pantel. Web-scale table census and classification. In WSDM ’11, 2011.

    [10] Larissa R. Lautert, Marcelo M. Scheidt, and C. Dorneles. Web table taxonomy and formalization. SIGMOD Rec., 42:28–33, 2013.

    [11] Kyosuke Nishida, Kugatsu Sadamitsu, Ryuichiro Higashinaka, and Yoshihiro Matsuo. Understanding the semantic structures of tables with a hybrid deep neural network architecture. In AAAI, 2017.

    [12] Chia-Hui Chang and Shao-Chen Lui. IEPAD: information extraction based on pattern discovery. In WWW ’01, 2001.

    [13] Yanhong Zhai and B. Liu. Structured data extraction from the web based on partial tree alignment. IEEE Transactions on Knowledge and Data Engineering, 18:1614–1628, 2006.

    [14] R. Novotný, P. Vojtás, and Dusan Maruscák. Information extraction from web pages. 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, 3:121–124, 2009.

    [15] Gengxin Miao, J. Tatemura, W. Hsiung, Arsany Sawires, and L. Moser. Extracting data records from the web using tag path clustering. In WWW ’09, 2009.

    [16] Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff. CleanEval: a competition for cleaning web pages. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), page 6, Marrakech, Morocco, May 2008. European Language Resources Association (ELRA).

    [17] Miroslav Spousta, M. Marek, and Pavel Pecina. Victor: the web-page cleaning tool. In The 4th Web as Corpus Workshop (WAC4) - Can we beat Google, pages 12–17, Marrakech, Morocco, 2008. European Language Resources Association (ELRA).

    [18] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, page 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

    [19] Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. Boilerplate detection using shallow text features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM ’10, page 441–450, New York, NY, USA, 2010. Association for Computing Machinery.

    [20] Thijs Vogels, Octavian-Eugen Ganea, and Carsten Eickhoff. Web2Text: Deep structured boilerplate removal. In European Conference on Information Retrieval, pages 167–179, Grenoble, France, 2018. Springer.

    [21] S. Eddy. Hidden Markov models. Current Opinion in Structural Biology, 6(3):361–365, 1996.

    [22] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28, 1998.

    [23] J. N. Kapur. Maximum-entropy models in science and engineering. John Wiley & Sons, 1989.

    [24] Zhiheng Huang, W. Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging. ArXiv, abs/1508.01991, 2015.

    [25] Leyang Cui and Y. Zhang. Hierarchically-refined label attention network for sequence labeling. In EMNLP/IJCNLP, 2019.

    [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L. Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017.

    [27] Tao Gui, Jiacheng Ye, Qi Zhang, Z. Li, Zichu Fei, Yeyun Gong, and X. Huang. Uncertainty-aware label refinement for sequence labeling. ArXiv, abs/2012.10608, 2020.

    [28] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, page 1027–1035, Red Hook, NY, USA, 2016. Curran Associates Inc.

    [29] Ronan Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML ’08, 2008.

    [30] Ronan Collobert, J. Weston, L. Bottou, Michael Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, 2011.

    [31] R. Caruana. Multitask learning. In Encyclopedia of Machine Learning and Data Mining, 1998.

    [32] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, 2014.

    [33] Jianfei Yu and Jing Jiang. Learning sentence embeddings with auxiliary tasks for cross-domain sentiment classification. In EMNLP, 2016.
