
Author: Cheng-Ju Wu (吳承儒)
Thesis title: Large Scale Web Data API Creation via Automatic Pagination Recognition - A Case Study on Event Extraction
Original title: 基於自動分頁預測之大規模資料應用程式介面建置 - 以活動擷取為例
Advisor: Chia-Hui Chang (張嘉惠)
Committee members:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Graduate Institute of Software Engineering
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar)
Language: Chinese
Number of pages: 55
Keywords (Chinese): ETL, 分頁預測, 序列標記, 自動化爬蟲系統
Keywords (English): ETL, Pagination prediction, Sequence labeling, Automated crawler system
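  • In traditional Web data extraction services, collecting large volumes of announcement-style data (such as news or event pages) usually requires users to mark pagination manually in the extraction system. For sites with many paginated pages, users therefore spend a great deal of time "teaching the machine how to switch pages", which prevents effective large-scale data extraction. This study recasts the problem as a sequence labeling task from NLP and proposes PRNSM, a neural sequence labeling method that also uses HTML attribute information, which most Web page labeling studies do not use, to label the pagination links in a page as "PAGE", "NEXT", or "OTHER". With monolingual training and testing, the model achieves an average macro F1 of 0.818. We also demonstrate the model's multilingual performance through zero-shot experiments, reaching an average macro F1 of 0.774 on the DE, RU, ZH, JA, and KO test sets. Finally, we combine these results with an unsupervised data extraction system to build a large-scale automated data extraction system. In a practical application to large-scale event extraction, it automatically generated 196 data APIs from 402 websites, an API creation rate close to 0.5.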
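The two figures quoted in this abstract, macro F1 and the API creation rate, can be made concrete with a small calculation. The sketch below is illustrative only: the gold and predicted tag lists are hypothetical and are not taken from the thesis, and `macro_f1` is a plain restatement of the standard metric (the unweighted mean of per-label F1), not the author's evaluation code.

```python
from collections import Counter

LABELS = ("PAGE", "NEXT", "OTHER")

def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label was wrong
            fn[g] += 1   # gold label was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Hypothetical tags for six links on one page (not thesis data).
gold = ["OTHER", "PAGE", "PAGE", "PAGE", "NEXT", "OTHER"]
pred = ["OTHER", "PAGE", "PAGE", "OTHER", "NEXT", "OTHER"]
score = macro_f1(gold, pred)

# API creation rate from the abstract: 196 APIs out of 402 websites.
creation_rate = 196 / 402
```

Because macro F1 weights every label equally, a mistake on the rare "NEXT" tag costs as much as a mistake on the far more frequent "OTHER" tag, which is presumably why it suits a task where most links are not pagination links.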


    Information extraction, transformation, and loading (ETL) tools are important for big data analysis and value-added applications, especially when the information comes from the Web. Typical Web scraping systems let users specify which pages to fetch and what data to extract from them. Although these commercial services provide a friendly graphical user interface (GUI) for guiding the system to the target pages of each data source, they do not scale, because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of telling the system how to find similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page with one of three tags: "NEXT", "PAGE", or "OTHER", where the first two guide the system to pages similar to the seed URL. To support multiple languages, we exploit the query and keywords in the links and use LASER for anchor text embedding. Experimental results show that the proposed model, called PRNSM (Pagination Recognition Neural Sequence Model), achieves an average macro F1 score of 0.774 on 6 datasets including EN, DE, RU, ZH, JA, and KO. In a practical deployment on event extraction, we are able to automatically create 196 data APIs from 402 given event source URLs.
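    As a rough illustration of the input such a sequence labeler works on, the sketch below turns a page into an ordered list of clickable links with a few simple features (HTML attributes, URL query keys, anchor text). It is a minimal sketch under stated assumptions, not PRNSM itself: the `LinkCollector` class, the chosen feature names, and the sample HTML are all hypothetical, and the neural model and LASER embeddings described in the abstract are not reproduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

class LinkCollector(HTMLParser):
    """Collect <a href> elements in document order with simple features."""

    def __init__(self):
        super().__init__()
        self.links = []     # one feature dict per link, in page order
        self._open = None   # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return
        self._open = {
            "href": href,
            "class": attrs.get("class") or "",
            "rel": attrs.get("rel") or "",
            # Query parameter names such as "page" hint at pagination.
            "query_keys": sorted(parse_qs(urlparse(href).query)),
            "text": "",
        }

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self._open["text"] = self._open["text"].strip()
            self.links.append(self._open)
            self._open = None

def link_sequence(html):
    """Return the page's links as an ordered sequence of feature dicts."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

# Hypothetical page fragment (not from the thesis dataset).
sample = """
<div class="pager">
  <a href="/events?page=1" class="num">1</a>
  <a href="/events?page=2" class="num">2</a>
  <a href="/events?page=2" rel="next">Next &raquo;</a>
</div>
<a href="/about">About</a>
"""
seq = link_sequence(sample)
```

    A labeler would then assign one of "PAGE", "NEXT", or "OTHER" to each element of `seq`. Keeping the links in document order matters, since neighboring pagination links tend to appear together and a sequence model can use that context.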

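    Table of contents
    Chinese Abstract
    Abstract
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
    2. Related Work
    2.1 Unsupervised Information Extraction Systems
    2.2 Commercial Data Extraction Services
    2.3 Pagination Tag Detection
    2.4 Sequence Labeling
    2.5 Multilingual Sentence Embeddings
    2.6 Web Page Node Representation
    3. Pagination Tag Detection
    3.1 Problem Definition
    3.2 Proposed Method
    3.2.1 Parent Node Information
    3.2.2 HTML Attribute Embedding
    3.2.3 Text Content Embedding
    3.2.4 Sequence Representation Layer
    3.2.5 Tag Prediction Layer
    3.2.6 Training Objective
    3.3 Training Analysis
    3.3.1 Dataset
    3.3.2 Experimental Settings
    3.3.3 Experimental Results
    3.3.4 Model Development Experiments
    4. Case Study - Event Extraction
    4.1 Multiple Message Splitting
    4.2 Experimental Study
    4.2.1 Dataset
    4.2.2 Final Results
    5. Conclusion
    References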

