| Graduate student: | 吳承儒 Cheng-Ju Wu |
|---|---|
| Thesis title: | 基於自動分頁預測之大規模資料應用程式介面建置 - 以活動擷取為例 Large Scale Web Data API Creation via Automatic Pagination Recognition - A Case Study on Event Extraction |
| Advisor: | 張嘉惠 Chia-Hui Chang |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Graduate Institute of Software Engineering |
| Year of publication: | 2021 |
| Academic year of graduation: | 109 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 55 |
| Keywords (Chinese): | ETL, 分頁預測, 序列標記, 自動化爬蟲系統 |
| Keywords (English): | ETL, Pagination prediction, Sequence labeling, Automated crawler system |
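In traditional Web data extraction services, collecting large volumes of announcement-style data (such as news or event pages) usually requires users to annotate pagination by hand in the extraction system. On websites with a large number of paginated pages, users therefore spend a great deal of time "teaching the machine how to switch pages", which prevents effective large-scale data extraction. This study recasts the problem as a sequence labeling problem from the NLP field and proposes a neural sequence labeling method, PRNSM, which also incorporates HTML attribute information that most web page labeling studies do not use. The model labels the links in a web page as "PAGE", "NEXT", or "OTHER" and achieves an average macro F1 of 0.818 when trained and tested on a single language. We also demonstrate the model's multilingual performance through zero-shot experiments, reaching an average macro F1 of 0.774 on the DE, RU, ZH, JA, and KO test sets. Finally, we combine these results with an Unsupervised Data Extraction System to build a large-scale automated data extraction system. In a practical large-scale event extraction application, we automatically generate 196 data APIs from 402 websites, an API creation rate of nearly 0.5.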
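The macro F1 and API creation rate reported above can be illustrated with a small sketch. It assumes the standard definition of macro F1 (the unweighted mean of per-tag F1 scores) over the three tags; the function name `macro_f1` and the toy gold and predicted label sequences are hypothetical and are not taken from the thesis.

```python
from collections import Counter

# Tag set described in the abstract.
TAGS = ("PAGE", "NEXT", "OTHER")


def macro_f1(gold, pred, tags=TAGS):
    """Return the unweighted mean of the per-tag F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p, but it was wrong
            fn[g] += 1  # gold tag g was missed
    scores = []
    for tag in tags:
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        # Convention assumed here: a tag absent from both sequences scores 0.
        scores.append(2 * tp[tag] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy sequences (hypothetical): one PAGE link is mislabeled as OTHER.
gold = ["OTHER", "PAGE", "PAGE", "NEXT", "OTHER"]
pred = ["OTHER", "PAGE", "OTHER", "NEXT", "OTHER"]
score = macro_f1(gold, pred)

# API creation rate from the figures in the abstract: 196 APIs from 402 sites.
api_creation_rate = 196 / 402
print(round(score, 3), round(api_creation_rate, 3))
```

Because macro averaging weights every tag equally, an error on the rare "NEXT" or "PAGE" tags lowers the score much more than an error among the many "OTHER" links.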
Information extraction, transformation and loading (abbreviated as ETL) tools are important for big data analysis and value-added applications, especially when the information comes from the Web. Typical Web scraping systems allow users to specify where to fetch the page and what information or data to extract from it. Although these commercial services already provide a friendly graphical user interface (GUI) to guide the system to the target pages for each data source, such systems are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of telling the system how to find similar pages by locating the next page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page with one of three tags: "NEXT", "PAGE" or "OTHER", where the first two guide the system to pages similar to the seed URL. To support multiple languages, we exploit the query and keywords in the links and use LASER to embed the anchor text. The experimental results show that the proposed model, called PRNSM (Pagination Recognition Neural Sequence Model), achieves an average macro F1 score of 0.774 on 6 datasets including EN, DE, RU, ZH, JA, and KO. In terms of practical deployment on event extraction, we are able to automatically create 196 data APIs from 402 given event source URLs.
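To make the input side of this formulation concrete, the sketch below turns a page into an ordered sequence of clickable links with a few surface features (anchor text, URL query keys, and the `class` attribute). It is only an illustration under stated assumptions: the class `LinkCollector`, the function `link_features`, the chosen features, and the sample HTML are hypothetical, and the neural tagger and LASER embeddings described in the abstract are not implemented here.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class LinkCollector(HTMLParser):
    """Collect <a> elements in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr = dict(attrs)
            self._current = {
                "href": attr.get("href") or "",
                "class": attr.get("class") or "",
                "text": "",
            }

    def handle_data(self, data):
        # Accumulate anchor text only while inside an <a> element.
        if self._current is not None:
            self._current["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def link_features(link):
    """Illustrative features; the thesis's actual feature set may differ."""
    query = parse_qs(urlparse(link["href"]).query)
    return {
        "text": link["text"],
        "query_keys": sorted(query),
        "css_class": link["class"],
    }


# Sample page fragment (hypothetical).
HTML = """
<a href="/events?page=1" class="current">1</a>
<a href="/events?page=2">2</a>
<a href="/events?page=2" class="next">Next</a>
<a href="/about">About</a>
"""

collector = LinkCollector()
collector.feed(HTML)
sequence = [link_features(link) for link in collector.links]
# A sequence tagger would then assign "NEXT", "PAGE", or "OTHER" to each item.
for item in sequence:
    print(item)
```

Keeping the links in document order matters for a sequence model, since pagination links usually appear next to each other and the tag of one link is informative about its neighbours.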