Author: Hsiu-Wen Peng (彭綉雯)
Thesis title: Schema Mining and Information Extraction for PDF Documents (基於資料結構探勘 PDF 文本資訊擷取系統之設計與開發)
Advisor: Chia-Hui Chang (張嘉惠)
Committee members:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Executive Master of Computer Science & Information Engineering
Year of publication: 2024
Academic year of graduation: 112
Language: Chinese
Number of pages: 53
Chinese keywords: 序列模式挖掘, 上下文學習, 線上學習, 大型語言模型
English keywords: Sequential pattern mining, In-context Learning, Online Learning, Large Language Model
  • A vast amount of information on the internet is stored as PDF files, such as court judgments, financial reports, and admission brochures. Many applications and services need to convert this information into a structured format for downstream use. Typically, the data structure must be defined by hand, data must then be extracted according to that definition, and the extracted data is used to train a model. This process is costly in both labor and time. This study therefore focuses on how to define data structures efficiently and extract data accurately.

    This thesis combines two tasks, data mining and data extraction, into an interactive online-learning data extraction system. For the first task, the PrefixSpan technique helps users find patterns in the target documents so that they can define the documents' data structure efficiently. For the second task, the system adopts a finite-state transducer (FST), a traditional machine learning approach: from a small amount of labeled data, it learns extraction rules that follow the defined data structure and then applies those rules to complete the extraction task.

    Because data mining can uncover too many patterns, we reduce their number by excluding certain items (for example, page numbers and line numbers in the documents) and further analyze different document format types. For the data extraction task, we implemented two LLM-based extraction methods: LangChain and ChatGPT-QA. Experimental results show that LangChain outperforms ChatGPT-QA, with average F1 scores of 0.77 and 0.63, respectively. We also compared two labeling methods, manual labeling and LangChain labeling, to evaluate whether LangChain can replace manual labeling. When the FST is used for extraction, the results show that it cannot: the average F1 score is 0.91 with manual labels and 0.70 with LangChain labels.
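The pattern-mining step described in the abstract can be illustrated with a minimal sketch. It assumes that each PDF has already been converted into a sequence of line-type tokens; the token names (`TITLE`, `DATE`, `PAGE_NO`, and so on), the `EXCLUDED` set, and the `prefixspan` function are hypothetical and are not taken from the thesis. The code is a simplified single-item-per-element variant of PrefixSpan, not the system's actual implementation.

```python
from collections import defaultdict

# Hypothetical token labels; the thesis does not specify its encoding.
EXCLUDED = {"PAGE_NO", "LINE_NO"}


def prefixspan(sequences, min_support, excluded=frozenset()):
    """Return {pattern_tuple: support} for all frequent sequential patterns.

    Each sequence is a list of hashable tokens. A pattern is a subsequence
    (not necessarily contiguous) found in at least `min_support` sequences.
    """
    # Drop excluded items first so they can never appear in a pattern.
    db = [[tok for tok in seq if tok not in excluded] for seq in sequences]
    results = {}

    def mine(prefix, projected):
        # `projected` holds one (sequence_index, start_offset) per sequence
        # that contains `prefix`; the suffix from start_offset is searched.
        first_pos = defaultdict(dict)  # item -> {sequence_index: first position}
        for sid, start in projected:
            seq = db[sid]
            for pos in range(start, len(seq)):
                first_pos[seq[pos]].setdefault(sid, pos)
        for item, occurrences in first_pos.items():
            if len(occurrences) < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = len(occurrences)
            # Recurse on the database projected by the extended prefix.
            mine(pattern, [(sid, pos + 1) for sid, pos in occurrences.items()])

    mine((), [(sid, 0) for sid in range(len(db))])
    return results


# Toy input: three documents encoded as line-type sequences (illustrative only).
docs = [
    ["PAGE_NO", "TITLE", "DATE", "BODY", "BODY"],
    ["TITLE", "PAGE_NO", "DATE", "BODY"],
    ["TITLE", "AUTHOR", "DATE", "BODY", "PAGE_NO"],
]

patterns = prefixspan(docs, min_support=3, excluded=EXCLUDED)
for pattern, support in sorted(patterns.items(), key=lambda kv: -len(kv[0])):
    print(" > ".join(pattern), support)
```

On this toy input, filtering out `PAGE_NO` before mining removes every pattern that contains it, which is the kind of reduction the abstract attributes to item exclusion. Raising `min_support` is a second, independent way to shrink the result set.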

    Chinese Abstract
    Abstract
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
        1-1 Motivation and Goals
        1-2 Contributions
    2. Related Work
        2-1 Email Information
            2-1-1 Mailparser
            2-1-2 Parsio
        2-2 Web Page Information
            2-2-1 Octoparse
            2-2-2 ParseHub
            2-2-3 Mozenda
            2-2-4 Web Scraper
        2-3 PDF File Information
            2-3-1 Single Extraction Type
            2-3-2 Packages and Tools for Developers
            2-3-3 Data Extraction Platforms
    3. PDFEX System Architecture
        3-1 Design Concept
        3-2 System Architecture
        3-3 Module 1: Schema Mining
        3-4 Module 2: Text Extraction
        3-5 Rule Generalization
    4. Experiments and Discussion
        4-1 Datasets
        4-2 Schema Mining
        4-3 Text Extraction
            4-3-1 Comparing the Extraction Performance of Different LLM Application Methods
            4-3-2 Comparing FST Extraction Performance under Different Labeling Methods
        4-4 Evaluation Method
        4-5 Extraction Failure Analysis
    5. Conclusion and Future Work
    References
    Appendix 1
        A-1 Dataset Examples
        A-2 Extraction Results from Tests on Other Platforms

