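Graduate Student: 黃雅筠
Ya-yun Huang
Thesis Title: 基於已知名稱搜尋結果的網路實體辨識模型建立工具
A Tool for Web NER Model Generation Based on Search Snippets of Known Entities
Advisor: 張嘉惠
Oral Examination Committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Year of Publication: 2015
Academic Year of Graduation: 103 (ROC calendar)
Language: English
Number of Pages: 45
Chinese Keywords: 命名實體辨識, 協同訓練, Tri-Training
English Keywords: Named Entity Recognition, Co-Training, Tri-Training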
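  • Past research on named entity recognition (NER) has focused mainly on person, location, and organization names in formal text such as news reports, with comparatively little attention paid to informal text on the Web. As a result, existing recognition models perform noticeably worse on Web page content, and recognizing named entities in Web pages requires retraining the model. Training a model, however, is very costly in time and labor: it involves preparing a large amount of training data in advance and collecting and labeling answers by hand, and, to improve recognition performance, the data must be segmented appropriately, its symbols unified, and its text normalized, in addition to designing features and preparing a dictionary of known terms. This work is tedious and complicated, and it must be repeated for each new language or recognition topic. The tool is designed to address this cost. It automatically labels training data using the search results of given known entity names and combines this with Tri-training, a semi-supervised training method, to produce an NER model. Experiments confirm that the tool can be applied to named entity recognition across different languages and entity types: performance reaches 86.1% for Chinese organization names, 80.3% for Japanese organization names, 83.2% for English organization names, and 84.5% for Chinese location names, a different recognition topic. For longer named entities, performance reaches 97.2% for Chinese addresses and 94.8% for English addresses.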
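The automatic labeling step described above can be illustrated with a minimal sketch. It assumes character-level B/I/O tags and longest-first exact string matching; the function name `exact_match_label` and the tagging scheme are hypothetical choices for illustration, not the tool's documented implementation.

```python
def exact_match_label(snippet, known_names):
    """Tag each character of `snippet` as B, I, or O by exact matching.

    Hypothetical helper: names are tried longest first so that a short
    name never splits a longer name that contains it.
    """
    tags = ["O"] * len(snippet)
    for name in sorted(set(known_names), key=len, reverse=True):
        if not name:
            continue  # an empty name would match everywhere
        start = snippet.find(name)
        while start != -1:
            end = start + len(name)
            # Label the span only if no part of it is already labeled.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
            start = snippet.find(name, start + 1)
    return list(zip(snippet, tags))
```

Calling `exact_match_label(snippet, names)` on a search snippet returns one `(character, tag)` pair per character, which is the general shape of input that a sequence labeler is usually trained on.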


    Named entity recognition (NER) is of vital importance in information extraction and natural language processing. Current NER systems are trained mainly on journalistic documents such as news articles to extract person, location, and organization names. Because they have not been trained on informal documents, their performance drops on Web documents, which are noisy and less structured; state-of-the-art NER systems therefore do not work well on the Web. Users who want to recognize named entities in Web documents have to train a new model, which is labor intensive and time consuming. The preparatory work includes preparing a large set of training data, labeling named entities, selecting an appropriate segmentation, unifying symbols, normalizing text, designing features, and preparing a dictionary, and it must be repeated for each language and each entity type. In this research, we propose an NER model generation tool for effective Web entity extraction. The tool takes a semi-supervised learning approach to NER, combining automatic labeling with tri-training, and makes use of unlabeled data and structured resources containing known named entities. Experiments confirm that the tool can be applied to different languages and to various types of named entities. For Chinese organization name extraction, the generated model achieves an F1 score of 86.1% on 38,692 sentences with 16,241 distinct names. The F1 scores for Japanese organization names, English organization names, Chinese location names, Chinese addresses, and English addresses are 80.3%, 83.2%, 84.5%, 97.2%, and 94.8%, respectively.
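As a rough illustration of the tri-training idea mentioned in the abstract, the sketch below trains three models on bootstrap samples of the labeled data and then repeatedly retrains each one on the labeled data plus the unlabeled examples on which the other two agree. It is a simplified sketch under stated assumptions: the error-rate checks of the full tri-training algorithm are omitted, and the toy `NearestCentroid` classifier, `tri_train`, and `vote` are hypothetical names standing in for whatever sequence labeler and training loop the tool actually uses.

```python
import random


class NearestCentroid:
    """Toy classifier used only to keep the sketch self-contained."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, x):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


def tri_train(X, y, unlabeled, make_model, max_rounds=10, seed=0):
    """Simplified tri-training: no error-rate conditions, fixed round limit."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(3):
        # Each initial model sees a different bootstrap sample of the labeled data.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(make_model().fit([X[i] for i in idx], [y[i] for i in idx]))
    previous = None
    for _ in range(max_rounds):
        pseudo = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # Examples on which the two *other* models agree become
            # pseudo-labeled training data for model i.
            agreed = []
            for x in unlabeled:
                label = models[j].predict(x)
                if label == models[k].predict(x):
                    agreed.append((x, label))
            pseudo.append(agreed)
        if pseudo == previous:
            break  # no model received new pseudo-labels, so stop
        for i in range(3):
            extra_x = [x for x, _ in pseudo[i]]
            extra_y = [label for _, label in pseudo[i]]
            models[i] = make_model().fit(list(X) + extra_x, list(y) + extra_y)
        previous = pseudo
    return models


def vote(models, x):
    """Final prediction by majority vote of the three models."""
    preds = [m.predict(x) for m in models]
    return max(set(preds), key=preds.count)
```

The design point to note is that each model learns only from the agreement of the other two, which is what lets unlabeled data contribute without any single model reinforcing its own mistakes.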

    Chinese Abstract
    English Abstract
    Table of Contents
    List of Figures
    List of Tables
    I. INTRODUCTION
        1.1. Motivation
        1.2. Thesis Organization
    II. RELATED WORK
    III. SYSTEM ARCHITECTURE
        3.1. Data Collection and Automatic Labeling Modules
        3.2. String Split and Tagging Module
        3.3. Feature Mining Module
        3.4. Self-Testing and Tri-Training
    IV. EXPERIMENT
        4.1. Data Set
        4.2. Comparing on High-frequency Tokens Dictionary Size
        4.3. The performance on various NER tasks
        4.4. The Performance of Manual Generate Dictionary
        4.5. The Performance Influence of Self-Testing and Tri-Training
        4.6. ExactMatchLabeling and AlignmentLabeling
    V. CONCLUSION
    Reference

