Graduate student: Shu-Han Hu (胡姝涵)
Thesis title: Conference Information Extraction: Segmentation Base Approach (會議公告網站資訊擷取之研究)
Advisor: Chia-Hui Chang (張嘉惠)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Executive Master of Computer Science & Information Engineering
Academic year of graduation: 94 (ROC calendar)
Language: Chinese
Number of pages: 50
Keywords: Information Extraction (資訊擷取), Machine Learning (機器學習)
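    As information technology advances, the speed and convenience of the Internet have led web pages to gradually replace paper as the main way of presenting information. The richness and variety of web presentation, however, make extracting useful information effectively a major challenge. Information extraction takes unstructured data and, through organizing and filtering, integrates it into structured data from which useful information can then be extracted. The most direct way to build an extractor is to hand-write extraction rules for each web site, producing a system tailored to that site. But a site's format can change at any time, and sites built by different authors use different formats, so a different extraction program has to be written or revised each time, which is very uneconomical. The main goal in designing an information extraction program is therefore to handle different site formats automatically. Automated extraction relies on machine learning: giving the computer the ability to learn knowledge and extraction rules from past experience, so that it can extract the correct information on its own.
    This thesis targets international conference announcement web sites and extracts conference information posted by different announcers, including the conference name, conference location, conference date, and paper acceptance date. Conference announcements are mostly plain text, they are written by different announcers, and the sites are bulletin-style. The content is mostly expressed in short, informal language and has no fixed structure, which makes integrating and extracting the information difficult. To extract the correct information effectively, this thesis applies machine learning so that the computer can learn to extract international conference information posted by different announcers automatically, with good results.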


    With the progress of information technology, web pages are rapidly replacing traditional sheets of paper. The variety and abundance of content in web pages make extracting useful information far more difficult than before. Information extraction technology allows us to extract such information from unstructured data through a series of processes such as organizing, filtering, and integration. The most straightforward approach is to construct an extraction system manually to match the characteristics of each individual web site, but because page structure can change and designers' personal styles differ, this approach is not cost-effective. Automated extraction is therefore the goal most worth pursuing.
    This thesis focuses on extracting conference information, such as conference names, locations, dates, and paper acceptance dates, from DB World and international conference web pages. Because bulletin-style conference pages are text-rich and are written informally by different individuals without any structural harmonization, integrating and extracting their information is difficult. The system, which is built on machine learning techniques, is validated to perform well at extracting specific fields from pages across different web sites.

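    Chapter 1 Introduction
    1.1 Research Background and Motivation
    1.2 Design Overview
    1.3 Thesis Organization
    Chapter 2 Related Work and Techniques
    2.1 The SRV System
    2.2 The Rapier System
    2.3 The STALKER System
    2.4 GATE ANNIE
    2.5 Naïve Bayes Classifier
    2.6 SVM
    2.7 The FOIL Algorithm
    Chapter 3 Design and Implementation
    3.1 Conference Name
    3.1.1 Conference Name Segmentation (Sliding Windows)
    3.1.2 Conference Name Tokenization
    3.2 Conference Location, Conference Date, and Paper Acceptance Date
    3.2.1 Conference Location, Conference Date, and Paper Acceptance Date: Tokenization
    3.2.2 Conference Location, Conference Date, and Paper Acceptance Date: Contextual Rule
    Chapter 4 Experiments and Discussion
    4.1 Conference Name Results
    4.2 Conference Location Results
    4.3 Conference Date Results
    4.4 Paper Acceptance Date Results
    4.5 Discussion
    Chapter 5 Conclusion and Future Work
    References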

    [1] Dayne Freitag. Information Extraction from HTML: Application of a General Machine Learning Approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 517–523, 1998.
    [2] Dayne Freitag. Machine Learning for Information Extraction in Information Domains. Ph.D. thesis, Carnegie Mellon University, 1998.
    [3] M.E. Califf, and R.J. Mooney. Relational learning of pattern-match rules for information extraction. In Proceedings of the 16th National Conference on AI, 328-334, 1999.
    [4] M.E. Califf, and R.J. Mooney. Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction. Journal of Machine Learning Research 4 (2003) 177-210
    [5] M.E. Califf. Relational Learning Techniques for Natural Language Information Extraction. Ph.D. thesis, The University of Texas at Austin, 1998. Technical Report AI98-269.
    [6] I. Muslea, S. Minton, and C. Knoblock. A hierarchical approach to wrapper induction. In Proceedings of 3rd International Conference on Autonomous Agents (Agents-99), pp. 190-197, Seattle, Washington, 1999.
    [7] Chun-Nan Hsu. Initial Results on Wrapping Semi-structured Web Pages with Finite-State Transducers and Contextual Rules. In Proceedings of AAAI-98 Workshop on AI and Information Integration, Technical Report WS-98-01. 1998.
    [8] Chun-Nan Hsu. and Chien-Chi Chang. Finite-state transducers for semi-structured text mining. In Proceedings of IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pp. 38-49, Stockholm, Sweden, 1999.
    [9] C. H. Chang and S.C. Lui. IEPAD: Information Extraction Based on Pattern Discovery. In Proceedings of 10th International Conference on World Wide Web, pp. 681-688, 2001.
    [10] J. Wang, and F.H. Lochovsky. Data Extraction and Label Assignment for Web Databases. In Proceedings of the Twelfth International Conference on World Wide Web, pages 187-196, 2003.
    [11] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’03), Page 24 - 27, 2003.
    [12] Muggleton, S. , and Feng, C. Efficient induction of Logic Programs. In Muggleton, S., ed., Inductive Logic Programming. New York: Academic Press. 281-297, 1992.
    [13] Zelle, J. M., and Mooney, R. J. Combining top-down and bottom-up methods in inductive logic programming. In Proceedings of the Eleventh International Conference on Machine Learning, 343-351. 1994.
    [14] Muggleton, S. Inverse entailment and Progol. New Generation Computing Journal 13:245 – 286. 1995
    [15] Developing Language Processing Components with GATE Version 3 (a User Guide) , http://gate.ac.uk/sale/tao The University of Sheffield 2001-2005
    [16] GATE – An Application Developer’s Guide http://www.dcs.shef.ac.uk/~valyt Department of Computer Science University of Sheffield, UK. 19 July 2004
    [17] Tom Kenter, Diana Maynard Using GATE as an Annotation Tool 28th January 2005
    [18] Tom M. Mitchell, Carnegie Mellon University, Machine Learning
    [19] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques
    [20] Richard J. Roiger, Michael W. Geatz, Data Mining: A Tutorial-Based Primer
    [21] Weka The University of Waikato http://www.cs.waikato.ac.nz/ml/weka/
    [22] Coenen, F. LUCS-KDD implementations of the FOIL, PRM and CPAR algorithms, http://www.cxc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/, Department of Computer Science, The University of Liverpool, UK. (2004)
    [23] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, pp. 121-167, 1998.
    [24] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. Department of Computer Science and Information Engineering, NTU.
    [25] LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/
