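| Graduate student: | 黃陳科 Chen-Ko Huang |
|---|---|
| Thesis title: | 具線上學習功能之新型擷取程式 A Novel Wrapper with the On-Line Learning Capability |
| Advisor: | 蘇木春 Mu-Chun Su |
| Committee members: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering |
| Academic year of graduation: | 93 |
| Language: | Chinese |
| Number of pages: | 57 |
| Keywords (Chinese): | 擷取規則 (extraction rule), 包覆程式 (wrapper), 擷取程式 (extraction program) |
| Keywords (English): | extraction rule, wrapper |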
With the Internet now widespread, a great deal of information is stored in databases and presented through web pages. At present, most such pages are generated by Common Gateway Interface (CGI) programs, and all pages produced by the same CGI program follow fixed rules. These rules can therefore be applied in reverse to extract the underlying data record by record; such a rule is called an extraction rule. A program that uses extraction rules to recover information from the database behind a set of web pages is called a wrapper (or extraction program).

A wrapper not only extracts the information presented on a web page but also converts and stores it in a user-defined format, so that the processed data can be integrated further. Because the volume of information on the Internet is overwhelming, a learnable information extraction system that generates wrappers automatically makes web information easier to integrate and spares the user from tedious labeling. In other words, the information extraction system must derive extraction rules from the content marked for extraction in the training pages and pass them to the wrapper. With these considerations in mind, this thesis develops a new signal-based method, called the "histogram and boundary tag-based correlation coefficient," which finds correlation features between the examples marked by the user and the web page. The method is used to implement the complete extraction system: a wrapper with on-line learning capability whose extraction rules can cope with the diversity of web pages.
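The idea of learning an extraction rule from a marked example and reusing it on other pages from the same generator can be illustrated with a minimal Python sketch. It is not the thesis's histogram and boundary tag-based correlation coefficient, whose details the abstract does not give. It simply assumes that a rule is the pair of tags immediately surrounding the marked text. The function names `learn_rule` and `apply_rule` and the sample pages are hypothetical.

```python
import re


def learn_rule(page: str, example: str) -> tuple:
    """Return the (left, right) tags that enclose the user-marked example.

    Assumption for illustration only: a rule is just the tag right before
    and the tag right after the marked text.
    """
    start = page.index(example)
    end = start + len(example)
    left = re.search(r"<[^<>]+>\s*$", page[:start])   # tag ending just before the example
    right = re.match(r"\s*<[^<>]+>", page[end:])      # tag starting just after the example
    if left is None or right is None:
        raise ValueError("marked example is not enclosed by tags")
    return left.group().strip(), right.group().strip()


def apply_rule(page: str, rule: tuple) -> list:
    """Extract every text span enclosed by the rule's boundary tags."""
    left, right = rule
    pattern = re.escape(left) + r"\s*(.*?)\s*" + re.escape(right)
    return re.findall(pattern, page, flags=re.DOTALL)


# Two hypothetical pages produced by the same generator (same layout).
TRAIN_PAGE = (
    "<ul><li><b>Wrapper Induction</b> <i>1997</i></li>"
    "<li><b>STALKER</b> <i>1998</i></li></ul>"
)
NEW_PAGE = (
    "<ul><li><b>IEPAD</b> <i>2001</i></li>"
    "<li><b>XWRAP</b> <i>2000</i></li></ul>"
)

# The user marks one title on the training page; the rule is reused elsewhere.
rule = learn_rule(TRAIN_PAGE, "Wrapper Induction")
titles = apply_rule(NEW_PAGE, rule)
print(rule, titles)
```

A single marked example is enough here only because both pages share one fixed layout. Real pages vary more, which is the situation the thesis's correlation-based matching and on-line learning are meant to address.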