A Novel Wrapper with the On-Line Learning Capability | National Central University Electronic Theses and Dissertations System


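Graduate student:	Chen-Ko Huang (黃陳科)
Thesis title:	A Novel Wrapper with the On-Line Learning Capability (具線上學習功能之新型擷取程式)
Advisor:	Mu-Chun Su (蘇木春)
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Academic year of graduation:	93 (ROC calendar)
Language:	Chinese
Number of pages:	57
Chinese keywords:	擷取規則 (extraction rule), 包覆程式 (wrapper), 擷取程式 (extraction program)
English keywords:	extraction rule, wrapper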

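With the growth of the Internet, a great deal of information is stored in databases and presented through web pages. These pages are currently generated by Common Gateway Interface (CGI) programs, and all pages produced by the same CGI program follow a fixed set of rules. This thesis can therefore apply those rules in reverse to extract the data record by record; such a rule is called an extraction rule. A program that uses extraction rules to recover information from the database behind a web page is called an extraction program, or wrapper. A wrapper extracts the information source of a web page and stores it in a user-defined format, so that the processed data can be integrated further. Because the Internet is flooded with information, a learnable information extraction system that generates wrappers automatically makes it easier to integrate web information and spares the user overly tedious marking. In other words, the information extraction system must generate the corresponding extraction rules from the content to be extracted in the training pages and pass them to the wrapper for processing. With these considerations in mind, this thesis develops a new signal-based method that finds the correlation features between the user-marked example and the web page. The method, which this thesis calls the “histogram and boundary tag-based correlation coefficient,” is used to implement the whole extraction system: a wrapper that generates extraction rules in response to the diversity of web information and has on-line learning capability.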

Because the Internet has become so widespread, a great amount of information is now stored in databases and made accessible through web pages. At present, most web pages are generated by Common Gateway Interface (CGI) programs and therefore follow fixed rules. We can thus extract the information on a web page by using these fixed rules, known as “extraction rules.” A program that extracts information from web pages on the basis of extraction rules is called a “wrapper.”
A wrapper not only extracts the information presented on a web page but also transforms and saves it in a user-defined format, so that the information can be processed further. Given the overwhelming scale of information on the Internet, an information extraction system with learning capability can integrate web page information and let the user build a wrapper automatically with simple example marking. In other words, the information extraction system must derive extraction rules for the wrapper from the training pages. For these reasons, we develop a new signal-based method called the “histogram and boundary tag-based correlation coefficient.” The method discovers correlation features between the example marked by the user and the web page, and is used to implement the extraction system. The resulting wrapper has on-line learning capability and generates extraction rules that can cope with diverse web pages.
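
The abstract names the method but this record does not define it, so the following is only a rough sketch of the general idea under stated assumptions, not the thesis's algorithm. It assumes the page and the user-marked example are each reduced to a sequence of HTML tag tokens (the "signal"), that candidate windows must start and end with the same tags as the example (the "boundary tags"), and that similarity is the Pearson correlation between tag-count histograms. The names `tag_signal`, `pearson`, `find_records` and the threshold value are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

# Matches an opening or closing HTML tag and captures "/" (if any) and the tag name.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tag_signal(html):
    """Reduce HTML to a sequence of tag tokens such as 'li' and '/li' (text is ignored)."""
    return [m.group(1) + m.group(2).lower() for m in TAG_RE.finditer(html)]


def pearson(hist_a, hist_b, keys):
    """Pearson correlation between two tag histograms over a shared list of keys."""
    xs = [hist_a.get(k, 0) for k in keys]
    ys = [hist_b.get(k, 0) for k in keys]
    n = len(keys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a flat histogram
    return cov / sqrt(var_x * var_y)


def find_records(page_html, example_html, threshold=0.9):
    """Return start positions (in tag tokens) of windows that resemble the example.

    Assumption: a window is a candidate only if its first and last tags equal the
    example's boundary tags; it is accepted if its tag histogram correlates with
    the example's histogram at or above `threshold` (an illustrative value).
    """
    page = tag_signal(page_html)
    example = tag_signal(example_html)
    if not example:
        return []
    width = len(example)
    keys = sorted(set(page) | set(example))
    example_hist = Counter(example)
    hits = []
    for start in range(len(page) - width + 1):
        window = page[start:start + width]
        if window[0] != example[0] or window[-1] != example[-1]:
            continue  # boundary tags do not match
        if pearson(Counter(window), example_hist, keys) >= threshold:
            hits.append(start)
    return hits


page = ("<ul><li><b>A</b><i>1</i></li><li><b>B</b><i>2</i></li>"
        "<li><b>C</b><i>3</i></li></ul>")
example = "<li><b>A</b><i>1</i></li>"
print(find_records(page, example))  # [1, 7, 13]
```

In this toy page, one marked record is enough to locate all three records, which is the kind of saving in user marking the abstract describes. A fixed-width window is a simplification: real records vary in length.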

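Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1	Introduction
1.1	Research Background
1.2	Research Motivation
1.3	Research Objectives
1.4	Problem Analysis
1.5	Thesis Organization
Chapter 2	Related Work
2.1	The WIEN Extraction System
2.2	The STALKER Extraction System
2.3	The SoftMealy Extraction System
2.4	The Embley Extraction System
2.5	Conclusion
Chapter 3	System Architecture
3.1	Overall Architecture
3.2	Training Pages
3.2.1	Preprocessing
3.2.2	Histogram and Boundary Tag-Based Correlation Coefficient
3.2.3	Self-Calibrating Mechanism
3.2.4	Example-Adding Mechanism
3.3	Testing Pages
3.3.1	On-Line Learning Mechanism
3.4	Attribute Mapping
3.5	Single-Record Page Processing
Chapter 4	System Overview and Experimental Results
4.1	System Overview
4.2	Experimental Results
4.2.1	Multiple-Record Pages
4.2.2	Single-Record Pages
Chapter 5	Conclusions and Future Work
5.1	Conclusions
5.2	Future Research Directions
References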
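List of Figures
Figure 3.1	Overall architecture
Figure 3.2	Converting the original web page into a signal
Figure 3.3	Converting the marked example into a signal
Figure 3.4	Flowchart of the “histogram and boundary tag-based correlation coefficient”
Figure 3.5	Histogram statistics of the marked example
Figure 3.6	Histogram-based correlation coefficient between the information page and the example
Figure 3.7	Illustration of the “histogram and boundary tag-based correlation coefficient”
Figure 3.8	Google search results page
Figure 3.9	Signal representation of the original web page
Figure 3.10	Example marked by the user
Figure 3.11	Signal representation of the marked example
Figure 3.12	Tag order of the user-marked example
Figure 3.13	Data window taken from the original web page
Figure 3.14	Tags that the user may omit when marking an example
Figure 3.15	Flowchart of the on-line learning mechanism
Figure 3.16	One example marked by the user on the Springerlink website
Figure 3.17	Attributes of the user-marked example
Figure 3.18	Flowchart of single-record page processing
Figure 4.1	Parameter settings
Figure 4.2	Setting the parameters
Figure 4.3	xsd file
Figure 4.4	Attributes corresponding to the xsd file
Figure 4.5	Example-marking option
Figure 4.6	The user marks the first record as the example
Figure 4.7	Attribute-marking screen
Figure 4.8	Training-page option in web page extraction
Figure 4.9	All information found in the training page
Figure 4.10	Testing-page option
Figure 4.11	Testing-page selection screen
Figure 4.12	One record displayed on the Springerlink page
Figure 4.13	Two records from CiteSeer
Figure 4.14	Two records from the ACM website
Figure 4.15	One record from the ACM website
Figure 4.16	User-marked example on the Google website
Figure 4.17	Another type of record on the Google website
Figure 4.18	One record from the MSN website
Figure 4.19	Information on a single-record page from the Yahoo auction site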
List of Tables
Table 4.1	Extraction rates for multiple-record websites
Table 4.2	Extraction rates for single-record websites

                                

