
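Graduate student: 黃執強 (Chih-Chiang Huang)
Thesis title: 同性質網頁資料整合之自動化研究 (On-the-fly Data Integration of Homogeneous Web Data)
Advisor: 張嘉惠 (Chia-Hui Chang)
Oral examination committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science & Information Engineering
Academic year of graduation: 93 (ROC calendar)
Language: Chinese
Number of pages: 48
Keywords (Chinese): 資料整合 (data integration), 深網 (deep web)
Keywords (English): Data Integration, Deep Web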
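  • With the growth of the Internet and the spread of e-commerce, users routinely order services and goods online. To find the best deal, they often have to compare homogeneous data across several websites. The results these sites return are dynamically generated and voluminous, so users must analyze and compare the items of interest one by one, which takes considerable effort. A mechanism is therefore needed that integrates homogeneous data from Deep Web sites in the same domain and offers users a more convenient service. We observe that the returned data are insufficiently labeled with attribute names, yet the data themselves carry highly correlated information. This thesis exploits that correlation to develop an automatic data integration method in which attributes are matched without relying on attribute-name labels. Moreover, because sites in the same domain and of the same nature are built by different authors, the attributes used to describe each record differ from site to site: information described with n attributes on one site may be described with m attributes on another. The relationship between the attributes of two sites is therefore a group-to-group, many-to-many relationship, so attribute matching must produce many-to-many correspondences rather than only simple one-to-one correspondences. In short, we use the identical data returned by queries to different websites, together with the characteristics of those data, to develop an automatic data analysis and integration system with many-to-many matching. We test the integration on several domains, and the results show that our method achieves quite good performance.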
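
The idea of matching attribute groups through shared records, without attribute names, can be illustrated with a minimal sketch. It is not the thesis's algorithm, whose details this record does not give. It assumes that the two sites' records are already aligned (record i on both sites describes the same item), that attribute groups are runs of adjacent columns, and that two groups correspond when their concatenated, normalized values agree on most records. The names `normalize`, `attribute_groups`, `match_groups`, `max_size`, and `threshold` are hypothetical.

```python
def normalize(values):
    """Join the values of an attribute group; keep only lowercase letters and digits."""
    return "".join(ch for ch in "".join(values).lower() if ch.isalnum())


def attribute_groups(n_attrs, max_size):
    """All groups of adjacent attribute indices that are at most max_size wide."""
    return [tuple(range(start, start + size))
            for size in range(1, max_size + 1)
            for start in range(n_attrs - size + 1)]


def match_groups(site_a, site_b, max_size=2, threshold=0.8):
    """Score every pair of attribute groups by how often their joined values agree.

    site_a and site_b are lists of records (tuples of strings). Assumption:
    record i in site_a describes the same real-world item as record i in site_b.
    """
    groups_a = attribute_groups(len(site_a[0]), max_size)
    groups_b = attribute_groups(len(site_b[0]), max_size)
    matches = []
    for ga in groups_a:
        for gb in groups_b:
            agree = sum(
                normalize([ra[i] for i in ga]) == normalize([rb[j] for j in gb])
                for ra, rb in zip(site_a, site_b)
            )
            score = agree / len(site_a)
            if score >= threshold:
                matches.append((ga, gb, score))
    return matches


# Made-up data: site A splits the author name into two attributes, site B does not.
site_a = [("Jane", "Austen", "9.99"), ("Mark", "Twain", "12.50")]
site_b = [("Jane Austen", "$9.99"), ("Mark Twain", "$12.50")]
result = match_groups(site_a, site_b)
print(result)
```

On this toy input, the sketch pairs columns (0, 1) of site A with column 0 of site B, and column 2 of site A with column 1 of site B, using only the values. Widening `max_size` on both sides lets the same scoring cover n-to-m groups.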


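    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
        1.1 Research Background
        1.2 Research Motivation
        1.3 Research Results
    2. Related Work
        2.1 Schema Matching
            2.1.1 LSD
            2.1.2 Similarity Flooding
            2.1.3 DCM
        2.2 Category-Tree Mapping
    3. System Architecture
        3.1 Data Analyzer
            3.1.1 Data Type Identification
            3.1.2 Finding Identical Record Sequences
        3.2 Attribute Matcher
            3.2.1 A Worked Example of the Attribute Matcher
        3.3 Candidate Selector
            3.3.1 A Worked Example of the Candidate Selector
    4. Experiments
        4.1 Evaluation Method
        4.2 Experimental Setup and Results
            4.2.1 Matching Performance in Each Domain (Experiment 1)
            4.2.2 Effect of Data Size (Experiment 2)
            4.2.3 Effect of the Attribute-Count Parameter (Experiment 3)
    5. Conclusion
    References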

    1. A. Arasu and H. Garcia-Molina. Extracting Structured Data from Web Pages. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 337-348, 2003
    2. R. Agrawal and R. Srikant. On integrating catalogs. In Proceedings of the 10th International Conference on World Wide Web, pp. 603-612, 2001
    3. M. K. Bergman. The Deep Web: Surfacing Hidden Value. http://www.brightplanet.com/technology/deepweb.asp, July 2001
    4. S. Castano and V. D. Antonellis. A schema analysis and reconciliation tool environment for heterogeneous databases. In Proceedings of the 1999 International Symposium on Database Engineering & Applications, pp. 53-62, 1999
    5. C. E. H. Chua, R. H. L. Chiang, and E.-P. Lim. Instance-based attribute identification in database integration. The International Journal on Very Large Data Bases, Volume 12, Issue 3, pp. 228-243, 2003
    6. S. Chakrabarti, B. E. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the Link Structure of the World Wide Web. IEEE Computer, Volume 32, Number 8, pp. 60-67, 1999
    7. K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the web: Observations and implications. ACM SIGMOD Record, Volume 33, Issue 3, pp. 61-70, 2004
    8. K. C.-C. Chang, B. He, C. Li, and Z. Zhang. Structured databases on the web: Observations and implications. Technical Report UIUCDCS-R-2003-2321, Department of Computer Science, UIUC, 2003
    9. C.-H. Chang and S.-C. Kuo. OLERA: OnLine Extraction Rule Analysis for Semi-structured Documents. IEEE Intelligent Systems, Volume 19, Number 6, pp. 56-64, 2004
    10. C.-H. Chang and S.-C. Lui. IEPAD: information extraction based on pattern discovery. In Proceedings of the 10th International Conference on World Wide Web, pp. 681-688, 2001
    11. V. Crescenzi, G. Mecca, and P. Merialdo. ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites. In Proceedings of 27th International Conference on Very Large Data Bases, pp. 109-118, 2001
    12. A. Doan, P. Domingos, and A. Y. Halevy. Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pp. 509-520, 2001
    13. A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, A. Y. Halevy. Learning to match ontologies on the Semantic Web. The International Journal on Very Large Data Bases, Volume 12, Issue 4, pp. 303-319, 2003
    14. B. He, K. C.-C. Chang, and J. Han. Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data mining, pp. 148-157, 2004
    15. C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data. Information Systems, Volume 23, Issue 9, pp. 521-538, 1998
    16. F. Hakimpour and A. Geppert. Resolving Semantic Heterogeneity in Schema Integration: an Ontology Based Approach. In Proceedings of the International Conference on Formal Ontology in Information Systems - Volume 2001, pp. 297-308, 2001
    17. M. A. Hernández, R. J. Miller, and L. M. Haas. Clio: a semi-automatic tool for schema mapping. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, p. 607, 2001
    18. R. Ichise, H. Takeda and S. Honiden. Integrating Multiple Internet Directories by Instance-based Learning. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pp. 22-28, 2003
    19. N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper Induction for information extraction. In Proceedings of the 15th International Joint Conference on Artificial Intelligence, pp. 729-737, 1997
    20. J. Madhavan, P. A. Bernstein and E. Rahm. Generic Schema Matching with Cupid. In Proceedings of the 27th International Conference on Very Large Data Bases, pp. 49-58, 2001
    21. I. Muslea, S. Minton, and C. Knoblock. STALKER: learning extraction rules for semi-structured, Web-based information sources. In Proceedings of AAAI-98 Workshop on AI and Information Integration, pp. 74-81, 1998
    22. S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. In Proceedings of the International Conference on Data Engineering, pp. 117-128, 2002
    23. B. Magnini, L. Serafini, and M. Speranza. Linguistic based matching of local ontologies. In Proceedings of AAAI-02 workshop on Meaning Negotiation, 2002
    24. L. Page and S. Brin. The Anatomy of a Search Engine. The 7th International WWW Conference, 1998
    25. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The International Journal on Very Large Data Bases, Volume 10, Issue 4, pp. 334-350, 2001
    26. S. Sarawagi, S. Chakrabarti, and S. Godbole. Cross-training: learning probabilistic mappings between topics. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data mining, pp. 177-186, 2003
    27. J. Wang and F. H. Lochovsky. Data Extraction and Label Assignment for Web Databases. In Proceedings of the 12th International Conference on World Wide Web, pp. 187-196, 2003
    28. W. Wu, C. Yu, A. Doan, and W. Meng. An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 95-106, 2004
    29. Z. Zhang, B. He, and K. C.-C. Chang. Understanding web query interfaces: Best effort parsing with hidden syntax. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 107-118, 2004
    30. Z. Zhang, B. He, and K. C.-C. Chang. On-the-fly constraint mapping across web query interfaces. In Proceedings of the VLDB Workshop on Information Integration on the Web, 2004
    31. D. Zhang and W. S. Lee. Web taxonomy integration through co-bootstrapping. In Proceedings of the 27th annual International Conference on Research and Development in Information Retrieval, pp. 410-417, 2004
