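Author: 施宗昆 (Tsung-Kun Shih)
Thesis title: Focused Crawling for Information Gathering Using Hidden Markov Model (使用隱藏式馬可夫模型之特定網頁資訊抓取蒐集)
Advisor: 張嘉惠 (Chia-Hui Chang)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Academic year of graduation: 96 (ROC calendar)
Language: English
Number of pages: 41
Keywords (translated from Chinese): Markov chain, information gathering
Keywords (English): HMM, Information Gathering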
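  • Abstract (translated from the Chinese): Searching for information is the dominant activity on the Web today. Although current search engines are already quite usable, they still have shortcomings that need improvement. Many information needs are difficult to satisfy correctly with keyword-based queries. In this thesis, we therefore build a Hidden Markov Model to predict the most likely path of Web pages and thereby gather specific information. The experimental results also show that our system mitigates some of the shortcomings that search engines face.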


    Information search is the key activity for many users on the Web. Although today's search engines are useful and powerful, they still have many drawbacks. Moreover, many information needs are hard to express as keyword-based queries. In this paper, we address composite information needs by building a Hidden Markov Model (HMM) that predicts the most likely path to the target information. We use the concept of focused crawling to trace through a Web site for specific information. The experiments show good results for two tasks: finding admission information and finding accepted papers.
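
    The idea of predicting the most likely path with an HMM can be illustrated with a minimal sketch. This record does not describe the thesis's actual states, observations, or parameters, so everything below is an assumption made for illustration: the hidden states ("home", "department", "admission") stand for page types, the observation symbols ("nav", "academic", "apply") stand for categories of link text, and all probabilities are invented. The decoding step uses the standard Viterbi algorithm, which may differ from the procedure used in the thesis.

```python
from typing import Dict, List, Tuple

Prob = Dict[str, float]
Table = Dict[str, Dict[str, float]]


def viterbi(observations: List[str], states: List[str],
            start_p: Prob, trans_p: Table, emit_p: Table) -> Tuple[List[str], float]:
    """Return the most likely hidden-state path and its joint probability."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
            row[s] = (prob, prev)
        best.append(row)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, prob


# Hypothetical model: page types as hidden states, link-text categories as
# observations. None of these values come from the thesis.
states = ["home", "department", "admission"]
start_p = {"home": 0.8, "department": 0.15, "admission": 0.05}
trans_p = {
    "home": {"home": 0.1, "department": 0.7, "admission": 0.2},
    "department": {"home": 0.1, "department": 0.3, "admission": 0.6},
    "admission": {"home": 0.1, "department": 0.2, "admission": 0.7},
}
emit_p = {
    "home": {"nav": 0.7, "academic": 0.2, "apply": 0.1},
    "department": {"nav": 0.2, "academic": 0.6, "apply": 0.2},
    "admission": {"nav": 0.1, "academic": 0.2, "apply": 0.7},
}

path, prob = viterbi(["nav", "academic", "apply"], states, start_p, trans_p, emit_p)
print(path, prob)
```

    In a focused crawler built along these lines, the decoded path could serve to rank which link to follow next; how the thesis actually integrates the model into crawling is not stated in this record.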

    1. INTRODUCTIONS
    2. RELATED WORK
        2.1 GENERAL TOPIC
        2.2 FOCUSED TOPIC
        2.3 DEEP WEB
    3. SYSTEM OVERVIEW
        3.1 Hidden Markov Model Construction
            3.1.1 Collecting User Browsing Sequence
            3.1.2 Concept Graph Construction
            3.1.3 The Construction of Hidden Markov Model
        3.2 EXECUTION
    4. EXPERIMENTS
    5. CONCLUSIONS
    6. REFERENCE

