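Graduate student: 陳志銘 (Jhih-ming Chen)
Thesis title: Automatic Extraction of Blog Post from Diverse Blog Pages (基於多元化部落格網頁之自動化擷取部落格主要文章)
Advisor: 張嘉惠 (Chia-hui Chang)
Committee members:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Academic year of graduation: 99
Language: English
Pages: 41
Chinese keywords (translated): maximum scoring subsequence, sequence labeling, information retrieval, blog
English keywords: blog post extraction, sequence labeling, maximum subsequence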
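  • In recent years, blog-centered research such as opinion retrieval and sentiment analysis has flourished, which makes extracting the main post of a blog an indispensable step. In this thesis, we investigate how to extract the main post precisely and automatically from a wide variety of blog pages. Much previous work focused on extracting the main article from news pages, and applying it to blog pages does not work notably well, because blog pages vary widely in style and posts mix several content formats, which makes post extraction more complex. To address this problem, we combine and adapt the work of MSS [24] and CETR [34] and propose two blog post extraction methods. The first method, PTR Scoring, combines Post-to-Tag Ratio with Maximum Scoring Subsequence and is an unsupervised algorithm. The second method, CRF Scoring, uses the Conditional Random Fields probabilistic model and applies Maximum Scoring Subsequence to improve extraction accuracy. Experimental results show that CRF Scoring reaches an F-Measure of 91.9%, making it the most accurate extraction method in this thesis. The proposed methods can be applied to small-screen devices such as PDAs and mobile phones, can improve the performance of blog search engines, and can serve as a reference for subsequent related research.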


    With the rapid development of the blogosphere, blog post extraction has become an essential task for blogosphere research. However, very little attention has been given specifically to blog post extraction. In this thesis, we address the problem of extracting blog posts from diverse blog pages, aiming to locate each blog post automatically and precisely. Most previous research focused on extracting the main content from news pages, but the problem becomes more complex for blog pages, since a blog post may use several content formats at once and miscellaneous information can reduce extraction accuracy. Our research combines MSS [24] and CETR [34] to develop algorithms suited to blog pages. The first method we propose is PTR Scoring, which combines Post-to-Tag Ratio with maximum scoring subsequence. The second method is CRF Scoring, which trains Conditional Random Field models and uses maximum scoring subsequence to improve extraction accuracy. The experimental results show that CRF Scoring achieves the best F-Measure, 91.9%, among the methods compared.
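Both proposed methods rely on a maximum scoring subsequence step. The following is a minimal sketch of that general idea only, not the thesis's implementation: it assumes each line of a page has already been given a score (positive when the line looks like post content, negative when it looks like boilerplate) and returns the contiguous run of lines with the highest total. The function name `max_scoring_subsequence` and the example scores are hypothetical; how the scores are actually derived from Post-to-Tag Ratio or CRF output is not specified in this record.

```python
from typing import List, Tuple


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the highest sum.

    `end` is exclusive. If no run has a positive total, the empty run
    (0, 0, 0.0) is returned.
    """
    best_start, best_end, best_total = 0, 0, 0.0
    run_start, run_total = 0, 0.0
    for i, score in enumerate(scores):
        # A run whose total has dropped to zero or below cannot help
        # any later run, so start a fresh run at this position.
        if run_total <= 0:
            run_start, run_total = i, 0.0
        run_total += score
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total
    return best_start, best_end, best_total


# Hypothetical per-line scores for an eight-line page (assumed values).
line_scores = [-1.0, -0.5, 2.0, 1.5, -0.5, 3.0, -2.0, -1.0]
start, end, total = max_scoring_subsequence(line_scores)
print(start, end, total)
```

With these assumed scores, lines 2 through 5 form the selected run: the single mildly negative line inside it is kept because the lines on either side outweigh it, which is the behavior that makes this step tolerant of short noisy stretches inside a post.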

    Chinese Abstract
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
    2. Related Work
      2.1 Content Extraction
      2.2 Application
    3. Our Proposed Method
      3.1 Unsupervised Blog Post Extraction with PTR Scoring
        3.1.1 Post-to-Tag Ratio
        3.1.2 Smoothing Function
        3.1.3 Maximum Scoring Subsequence
      3.2 Supervised Blog Post Extraction with CRF Scoring
        3.2.1 Feature Extraction
        3.2.2 Conditional Random Field
        3.2.3 Applying Maximum Scoring Subsequence
    4. Experiments
      4.1 Experimental Setup
      4.2 Performance Study on Unsupervised Blog Post Extraction
      4.3 Performance Study on Supervised Blog Post Extraction
      4.4 Discussion
    5. Conclusion & Future Work
    References

    [1] L. Bing, Y. Wang, Y. Zhang and H. Wang. “Primary Content Extraction with Mountain Model”, CIT, IEEE, 2008, pp. 479–484.
    [2] D. Cai, S. Yu, J. R. Wen and W. Y. Ma. “VIPS: a Vision-based Page Segmentation Algorithm”, Microsoft Technical Report, MSR-TR-2003-79, 2003.
    [3] D. Cao, X. Liao and S. Bai. “Blog Post and Comment Extraction Using Information Quantity of Web Format”, AIRS, ACM, 2008, pp. 298–309.
    [4] S. Debnath, P. Mitra, and C. L. Giles. “Automatic extraction of informative blocks from webpages”, SAC, ACM, 2005, pp. 1722–1726.
    [5] S. Debnath, P. Mitra, and C. L. Giles. “Identifying content blocks from web documents”, ISMIS, 2005, pp. 285–293.
    [6] E. Elgersma and M. de Rijke. “Learning to Recognize Blogs: A Preliminary Exploration”, ECAL Workshop, 2006.
    [7] A. Finn, N. Kushmerick, and B. Smyth. “Fact or fiction: Content classification for digital libraries”, DELOS Workshop, 2001.
    [8] J. Gibson, B. Wellner and S. Lubar. “Adaptive Web-page Content Identification”, WIDM, ACM, 2007, pp. 105–112.
    [9] T. Gottron. “Evaluating content extraction on html documents”, ITA, 2007, pp. 123–132.
    [10] T. Gottron. “Combining content extraction heuristics: the combine system”, iiWAS, ACM, 2008, pp. 591–595.
    [11] T. Gottron. “Content code blurring: A new approach to content extraction”, DEXA, IEEE, 2008, pp. 29–33.
    [12] Y. Guo, H. Tang, L. Song, Y. Wang and G. Ding. “ECON: An Approach to Extract Content from Web News Page”, APWEB, IEEE, 2010, pp. 314–320.
    [13] S. Gupta, G. E. Kaiser, P. Grimm, M. F. Chiang, and J. Starren. “Automating content extraction of html documents”, WWW, ACM, 2005, pp. 179–224.
    [14] S. Gupta, G. E. Kaiser, D. Neistadt, and P. Grimm. “Dom-based content extraction of html documents”, WWW, ACM, 2003, pp. 207–214.
    [15] S. Gupta, G. E. Kaiser, and S. J. Stolfo. “Extracting context to improve accuracy for html content extraction”, WWW, ACM, 2005, pp. 1114–1115.
    [16] W. Han, D. Buttler, and C. Pu. “Wrapping web data into xml”, SIGMOD, ACM, 2001, pp. 33–38.
    [17] P. Kolari, A. Java, T. Finin, T. Oates and A. Joshi. “Detecting Spam Blogs: A Machine Learning Approach”, AAAI, ACM, 2006, pp. 1351–1356.
    [18] J. Lafferty, A. McCallum, and F. Pereira. “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data”, ICML, ACM, 2001, pp. 282–289.
    [19] J. Liu, L. Birnbaum and B. Pardo. “Categorizing Blogger’s Interests Based on Short Snippets of Blog Posts”, CIKM, ACM, 2008, pp. 1525–1526.
    [20] C. Mantratzis, M. A. Orgun, and S. Cassidy. “Separating XHTML content from navigation clutter using DOM-structure block analysis”, Hypertext, ACM, 2005, pp. 145–147.
    [21] M. Marek, P. Pecina and M. Spousta. “Web Page Cleaning with Conditional Random Fields”, WWW, vol. 5, 2007, pp. 1–8.
    [22] G. Mishne and M. de Rijke. “Deriving Wishlists from Blogs”, WWW, ACM, 2006, pp. 925–926.
    [23] I. Ounis, M. de Rijke, C. Macdonald, G. Mishne, and I. Soboroff. “Overview of the TREC-2006 Blog Track”, TREC, 2006.
    [24] J. Pasternack and D. Roth. “Extracting article text from the web with maximum subsequence segmentation”, WWW, ACM, 2009, pp. 971–980.
    [25] D. Pinto, M. Branstein, R. Coleman, W. B. Croft, M. King, W. Li, and X. Wei. “Quasm: a system for question answering using semi-structured data”, JCDL, ACM, 2002, pp. 46–55.
    [26] M. F. Porter. “An algorithm for suffix stripping”, Program, vol. 14, no. 3, 1980, pp. 130–137.
    [27] A. F. R. Rahman, H. Alam and R. Hartono. “Content Extraction from HTML Documents”, WDA, 2001, pp. 7–10.
    [28] W. L. Ruzzo and M. Tompa. “A Linear Time Algorithm for Finding All Maximal Scoring Subsequences”, AAAI Press, ACM, 1999, pp. 234–241.
    [29] L. Song, X. Cheng, Y. Guo, B. Wu and Y. Wang. “Blog Post Extraction Using Title Finding”, Chinese Academy of Sciences, 2009.
    [30] R. Song, H. Liu, J. R. Wen, and W. Y. Ma. “Learning Important Models for Web Page Blocks based on Layout and Content Analysis”, SIGKDD, ACM, 2004, pp. 14–23.
    [31] H. M. Wallach. “Efficient Training of Conditional Random Fields”, CLUK Research Colloquium, University of Edinburgh, 2002.
    [32] H. M. Wallach. “Conditional Random Fields: An Introduction”, Technical Report MS-CIS-04-21, Univ. of Pennsylvania, 2004.
    [33] T. Weninger and W. H. Hsu. “Text Extraction from the Web via Text-to-Tag Ratio”, iiWAS, ACM, 2008, pp. 23–28.
    [34] T. Weninger, W. H. Hsu and J. Han. “CETR – Content Extraction via Tag Ratios”, WWW, ACM, 2010, pp. 971–980.
    [35] L. Yang, C. Li and M. Gu. “Extracting Content from Web Pages Using the Sliding Window”, CSA, IEEE, 2009, pp. 1–6.
    [36] P. H. Yang and C. H. Chang. “Automatic Labeling for Blog Post Extraction”, NCS, Taiwan, 2009.
