

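Graduate student: 黃嘉偉 (Jia-Wei Huang)
Thesis title: 以文句網路分群架構萃取多文件摘要 (Extracting Multi-Document Summaries with a Sentence-Network Clustering Framework)
Advisor: 林熙禎
Oral defense committee:
Degree: Master
Department: College of Management - Department of Information Management
Publication year: 2014
Graduation academic year: 102 (ROC calendar)
Language: Chinese
Pages: 85
Keywords: Text mining (文字探勘), Graph-based network (圖形網路), Clustering method (分群方法), Multi-document summarization (多文件摘要)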
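  • In recent years, rapid advances in information technology have sharply increased the number of electronic documents. To spare readers from spending excessive time absorbing a document's meaning, a summary built from the important sentences extracted from the documents can help them grasp the content quickly. Traditional extractive summarization methods, however, judge a sentence only by whether it contains important terms and lack higher-level concepts such as topics; moreover, the extracted sentences do not give a comprehensive account of the news event as a whole. This study extracts multi-document summaries with a graph-based summarization method, one of the indicator representation approaches, which split documents into smaller fragments; this study uses sentences as the fragments. After a graph-based relationship network is built from these fragments, clustering and several link analysis methods are used to score the nodes, the cluster weights are factored into the scores, and the selected nodes are used to produce the summary.
    The experiments use the DUC 2002 and TAC 2010 datasets to test system performance and ROUGE to measure summary quality. The results show that the proposed multi-document summarization method achieves reasonable quality across different summarization tasks. On DUC 2002, the ROUGE-1 scores of the 50-word and 100-word multi-document summaries reach 0.2996 and 0.3412, respectively, comparable to the participants in that year's conference, while the 200-word multi-document summary reaches a ROUGE-1 score of 0.4559, a mid-range performance. On Part 1 of the TAC 2010 Guided Summarization task, the ROUGE-1 score reaches 0.3513, surpassing all of that year's participants, and the ROUGE-2 score reaches 0.0707, also a mid-range performance.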
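The pipeline described in the abstract above (sentences as nodes, a similarity network, link analysis scores) can be illustrated with a minimal sketch. This is not the thesis implementation: the choice of TF-ISF weights with cosine similarity, the edge threshold of 0.1, the damping factor of 0.85, and all function names are assumptions made for illustration. PageRank stands in for the several link analysis methods the thesis combines, and the clustering and cluster-weight steps are omitted.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """Turn tokenized sentences into sparse TF-ISF weight dictionaries."""
    n = len(sentences)
    # Sentence frequency: number of sentences in which each term appears.
    sf = Counter(term for sent in sentences for term in set(sent))
    vectors = []
    for sent in sentences:
        tf = Counter(sent)
        vectors.append({t: tf[t] * math.log(n / sf[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 for a zero vector)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_graph(vectors, threshold=0.1):
    """Symmetric weight matrix; an edge exists when similarity >= threshold.

    The threshold value is an illustrative assumption.
    """
    n = len(vectors)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                weights[i][j] = weights[j][i] = sim
    return weights


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration.

    Isolated nodes only receive the teleport share, so the scores are
    meant for ranking and are not renormalized to sum to 1.
    """
    n = len(weights)
    scores = [1.0 / n] * n
    strength = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                scores[j] * weights[j][i] / strength[j]
                for j in range(n)
                if strength[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores
```

Calling `pagerank(build_graph(tf_isf_vectors(sentences)))` on a list of tokenized sentences returns one score per sentence; a summary would then be assembled from the highest-scoring sentences up to the word limit of the task.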


    Information technology has developed rapidly in recent years, and the number of electronic documents has grown with it. To keep readers from spending too much time working through the content of an article, it is useful to extract the important sentences and assemble them into a summary that can be understood quickly. However, traditional extraction methods only judge whether a sentence contains important terms and do not use the concept of a topic. In addition, they do not give a comprehensive account of the whole news event. This study uses a graph-based summarization method to extract multi-document summaries. The method is one of the indicator representation approaches, which divide a document into smaller fragments; this study uses sentences as the fragments. After building a graph-based network from these fragments, the study applies clustering and several link analysis methods to score the nodes. It then takes the cluster weights into account in the scores and uses the selected sentence nodes to produce the summary.
    The experiments use the DUC 2002 and TAC 2010 datasets and evaluate summary quality with ROUGE. The results show that the method reaches a good level of quality across the tasks. The ROUGE-1 scores for the DUC 2002 50-word and 100-word summaries reach 0.2996 and 0.3412, close to those of the peers in DUC 2002. The ROUGE-1 score for the first part of TAC 2010 Guided Summarization reaches 0.3513, higher than those of the other peers. Finally, the ROUGE-2 score on the same task reaches 0.0707, which indicates medium quality.
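To make the reported ROUGE-1 and ROUGE-2 figures easier to interpret, the following is a minimal sketch of recall-oriented ROUGE-N for one candidate summary against one reference. It is a simplification: the record does not state which ROUGE variant or options were used, and the function names here are hypothetical. The official toolkit additionally supports stemming, stop-word removal, and multiple references.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also occur in the candidate.

    Matches are clipped, so an n-gram counts at most as often as it
    appears in the candidate.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

Under this definition a ROUGE-1 score of 0.3513 would mean that roughly 35% of the unigrams in the reference summaries are covered by the system summary, and ROUGE-2 applies the same idea to bigrams, which is why its values are much lower.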

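    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    1. Introduction
        1-1 Research Background
        1-2 Research Motivation
        1-3 Research Objectives
        1-4 Thesis Organization
    2. Literature Review
        2-1 Automatic Document Summarization
        2-2 Guided Summarization
        2-3 Differences Between Related Work and This Study
        2-4 Feature Analysis Methods
            2-4-1 1-gram filtering
            2-4-2 Relevance Between Document Content and Title
            2-4-3 Term Frequency-Inverse Sentence Frequency
            2-4-4 Studies on Sentence Length
        2-5 Vector Similarity Measures
        2-6 Betweenness-Based Clustering
        2-7 Link Analysis Methods
            2-7-1 Degree
            2-7-2 Strength
            2-7-3 K-Core
            2-7-4 PageRank
            2-7-5 Locality Index
        2-8 Borda Count
    3. Research Method and System Workflow
        3-1 System Workflow
        3-2 Document Preprocessing
            3-2-1 1-gram filtering
            3-2-2 Keyword Relevance
            3-2-3 Sentence-to-Vector Conversion
            3-2-4 Sentence Filtering
        3-3 Sentence Scoring
            3-3-1 Building the Sentence Relationship Network
            3-3-2 Sentence Clustering and Cluster Scoring
            3-3-3 Sentence Node Scoring
        3-4 Sentence Selection
    4. Experimental Design and Discussion of Results
        4-1 Datasets and Experimental Setup
            4-1-1 DUC and TAC
            4-1-2 Datasets Used
            4-1-3 Experimental Environment
            4-1-4 Input Documents
        4-2 Summary Evaluation Criteria
        4-3 Experimental Procedure
        4-4 Experimental Data and Discussion
            4-4-1 Experiment 1: Thresholds and Selection for Individual Link Analysis Methods
            4-4-2 Experiment 2: Thresholds for Combined Link Analysis Methods
            4-4-3 Experiment 3: Implementing Part 1 of Guided Summarization
            4-4-4 Experiment 4: System Performance Comparison
    5. Conclusions and Future Work
        5-1 Conclusions
        5-2 Future Research Directions
    References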
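The outline lists the Borda count (section 2-8) alongside several link analysis methods (section 2-7). One plausible reading, offered here only as an assumption since the record does not describe the details, is that the rankings produced by the individual methods are merged by Borda voting. A minimal sketch with a hypothetical function name:

```python
def borda_count(rankings):
    """Merge several best-first rankings of the same items.

    In a ranking of n items, the item in position p (0-based) earns
    n - 1 - p points. Items are returned by descending total points,
    with ties broken by item id.
    """
    points = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            points[item] = points.get(item, 0) + (n - 1 - position)
    return sorted(points, key=lambda item: (-points[item], item))
```

For example, each inner list could hold sentence indices ordered by one method such as Degree, Strength, or PageRank; the merged order then favors sentences that rank well under most methods rather than under a single one.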

