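| Graduate student: | 黃慶杰 Ching-Jie Huang |
|---|---|
| Thesis title: | 以文件間差異為基礎並實作中文摘要 (Summarization Based on Inter-Document Differences, with an Implementation for Chinese Summarization) |
| Advisor: | 林熙禎 She-Jen Lin |
| Oral defense committee: | |
| Degree: | 碩士 Master |
| Department: | 管理學院 - 資訊管理學系 College of Management - Department of Information Management |
| Year of publication: | 2016 |
| Academic year of graduation: | 104 |
| Language: | Chinese |
| Number of pages: | 79 |
| Chinese keywords: | 文件間差異, 文句位置, 擷取式摘要, 多文件摘要, 中文摘要, 主題追蹤 |
| Foreign-language keywords: | Inter-document based, Sentence position, Extractive summarization, Multi-document summarization, Chinese summarization, Topic tracking |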
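This study proposes a summarization approach based on inter-document differences for multi-document summarization. Unlike approaches that implement multi-document summarization with a single architecture, it mitigates the problem of summary sentences coming from only a few sub-concept topics or a single one, and, when tracking a single topic, it avoids drawing summary sentences from related sentences in unrelated documents. Single- and multi-document summarization are implemented with an unsupervised, extractive, graph-based method whose semantic word network is built from the latest Wikipedia dataset. Building on single-document summarization, the method uses the sentence-position feature to select the first sentence from each document in turn. Processing the documents in different orders yields two kinds of concept summary, topic development and topic focus, which broadens the possible applications of document summarization. Experiments tested the effect of the new Wikipedia dataset used by the word network on summary quality and found that updating the dataset did not significantly affect the parameter values used in the study. Applied to the DUC 2002 English summarization task and compared with the other participants, the proposed method ranked above the middle for single-document summarization and in the middle for multi-document summarization. For Chinese summarization, the study used a news dataset from BBC Chinese (BBC中文網). Because a title is text that conveys a document's subject, the study treats it as the document's concept topic and examines topic tracking by computing the similarity between the concept topic and the query topic. In experiments on topic-focused and topic-developing news, the sentences of topic-focus summaries concentrated on the main topic, while the sentences of topic-development summaries effectively captured the sub-topic concepts across documents.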
This study proposes an approach to multi-document summarization based on inter-document differences, in contrast to a single-layer architecture. The method addresses two problems: summaries composed of sentences from only one or a few sub-concepts, and, during topic tracking, summaries that extract related sentences from unrelated documents. The system applies unsupervised, graph-based extractive summarization, with the semantic relationships between terms derived from the latest Wikipedia dataset. Multi-document summarization uses the sentence-position feature from basic feature-based summarization, choosing the first sentence of each single-document summary. Processing the documents in different sequences yields two kinds of concept summary: topic development and topic focus. Experiments showed that the new Wikipedia dataset did not significantly influence the parameters. On the DUC 2002 dataset, the proposed method ranked above the middle among participants for single-document summarization and in the middle for multi-document summarization. With BBC Chinese news, topic-focus summaries tended toward the primary concept, while topic-development summaries tended toward sub-concepts. Topic tracking was examined by calculating the similarity between document titles, since a title is the text that conveys a document's content. In the experiments, this approach effectively identified the related documents.
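The selection step described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the Wikipedia-based semantic word network is replaced here by plain Jaccard word overlap, sentences are ranked by weighted degree in the sentence graph, and the names `summarize_document` and `multi_document_summary`, the parameter `k`, and the sample sentences are all hypothetical. Whitespace tokenization is assumed, so Chinese text would first need a word segmenter.

```python
from typing import List, Set


def tokenize(sentence: str) -> Set[str]:
    # Assumption: whitespace tokenization (Chinese would need a segmenter).
    return set(sentence.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    # Stand-in similarity; the thesis uses a Wikipedia-based word network.
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def summarize_document(sentences: List[str], k: int = 2) -> List[str]:
    """Score each sentence by its summed similarity to the others
    (weighted degree in a sentence graph) and return the top k
    in their original document order."""
    tokens = [tokenize(s) for s in sentences]
    scores = [
        sum(jaccard(ti, tj) for j, tj in enumerate(tokens) if j != i)
        for i, ti in enumerate(tokens)
    ]
    # Ties are broken by position, favouring earlier sentences.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:k]
    return [sentences[i] for i in sorted(top)]


def multi_document_summary(
    documents: List[List[str]], order: List[int], k: int = 2
) -> List[str]:
    """Visit documents in `order` and take the first sentence of each
    single-document summary, skipping exact repeats."""
    summary: List[str] = []
    for idx in order:
        doc_summary = summarize_document(documents[idx], k)
        if doc_summary and doc_summary[0] not in summary:
            summary.append(doc_summary[0])
    return summary


# Hypothetical sample data: three short "documents".
docs = [
    ["the storm hit the coast", "the storm closed the coast roads", "residents left"],
    ["rescue teams reached the coast", "rescue teams found survivors", "weather improved"],
    ["hello there", "aid arrived by sea", "aid arrived by air"],
]

forward = multi_document_summary(docs, order=[0, 1, 2])
backward = multi_document_summary(docs, order=[2, 1, 0])
print(forward)
print(backward)
```

Changing `order` changes only the sequence in which documents contribute sentences. Which concrete orderings correspond to the topic-development and topic-focus summaries is not stated in the abstract, so the two orders above are purely illustrative.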