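Graduate student: 張昇暉 (Sheng-Hui Chang)
Thesis title: 中文文件串流之摘要擷取研究 (A Study of Summary Extraction from Chinese Document Streams)
Advisor: 林熙禎 (She-Jen Lin)
Oral examination committee:
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: Chinese
Number of pages: 59
Chinese keywords: dynamic summarization, extractive summarization, single-document summarization, multi-document summarization, Chinese summarization
English keywords: Dynamic Summarization, General Summarization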
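  • With the rapid growth of news media, news is produced as a continuous stream of documents. Earlier work used an NGD-based approach to find topic keywords highly related to the title keywords, but because that step queries the Solr full-text search system, it takes a considerable amount of time. The unsupervised graph-based summarization method used alongside it also built sentence networks that fell short of expectations, so summary quality still had room for improvement. Techniques developed for English automatic summarization were applied directly to Chinese automatic summarization, yet their quality and efficiency were not as good as expected. This study strengthens Chinese word segmentation by adding Chinese part-of-speech identification, uses TextRank-based keyword extraction and link analysis, and takes sentence position features into account. As a result, single-document summarization achieves better quality and runs considerably faster. Building on the single-document method, the study combines a waterfall architecture with sentence clustering to perform dynamic multi-document summarization, which both produces summaries that evolve over time and filters redundant information across documents.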


    With the rapid development of news media, news is produced as a stream of documents. Previous news summarization work used an NGD-based method to find topic keywords highly correlated with the keywords in the title. However, because that step queries the Solr full-text search system, it takes a long time. The unsupervised graph-based summarization method also leaves room for improvement in quality, since the sentence network it builds is not as good as expected. Moreover, when techniques developed for English summarization are applied directly to Chinese summarization, quality and efficiency are likewise not as good as expected.
    In this study, I improve Chinese word segmentation by adding Chinese part-of-speech recognition. In addition, I adopt TextRank-based keyword extraction and link analysis, and take sentence position into account. As a result, single-document summarization improves in both quality and speed.
    Finally, building on the single-document summarization method, I combine sentence clustering with a waterfall architecture to produce dynamic multi-document summaries. The method produces summaries that evolve over time and filters redundant information across documents.
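
The TextRank-based keyword extraction named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes the text has already been segmented into tokens (the thesis does this with a Chinese word segmenter plus part-of-speech filtering), and the function name `textrank_keywords`, the co-occurrence `window`, and the `damping` value of 0.85 are illustrative choices. It scores words with unweighted PageRank over an undirected co-occurrence graph.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by PageRank over an undirected co-occurrence graph.

    All parameter names and defaults here are assumptions for illustration.
    """
    # Link each word to the distinct words among the next `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window + 1]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Synchronous power iteration: each word receives an equal share of
    # the current score of every neighbor.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

A word that co-occurs with many different words ends up with the highest score, which is the intuition behind using link analysis for keyword selection. Sentence scoring could then combine such keyword scores with a position feature, though the exact combination used in the thesis is not given in this record.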

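    Abstract (Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
        1-1 Research Background
        1-2 Research Motivation
        1-3 Research Objectives
    2. Related Work
        2-1 Automatic Document Summarization
        2-2 From Single-Document to Multi-Document Summarization
        2-3 Sentence-Feature Summarization Methods
            2-3-1 Keyword Extraction
            2-3-2 Sentence Position Features
        2-4 Graph-Based Summarization Methods
        2-5 Sentence Similarity
            2-5-1 Jaccard Coefficient
            2-5-2 Cosine Similarity
            2-5-3 BM25
        2-6 Link Analysis Methods
            2-6-1 Degree
            2-6-2 Strength
            2-6-3 K-core
            2-6-4 Locality Index
            2-6-5 PageRank
        2-7 K-medoids Clustering
        2-8 Chinese Word Segmentation
    3. System Architecture
        3-1 System Concept and Workflow
        3-2 Single-Document Summarization
            3-2-1 Preprocessing
            3-2-2 Building the Word Network
            3-2-3 Building the Sentence Network
            3-2-4 Sentence Scoring
        3-3 Multi-Document Summarization
            3-3-1 Waterfall Architecture
            3-3-2 Sentence Length
            3-3-3 Sentence Clustering
            3-3-4 Sentence Scoring
    4. Experimental Design and Results
        4-1 Datasets
            4-1-1 Single-Document Summarization Dataset
            4-1-2 Multi-Document Summarization Dataset
        4-2 Summary Evaluation Criteria
            4-2-1 Single-Document Summarization
            4-2-2 Multi-Document Summarization
        4-3 Experimental Results and Discussion
            4-3-1 Single-Document Summarization
            4-3-2 Multi-Document Experiments
    5. Conclusions and Future Work
        5-1 Conclusions
        5-2 Future Research Directions
        5-3 Research Limitations
    References
        Chinese-Language Sources
        English-Language Sources
        Online Resources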

