| Graduate Student: | 蔡汶霖 Wen-Lin Tsai |
|---|---|
| Thesis Title: | Using Word Embedding Models to Improve the Performance of an RNN-Based Chinese Text Summarization System (以詞向量模型增進基於遞歸神經網路之中文文字摘要系統效能) |
| Advisor: | 林熙禎 Shi-Jen Lin |
| Committee Members: | |
| Degree: | Master (碩士) |
| Department: | College of Management, Department of Information Management (資訊管理學系) |
| Year of Publication: | 2018 |
| Graduating Academic Year: | 106 |
| Language: | Chinese |
| Pages: | 75 |
| Chinese Keywords: | 詞向量、詞嵌入、中文摘要、萃取式摘要、遞歸神經網路 |
| English Keywords: | word vector, word embedding, Chinese summarization, abstractive summarization, RNN |
In an era of information overload, it is difficult for people to absorb large amounts of information in a short time, and automatic summarization technology arose in response. In this study, an abstractive text summarization system based on a recurrent neural network (RNN) is established. Various pre-trained word embedding models, including word2vec, GloVe, and fastText, are used with the RNN model to improve the quality of the summarization system.
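As a sketch of how pre-trained vectors can seed an RNN's embedding layer (the vocabulary, vectors, and dimension below are toy assumptions for illustration, not the thesis's actual data): build a vocabulary-by-dimension matrix, copy in the pre-trained vector for each in-vocabulary word, and randomly initialize the rest.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Initialize an embedding matrix from pre-trained word vectors.

    Words found in `pretrained` get their vector copied in; out-of-vocabulary
    words get a small random vector. The resulting matrix would typically be
    loaded into the embedding layer of an RNN encoder or decoder.
    """
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for idx, word in enumerate(vocab):
        if word in pretrained:
            matrix[idx] = pretrained[word]
        else:
            matrix[idx] = rng.normal(scale=0.1, size=dim)
    return matrix

# Toy example: 4-word vocabulary, 3-dimensional vectors (hypothetical values).
vocab = ["新聞", "摘要", "系統", "效能"]
pretrained = {"新聞": np.array([0.1, 0.2, 0.3], dtype=np.float32),
              "摘要": np.array([0.4, 0.5, 0.6], dtype=np.float32)}
emb = build_embedding_matrix(vocab, pretrained, dim=3)
print(emb.shape)  # (4, 3)
```

Whether to keep such a matrix frozen or fine-tune it during training is a design choice the pre-training setup leaves open.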
In this study, we used two corpora to pre-train the word embedding models: a large-scale general corpus from Wikipedia and a corpus from the LCSTS dataset. In a series of experiments, we paired RNN models of different hidden-unit sizes with word embedding models of different dimensions, and found that pre-trained word embedding models do improve system performance. For the best results, combining word embeddings of moderate dimension with an RNN of larger hidden size is highly recommended.
The summarization system is also applied to Chinese articles, yielding an abstractive Chinese summarization system with high versatility and strong performance. Our system outperforms previous work by 30% on automatic evaluation metrics, and we also provide qualitative analyses, ranking sample summaries from best to worst for reference. Lastly, we test and verify the system's performance on real news articles from Taiwan.
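The abstract does not name the automatic metric; assuming a ROUGE-style unigram overlap (the standard for summarization evaluation), a minimal ROUGE-1 F-score can be computed as below. Treating summaries as pre-segmented word lists is a simplifying assumption.

```python
from collections import Counter

def rouge_1_f(candidate, reference):
    """ROUGE-1 F1: unigram overlap between a candidate and a reference summary."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # clipped per-word match counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy segmented summaries (hypothetical, not from the thesis's test set).
candidate = ["摘要", "系統", "效能", "提升"]
reference = ["摘要", "系統", "的", "效能", "顯著", "提升"]
print(round(rouge_1_f(candidate, reference), 3))  # 0.8
```

Precision and recall are computed against clipped counts, so repeating a word in the candidate cannot inflate the score beyond its count in the reference.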