Retrieval-based Question-Answering System based on Generated Dataset and Further Pretraining

Author:	馮智詮 (Zhi-Quan Feng)
Thesis title:	Retrieval-based Question-Answering System based on Generated Dataset and Further Pretraining (基於生成資料集和進一步預訓練之百科問答系統)
Advisor:	王家慶 (Jia-Ching Wang)
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Publication year:	2023
Academic year of graduation:	111 (ROC calendar)
Language:	Chinese
Number of pages:	55
Keywords:	Deep Learning, Natural Language Processing, Document Retrieval, Machine Reading Comprehension, Question Answering System

In recent years, natural language processing has advanced rapidly. A range of pretraining algorithms for Transformer-based [1] neural language models has been developed, together with accompanying datasets and strong results, from early models such as BERT [2] and RoBERTa [3] to later ones such as DPR [4]. In retrieval-based open-domain question answering, these models serve as dual-tower (DSSM) document retrievers and as readers for the downstream reading-comprehension task. Such systems have seen many implementations and improvements over the past few years. In Chinese question answering, however, training both the dual-tower model and the reader is hampered by the lack of a large open dataset that closely matches the retrieval task, comparable to the English PAQ [5] dataset. This study therefore uses a generative model to build a large-scale passage-question dataset from an open-source Chinese news pretraining corpus, and uses that dataset to strengthen both the system's document retrieval and the model's reading comprehension. The system consists of three main parts.
The first part is data collection. The mT5 [6] pretrained model is used to generate the required dataset, QNews, and the generated data is then cleaned to keep only reasonable questions and passages of suitable length.
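The cleaning step can be pictured as a simple filter over generated pairs. The sketch below is illustrative only: the function names `keep_pair` and `clean`, the character-length bounds, and the question-mark heuristic are all assumptions, since the abstract does not state the actual criteria.

```python
def keep_pair(passage: str, question: str,
              min_len: int = 50, max_len: int = 512) -> bool:
    """Return True if a generated (passage, question) pair passes the filters.

    The bounds and the heuristic below are hypothetical, not taken from the thesis.
    """
    # Passage length is measured in characters, a natural unit for Chinese text.
    if not (min_len <= len(passage) <= max_len):
        return False
    q = question.strip()
    # Assumed heuristic: a plausible question is not too short and ends
    # with a full-width or ASCII question mark.
    return len(q) >= 4 and q.endswith(("？", "?"))


def clean(pairs):
    """Keep only the (passage, question) tuples that pass keep_pair."""
    return [(p, q) for p, q in pairs if keep_pair(p, q)]
```

In practice the thresholds would be tuned to the length distribution of the corpus and the input limit of the downstream models.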
The second part uses the passage-question pairs in QNews for domain-matched retrieval pretraining of the dual-tower (DSSM) model, improving its retrieval performance.
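One common objective for this kind of retrieval pretraining is the in-batch-negative contrastive loss used by DPR [4]. Whether the thesis uses exactly this loss is not stated in the abstract, so the following is a minimal sketch under that assumption. Embeddings are plain Python lists standing in for encoder outputs, and `in_batch_contrastive_loss` is a hypothetical name.

```python
import math


def dot(u, v):
    """Inner-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(question_vecs, passage_vecs):
    """Mean negative log-likelihood of the matching passage for each question.

    question_vecs[i] and passage_vecs[i] are the embeddings of a matching pair;
    every other passage in the batch acts as a negative for question i.
    """
    total = 0.0
    for i, q in enumerate(question_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(question_vecs)
```

The loss is small when each question scores its own passage above the other passages in the batch, which is the behaviour the retriever needs at search time.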
The third part further pretrains the reader on a length-sampled subset of QNews, with a constraint that keeps the change in model parameters within a limited range.
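The abstract does not say which constraint is used. As one plausible illustration only, the sketch below adds an L2 penalty that pulls parameters back toward their starting (pretrained) values. The function names and the `strength` value are hypothetical, and the thesis may use a different constraint.

```python
def l2_to_start_penalty(params, start_params, strength=0.01):
    """Penalty = strength * sum((w - w0)^2) over all parameters.

    params and start_params are flat lists of floats of equal length;
    start_params holds the values before further pretraining began.
    """
    return strength * sum((w - w0) ** 2 for w, w0 in zip(params, start_params))


def regularized_loss(task_loss, params, start_params, strength=0.01):
    """Task loss plus the penalty that discourages drifting from the start."""
    return task_loss + l2_to_start_penalty(params, start_params, strength)
```

A larger `strength` keeps the model closer to its pretrained weights, at the cost of adapting less to the new data.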
Through these three steps, this study aims to reduce, to some extent, the mismatch in data format between the dual-tower model's pretraining task and its downstream task in conventional retrieval-based open-domain encyclopedic question answering systems, and to improve the performance of neural language models on the downstream reading-comprehension task.

Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1 Background
2 Research Motivation and Objectives
3 Research Method and Chapter Overview
Chapter 2 Related Work and Literature Review
1 Term Frequency-Inverse Document Frequency (TF-IDF) and Best Match 25 (BM25) Algorithms
2 N-gram Language Model
3 Pretrained Neural Language Models
3.1 Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM)
3.2 ELMo
3.3 Transformer
3.3.1 Attention and Self-Attention Mechanisms
3.3.2 Multi-head Attention Mechanism
3.3.3 Time Complexity Comparison
3.3.4 Pretraining and Finetuning
3.4 BERT
3.5 T5 and mT5
3.6 RoBERTa
3.7 XLNet
3.8 MacBERT, PERT, LERT
3.9 Generative Pretrained Transformer (GPT)
4 Large-Scale Document Retrieval and Contrastive Learning
4.1 Dual-Tower Retriever (Deep Structured Semantic Model, DSSM)
4.2 Dense Passage Retrieval (DPR)
4.3 Inverse Cloze Task (ICT) and Body First Selection (BFS)
4.4 DPR-PAQ
Chapter 3 Open-Domain Question Answering System based on Generated Dataset and Further Pretraining
1 LERT-based Chinese Question Answering Model
2 Scored Search based on the Dual-Tower Model
3 Construction of the QNews Dataset
4 Question-Context Matching Pretraining of the Dual-Tower Model
5 Question-Context Matching Further Pretraining of the Chinese Question Answering Model
6 Open-Domain Question Answering System based on Generated Dataset and Further Pretraining
Chapter 4 Experimental Results and Discussion
1 Experimental Equipment
2 Datasets
3 Experimental Results and Discussion
3.1 Comparison of Training Algorithms for the Dual-Tower Model
3.2 Further Pretraining of the Reader
3.3 Training with Length-Based Sampling of QNews
3.4 Comparison of Hidden-Feature Similarity Computation in the Language Model and the Effect of Its Parameters
3.5 Open-Domain Question Answering System based on Generated Dataset and Further Pretraining
Chapter 5 Conclusion and Future Directions
References

[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[2] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[3] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
[4] Karpukhin, Vladimir, et al. "Dense passage retrieval for open-domain question answering." arXiv preprint arXiv:2004.04906 (2020).
[5] Lewis, Patrick, et al. "PAQ: 65 million probably-asked questions and what you can do with them." Transactions of the Association for Computational Linguistics 9 (2021): 1098-1115.
[6] Xue, Linting, et al. "mT5: A massively multilingual pre-trained text-to-text transformer." arXiv preprint arXiv:2010.11934 (2020).
[7] Oğuz, Barlas, et al. "Domain-matched pre-training tasks for dense retrieval." arXiv preprint arXiv:2107.13602 (2021).
[8] Martineau, Justin, and Tim Finin. "Delta TFIDF: An improved feature space for sentiment analysis." Proceedings of the International AAAI Conference on Web and Social Media. Vol. 3. No. 1. 2009.
[9] Practical BM25 - Part 2: The BM25 Algorithm and its Variables, https://www.elastic.co/cn/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
[10] Lewis, Mike, et al. "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
[11] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
[12] Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.
[13] Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
[14] Cui, Yiming, et al. "Revisiting pre-trained models for Chinese natural language processing." arXiv preprint arXiv:2004.13922 (2020).
[15] Cui, Yiming, Ziqing Yang, and Ting Liu. "PERT: pre-training BERT with permuted language model." arXiv preprint arXiv:2203.06906 (2022).
[16] Cui, Yiming, et al. "LERT: A Linguistically-motivated Pre-trained Language Model." arXiv preprint arXiv:2211.05344 (2022).
[17] Guo, Zhenliang, et al. "CNA: A Dataset for Parsing Discourse Structure on Chinese News Articles." 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2022.
[18] News2016zh: https://opendatalab.com/News2016zh
[19] Zaremba, Wojciech, Ilya Sutskever, and Oriol Vinyals. "Recurrent neural network regularization." arXiv preprint arXiv:1409.2329 (2014).
[20] Shi, Xingjian, et al. "Convolutional LSTM network: A machine learning approach for precipitation nowcasting." Advances in neural information processing systems 28 (2015).
[21] Cui, Yiming, et al. "A span-extraction dataset for Chinese machine reading comprehension." arXiv preprint arXiv:1810.07366 (2018).
[22] Shao, Chih Chieh, et al. "DRCD: A Chinese machine reading comprehension dataset." arXiv preprint arXiv:1806.00920 (2018).
[23] Li, Peng, et al. "Dataset and neural recurrent sequence labeling model for open-domain factoid question answering." arXiv preprint arXiv:1607.06275 (2016).
[24] CAIL2019: https://github.com/china-ai-law-challenge/CAIL2019
[25] Chinese Squad: https://github.com/junzeng-pluto/ChineseSquad
[26] Rajpurkar, Pranav, et al. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[27] ChatGPT: https://openai.com/blog/chatgpt
[28] Che, Wanxiang, et al. "N-LTP: An open-source neural language technology platform for Chinese." arXiv preprint arXiv:2009.11616 (2020).
[29] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Communications of the ACM 60.6 (2017): 84-90.
[30] Clark, Kevin, et al. "ELECTRA: Pre-training text encoders as discriminators rather than generators." arXiv preprint arXiv:2003.10555 (2020).
[31] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
[32] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014, pp. 1532-1543.
[33] Sarzynska-Wawer, Justyna, et al. "Detecting formal thought disorder by deep contextualized word representations." Psychiatry Research 304 (2021): 114135.
[34] Yang, Zhilin, et al. "XLNet: Generalized autoregressive pretraining for language understanding." Advances in neural information processing systems 32 (2019).
[35] Dong, Li, et al. "Unified language model pre-training for natural language understanding and generation." Advances in neural information processing systems 32 (2019).
