| Graduate student: | 馮智詮 Zhi-Quan Feng |
|---|---|
| Thesis title: | 基於生成資料集和進一步預訓練之百科問答系統 Retrieval-based Question-Answering System based on Generated Dataset and Further Pretraining |
| Advisor: | 王家慶 Jia-Ching Wang |
| Committee members: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering |
| Publication year: | 2023 |
| Academic year of graduation: | 111 |
| Language: | Chinese |
| Pages: | 55 |
| Chinese keywords: | 深度學習、自然語言處理、文本檢索、閱讀理解、問答系統 |
| English keywords: | Deep Learning, Natural Language Processing, Document Retrieval, Machine Reading Comprehension, Question Answering System |

In recent years, natural language processing has advanced rapidly, and a wide range of pretraining algorithms for Transformer-based[1] neural language models have been developed, together with accompanying datasets and strong results, from early models such as BERT[2] and RoBERTa[3] to later ones such as DPR[4]. These models serve as dual-tower (DSSM) document retrievers in retrieval-based open-domain question answering and as readers for the downstream reading comprehension task. Such systems have seen many implementations and improvements over the past few years. In Chinese question answering, however, which is the focus of this study, there is often no large open dataset closely matched to the retrieval task for training the dual-tower model and the reader, comparable to the English PAQ[5] dataset. This study therefore uses a generative model to build a large-scale text-question dataset from an open-source Chinese news corpus intended for pretraining, and uses this dataset to strengthen the system's text retrieval capability and the model's reading comprehension ability. Specifically, the system consists of three main parts.
The first part is data collection. The mT5[6] pretrained model is used to generate the required dataset, QNews, and the generated data is then cleaned so that only reasonable questions and texts of appropriate length are retained.
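
The cleaning step can be illustrated with a minimal sketch. The abstract does not state the actual filtering rules, so everything below is an assumption made for illustration: the function name `clean_pairs`, the character-length thresholds, the requirement that a question end with a question mark, and the removal of exact duplicates.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (passage text, generated question)


def clean_pairs(
    pairs: Iterable[Pair],
    min_text_len: int = 50,    # hypothetical lower bound on passage length (characters)
    max_text_len: int = 512,   # hypothetical upper bound on passage length (characters)
    min_q_len: int = 5,        # hypothetical lower bound on question length
    max_q_len: int = 64,       # hypothetical upper bound on question length
) -> List[Pair]:
    """Keep only pairs whose passage and question pass simple heuristics."""
    seen = set()
    kept: List[Pair] = []
    for text, question in pairs:
        text, question = text.strip(), question.strip()
        # Drop passages that are too short or too long.
        if not (min_text_len <= len(text) <= max_text_len):
            continue
        # Drop questions that are too short or too long.
        if not (min_q_len <= len(question) <= max_q_len):
            continue
        # Assumed plausibility check: a question ends with "?" or the full-width "?".
        if not question.endswith(("?", "?")):
            continue
        # Drop exact duplicates.
        if (text, question) in seen:
            continue
        seen.add((text, question))
        kept.append((text, question))
    return kept
```

In practice the thresholds would be tuned to the corpus and to the maximum input length of the retriever and reader.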
The second part performs domain-matched retrieval pretraining of the dual-tower (DSSM) model on the text-question pairs in QNews, in order to improve its retrieval performance.
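
As a rough illustration of how text-question pairs can train a dual-tower retriever, the sketch below computes a contrastive loss with in-batch negatives, the objective popularized by DPR[4]. The abstract does not say which loss the thesis uses, so this choice is an assumption. The two encoders are abstracted away as precomputed vectors, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner-product similarity between a question vector and a passage vector."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_negative_loss(
    question_vecs: List[Sequence[float]],
    passage_vecs: List[Sequence[float]],
) -> float:
    """Mean negative log-likelihood of each question's own passage.

    question_vecs[i] and passage_vecs[i] form a positive pair; every other
    passage in the batch acts as a negative for question i.
    """
    total = 0.0
    for i, q in enumerate(question_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(question_vecs)
```

The loss is small when each question scores highest against its own passage and grows when another passage in the batch scores higher.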
The third part further pretrains the reading comprehension model on a length-sampled version of QNews, while applying constraints that keep the change in model parameters within a limited range.
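
The abstract does not specify how the parameter change is constrained. One simple possibility, shown here purely as an assumption, is to project each parameter back into a fixed interval around its pretrained value after every gradient step. The function name `constrained_sgd_step` and the values of `lr` and `max_delta` are hypothetical.

```python
from typing import List


def constrained_sgd_step(
    params: List[float],
    grads: List[float],
    anchor: List[float],      # parameter values before further pretraining
    lr: float = 0.1,          # hypothetical learning rate
    max_delta: float = 0.5,   # hypothetical maximum allowed drift per parameter
) -> List[float]:
    """One SGD step followed by clamping each parameter near its anchor."""
    updated = []
    for p, g, a in zip(params, grads, anchor):
        stepped = p - lr * g                      # plain gradient descent update
        lower, upper = a - max_delta, a + max_delta
        updated.append(min(max(stepped, lower), upper))  # project into [lower, upper]
    return updated
```

A softer alternative with the same intent is to add a penalty on the squared distance between the current and pretrained parameters to the training loss.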
Through these three steps, this study aims to reduce, to some extent, the mismatch in data format between the dual-tower model's pretraining task and its downstream task in traditional retrieval-based open-domain encyclopedia question answering systems, and to improve the performance of neural language models on the downstream reading comprehension task.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[2] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[3] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
[4] Karpukhin, Vladimir, et al. "Dense passage retrieval for open-domain question answering." arXiv preprint arXiv:2004.04906 (2020).
[5] Lewis, Patrick, et al. "PAQ: 65 million probably-asked questions and what you can do with them." Transactions of the Association for Computational Linguistics 9 (2021): 1098-1115.
[6] Xue, Linting, et al. "mT5: A massively multilingual pre-trained text-to-text transformer." arXiv preprint arXiv:2010.11934 (2020).
[7] Oğuz, Barlas, et al. "Domain-matched pre-training tasks for dense retrieval." arXiv preprint arXiv:2107.13602 (2021).
[8] Martineau, Justin, and Tim Finin. "Delta TFIDF: An improved feature space for sentiment analysis." Proceedings of the International AAAI Conference on Web and Social Media. Vol. 3. No. 1. 2009.
[9] Practical BM25 - Part 2: The BM25 Algorithm and its Variables, https://www.elastic.co/cn/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
[10] Lewis, Mike, et al. "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
[11] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
[12] Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.
[13] Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
[14] Cui, Yiming, et al. "Revisiting pre-trained models for Chinese natural language processing." arXiv preprint arXiv:2004.13922 (2020).
[15] Cui, Yiming, Ziqing Yang, and Ting Liu. "PERT: pre-training BERT with permuted language model." arXiv preprint arXiv:2203.06906 (2022).
[16] Cui, Yiming, et al. "LERT: A Linguistically-motivated Pre-trained Language Model." arXiv preprint arXiv:2211.05344 (2022).
[17] Guo, Zhenliang, et al. "CNA: A Dataset for Parsing Discourse Structure on Chinese News Articles." 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2022.
[18] News2016zh: https://opendatalab.com/News2016zh
[19] Zaremba, Wojciech, Ilya Sutskever, and Oriol Vinyals. "Recurrent neural network regularization." arXiv preprint arXiv:1409.2329 (2014).
[20] Shi, Xingjian, et al. "Convolutional LSTM network: A machine learning approach for precipitation nowcasting." Advances in neural information processing systems 28 (2015).
[21] Cui, Yiming, et al. "A span-extraction dataset for Chinese machine reading comprehension." arXiv preprint arXiv:1810.07366 (2018).
[22] Shao, Chih Chieh, et al. "DRCD: A Chinese machine reading comprehension dataset." arXiv preprint arXiv:1806.00920 (2018).
[23] Li, Peng, et al. "Dataset and neural recurrent sequence labeling model for open-domain factoid question answering." arXiv preprint arXiv:1607.06275 (2016).
[24] CAIL2019: https://github.com/china-ai-law-challenge/CAIL2019
[25] Chinese Squad: https://github.com/junzeng-pluto/ChineseSquad
[26] Rajpurkar, Pranav, et al. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[27] ChatGPT: https://openai.com/blog/chatgpt
[28] Che, Wanxiang, et al. "N-LTP: An open-source neural language technology platform for Chinese." arXiv preprint arXiv:2009.11616 (2020).
[29] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Communications of the ACM 60.6 (2017): 84-90.
[30] Clark, Kevin, et al. "ELECTRA: Pre-training text encoders as discriminators rather than generators." arXiv preprint arXiv:2003.10555 (2020).
[31] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
[32] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014: 1532-1543.
[33] Sarzynska-Wawer, Justyna, et al. "Detecting formal thought disorder by deep contextualized word representations." Psychiatry Research 304 (2021): 114135.
[34] Yang, Zhilin, et al. "XLNet: Generalized autoregressive pretraining for language understanding." Advances in neural information processing systems 32 (2019).
[35] Dong, Li, et al. "Unified language model pre-training for natural language understanding and generation." Advances in neural information processing systems 32 (2019).