| Graduate student: | 蔡融易 Jung-Yi Tsai |
|---|---|
| Thesis title: | 主動式學習之古漢語斷詞 (Active Learning for Classical Chinese Word Segmentation) |
| Advisor: | 蔡宗翰 Tzong-Han Tsai |
| Oral examination committee: | |
| Degree: | 碩士 Master |
| Department: | 資訊電機學院 - 軟體工程研究所 Graduate Institute of Software Engineering |
| Year of publication: | 2018 |
| Academic year of graduation: | 106 |
| Language: | Chinese |
| Number of pages: | 48 |
| Keywords (Chinese): | 自然語言處理、主動式學習、古漢語斷詞 |
| Keywords (English): | Natural Language Processing, Active Learning, Classical Chinese Word Segmentation |
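Advanced natural language processing (NLP) techniques now include event extraction, event classification, and automatic summarization. Applied to Classical Chinese, they would be of great help to historians. However, most NLP work on Classical Chinese still addresses basic tasks such as sentence segmentation, word segmentation, and named entity recognition, and relies on supervised learning. Because annotators for Classical Chinese are few and the barrier to entry is high, building training data for supervised methods takes more time, which in turn slows the development of advanced NLP systems. The basic building block of these advanced techniques is the semantic word, so without accurate word segmentation their accuracy suffers directly. We therefore build a word segmentation system for Classical Chinese that, unlike traditional approaches, requires no training data before segmentation.
Existing Chinese word segmentation modules are not suited to Classical Chinese, whose grammar and vocabulary differ too much from modern Chinese, so they cannot be used directly. Training a supervised machine learning model, however, requires a great deal of time and labor to define and annotate semantic words. Annotators of Classical Chinese must also have historical expertise, and annotating passages that lack punctuation lengthens manual annotation further. For these reasons, building a supervised model for Classical Chinese is costly. We therefore segment with an unsupervised model and then use active learning to find segments that are likely to be wrong and present them to a human for correction. Annotators no longer need to check the parts that are already highly accurate, which improves annotation efficiency.
This thesis implements active-learning word segmentation for Classical Chinese and applies it to the Ming Shilu (明實錄). We replace supervised learning, which requires extensive manual annotation, with active learning, and we mitigate the drawback that unsupervised learning needs large amounts of data to improve its precision. A web interface for active learning displays the segments that are likely to be wrong, reducing the number of corrections annotators must make.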
Advanced natural language processing (NLP) currently includes event extraction, event classification, automatic text summarization, and so on. Most NLP techniques for Classical Chinese are still at an early stage, covering tasks such as sentence segmentation, word segmentation, and named entity recognition. These basic applications usually rely on supervised learning. Annotating their training data takes a long time because few people can read Classical Chinese. Advanced NLP for Classical Chinese is therefore difficult to develop. The basic element of most languages is the word, so the accuracy of word segmentation directly affects the performance of advanced NLP. We therefore develop a word segmentation system for Classical Chinese. Unlike traditional word segmentation, it does not need training data.
This thesis focuses on applying active learning to word segmentation of historical texts, and we apply the algorithm to the Ming Shilu. We use active learning because it can significantly reduce annotation effort. We also mitigate a disadvantage of unsupervised models, which need large amounts of data to achieve satisfactory accuracy.
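The workflow summarized in the abstract (segment without training data, then surface the boundaries most likely to be wrong for a human to review) can be illustrated with a minimal sketch. The abstract does not name the unsupervised model or the selection criterion, so everything here is an assumption made for illustration: branching entropy over single-character contexts stands in for the unsupervised segmenter, distance to the decision threshold stands in for the active-learning uncertainty measure, and the function names `branching_entropy`, `score_boundaries`, `segment`, and `uncertain_gaps` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(corpus):
    """Map each character to the entropy (in bits) of the character that follows it.

    Assumption: a one-character context; this is a stand-in, not the thesis's model.
    """
    followers = defaultdict(Counter)
    for line in corpus:
        for i in range(len(line) - 1):
            followers[line[i]][line[i + 1]] += 1
    entropy = {}
    for char, counts in followers.items():
        total = sum(counts.values())
        entropy[char] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def score_boundaries(line, entropy):
    """Score every gap in the line; gap k lies between line[k-1] and line[k].

    A higher score means the next character is harder to predict,
    which is taken here as evidence for a word boundary.
    """
    return [entropy.get(line[k - 1], 0.0) for k in range(1, len(line))]


def segment(line, scores, threshold):
    """Cut the line at every gap whose score reaches the threshold."""
    words, start = [], 0
    for k, score in enumerate(scores, start=1):
        if score >= threshold:
            words.append(line[start:k])
            start = k
    words.append(line[start:])
    return words


def uncertain_gaps(scores, threshold, top_k=3):
    """Return the gap positions whose scores lie closest to the threshold.

    Assumption: these low-margin decisions are the ones shown to the annotator.
    """
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - threshold))
    return [i + 1 for i in ranked[:top_k]]
```

In this sketch an annotator would only be shown the positions returned by `uncertain_gaps`, leaving confident boundaries unreviewed; a real system would feed the corrections back into the segmenter, which is outside the scope of this example.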