Topic-oriented Text Mining on Open-ended Questionnaires using Latent Dirichlet Allocation


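Graduate student:	陳昱儒 Yu-Ju Chen
Thesis title:	基於隱含狄利克雷分布進行開放式問卷之主題導向文字探勘 (Topic-oriented Text Mining on Open-ended Questionnaires using Latent Dirichlet Allocation)
Advisor:	蔡孟峰
Committee members:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Publication year:	2019
Graduation academic year:	107
Language:	Chinese
Number of pages:	43
Keywords:	learning outcomes, text mining, topic model, latent Dirichlet allocation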

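In recent years, as education policy has changed, universities across the country have worked to improve students' learning outcomes. The most commonly used way to evaluate learning outcomes is to survey students each semester with teaching questionnaires. To give students the most direct feedback channel, course questionnaires usually include, besides items on specific topics, an open comment field where students can write their thoughts and suggestions.
Because the comment field is filled in by students as free text, it has no fixed form or rules, so these data cannot be analyzed after simple processing the way ordinary numerical data can. Most of this text has no fixed structure, and because it is written by hand it sometimes contains wording or grammatical errors. This study performs text mining on this unstructured text data from the questionnaires.
Because the text data in teaching evaluations is varied and lacks class labels, supervised classification methods are difficult to apply, so this study uses unsupervised topic analysis to explore the latent topic distribution. Without class labels or training data, a topic model can find topics from the distribution patterns of words across documents and cluster documents with similar topics together. The text clustering method used in this study implements latent Dirichlet allocation with Gibbs sampling, and the resulting model is then used to analyze the topic distribution of newly arriving data.
This study performs topic analysis on the text data in teaching questionnaires and achieves a preliminary automated clustering of that text. We hope it offers questionnaire analysts a more convenient analysis method and can serve as a foundation for future automated analysis of questionnaire text data.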

As education policy has changed over the past few years, universities across the country have committed to improving students' learning outcomes. The most common way of evaluating learning outcomes is through questionnaires, filled in by students in the middle and at the end of each semester. To give students a way to provide more detailed feedback, these questionnaires usually contain a section for free-text comments.
The comment section lets students write any thoughts and opinions, with no restrictions or rules on how they should be written. This human-generated text is unstructured and often contains writing mistakes and misused words. Because it lacks structure, it is hard to process as ordinary data with data mining techniques. We therefore aim to analyze the text data from course evaluation questionnaires through text mining.
Because the content is miscellaneous and human-labeled data is scarce, supervised classification methods are hard to apply to this text. We therefore use an unsupervised topic analysis technique to find the latent topic distribution of the data. Topic modeling can infer latent topic distributions and cluster similar documents without predefined topic labels or training data. We perform topic modeling by implementing latent Dirichlet allocation (LDA) with Gibbs sampling, and then use the fitted LDA model to estimate the topic distributions of unseen data.
In this thesis, we apply topic analysis to the comment section of the course evaluation questionnaire. We believe this automatic topic modeling method can make it more efficient for analysts to work with the text data in questionnaires. Moreover, future work on automatic questionnaire analysis can build on this approach.
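
The approach named in the abstract, LDA fitted by Gibbs sampling and then applied to unseen documents, can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a standard collapsed Gibbs sampler, and the function names `fit_lda` and `infer_topics`, the hyperparameters `alpha` and `beta`, and the toy English comments are all hypothetical choices made for illustration.

```python
import random

def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA with collapsed Gibbs sampling (illustrative sketch only).

    docs: list of tokenized documents (lists of strings).
    alpha, beta: assumed symmetric Dirichlet priors, not values from the thesis.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics
    doc_topic = [[0] * K for _ in docs]       # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]  # times word v is assigned to topic k
    topic_total = [0] * K                     # total tokens assigned to topic k
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed topic-word distributions estimated from the final counts.
    phi = [[(topic_word[k][v] + beta) / (topic_total[k] + V * beta)
            for v in range(V)] for k in range(K)]
    return {"vocab": vocab, "word_id": word_id, "phi": phi, "alpha": alpha}

def infer_topics(model, doc, iters=100, seed=0):
    """Estimate the topic distribution of an unseen document, keeping phi fixed."""
    rng = random.Random(seed)
    phi, alpha = model["phi"], model["alpha"]
    K = len(phi)
    # Words outside the training vocabulary are skipped.
    ids = [model["word_id"][w] for w in doc if w in model["word_id"]]
    counts = [0] * K
    zs = []
    for _ in ids:
        k = rng.randrange(K)
        zs.append(k)
        counts[k] += 1
    for _ in range(iters):
        for i, v in enumerate(ids):
            counts[zs[i]] -= 1
            weights = [(counts[t] + alpha) * phi[t][v] for t in range(K)]
            zs[i] = rng.choices(range(K), weights=weights)[0]
            counts[zs[i]] += 1
    total = len(ids) + K * alpha
    return [(counts[k] + alpha) / total for k in range(K)]

# Toy comments invented for this sketch; real input would be segmented questionnaire text.
toy_docs = [
    ["teacher", "clear", "lecture"],
    ["exam", "hard", "exam"],
    ["teacher", "lecture", "helpful"],
    ["exam", "grade", "hard"],
]
toy_model = fit_lda(toy_docs, num_topics=2, iters=50)
toy_theta = infer_topics(toy_model, ["exam", "hard", "unseenword"])
print(toy_theta)
```

In this sketch an unseen comment is handled by holding the fitted topic-word distributions fixed and resampling only that comment's topic assignments, which is one common way to apply a trained LDA model to new data. The number of topics would normally be chosen with a model selection criterion, such as the topic coherence and evidence lower bound measures listed in the table of contents.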

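Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation and Objectives
1-3 Thesis Organization
2. Literature Review
2-1 Text Mining
2-2 Topic Models and Related Work
2-2-1 Latent Semantic Analysis
2-2-2 Probabilistic Latent Semantic Analysis
2-2-3 Latent Dirichlet Allocation
3. System Workflow and Architecture
3-1 Problem Definition
3-2 System Workflow
3-2-1 Data Preprocessing and Corpus Construction
3-2-2 Building the Topic Model
3-2-3 Model Selection
4. Research Method
4-1 Data Source and Preprocessing
4-1-1 Data Source
4-1-2 Data Preprocessing
4-2 Building the Topic Model
4-2-1 Gibbs Sampling
4-2-2 Implementing LDA with Gibbs Sampling
4-3 Model Selection
4-3-2 Topic Coherence
4-3-1 Evidence Lower Bound
5. Experimental Results and Analysis
5-1 Experimental Setup and Data
5-1-1 Experimental Data
5-2 Analysis of Topic Modeling Results
5-2-1 All Datasets
5-2-2 General Questionnaires
5-2-3 Physical Education and Laboratory Questionnaires
5-3 Purpose-driven Topic Analysis
5-4 Execution Time
6. Conclusion
7. References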

