A Study on Text Classification for Epidemic Prevention News | National Central University Theses and Dissertations System


Graduate student:	劉冠麟 Kuan-Lin Liu
Thesis title:	A Study on Text Classification for Epidemic Prevention News (機器學習分類防疫新聞)
Advisor:	張大中 Dah-Chung Chang
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Executive Master of Communication Engineering
Year of publication:	2020
Academic year of graduation:	108
Language:	Chinese
Number of pages:	78
Keywords (Chinese):	機器學習、文本分類、新聞分類
Keywords (English):	Machine Learning, Text Classification, News Classification

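In December 2019, a novel coronavirus was identified in Wuhan, Hubei, China. It spread rapidly worldwide in early 2020 and grew into a global pandemic, which many international organizations and news media have described as the most severe crisis the world has faced since the Second World War. As of May 2020, more than 220 countries and regions had reported a cumulative total of over 4.71 million confirmed cases and over 350,000 deaths.
This study is set against the background of the global COVID-19 pandemic, during which roughly half or more of the daily news reports in Taiwan relate to COVID-19 or epidemic prevention knowledge. We use decision tree, support vector machine, random forest, and naive Bayes classifiers to separate epidemic prevention news from other news. With only two classes, noise is severe and has a noticeable effect on the accuracy of the random forest and naive Bayes classifiers. In the experiments, the decision tree gives the best result (accuracy: 0.927).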

The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The outbreak was first identified in Wuhan, China, in December 2019. The World Health Organization declared the outbreak a Public Health Emergency of International Concern on 30 January 2020, and a pandemic on 11 March 2020. As of May 2020, more than 4.71 million cases of COVID-19 have been reported in more than 188 countries and territories, resulting in more than 315,000 deaths. More than 1.73 million people have recovered from the virus. This thesis is set against the background of the global COVID-19 pandemic. About half of the daily news reports in Taiwan are related to COVID-19 or epidemic prevention knowledge. This thesis studies different classification methods for COVID-19 epidemic prevention news. Based on practical news data collected from web pages, our simulation results show that the decision tree method achieves the best classification result, with an accuracy of 0.927.
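
The binary task described in the abstract (epidemic prevention news versus other news, scored by accuracy) can be illustrated with a small sketch. The thesis trains its classifiers in Weka; the code below is not that setup. It is a minimal pure-Python multinomial naive Bayes with add-one smoothing, offered only as an assumption-laden illustration. The toy documents, the English tokens standing in for Jieba or CKIP segmentation output, and the names `train_naive_bayes`, `predict`, and `accuracy` are all hypothetical.

```python
import math
from collections import Counter

# Toy, pre-segmented documents (hypothetical data, not from the thesis).
# Label 1 = epidemic prevention news, label 0 = other news.
TRAIN = [
    (["virus", "mask", "quarantine", "cases"], 1),
    (["vaccine", "virus", "hospital", "cases"], 1),
    (["mask", "quarantine", "border", "virus"], 1),
    (["stock", "market", "shares", "profit"], 0),
    (["baseball", "game", "score", "team"], 0),
    (["election", "vote", "party", "market"], 0),
]
TEST = [
    (["virus", "cases", "hospital"], 1),
    (["team", "score", "profit"], 0),
    (["quarantine", "mask", "market"], 1),
]


def train_naive_bayes(docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(predict(model, tokens) == label for tokens, label in docs)
    return correct / len(docs)


model = train_naive_bayes(TRAIN)
print("accuracy:", accuracy(model, TEST))
```

On real news text, the token lists would come from a word segmenter followed by stop-word removal, and the other classifiers named in the abstract would be evaluated on the same held-out split for comparison.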

Abstract (Chinese)
Abstract (English)
Acknowledgements
List of Figures
List of Tables
   Introduction
1.    Research Background
2.    Literature Review
3.    Thesis Organization
4.    COVID-19
   Background
1.    Weka
2.    Visual Studio Code
3.    Python
4.    SQLite
5.    Jieba
6.    CKIP Word Segmentation System
7.    TF-IDF (Term Frequency-Inverse Document Frequency)
8.    Word Embedding
9.    Decision Tree
10.    Random Forest
11.    Naïve Bayes Classifier
12.    Support Vector Machine (SVM)
   Research Content and Methods
1.    Research Framework
2.    Web Crawler
3.    Jieba Word Segmentation
4.    CKIP Word Segmentation
5.    Text Preprocessing
6.    Stop Word Removal
7.    Word Embedding
8.    Decision Tree
9.    Random Forest
10.    Naïve Bayes Classifier
11.    Support Vector Machine (SVM)
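   Experimental Results
1.    Evaluation Method
2.    Experimental Data
3.    Decision Tree Classification Results
4.    Random Forest Classification Results
5.    Naïve Bayes Classification Results
6.    Support Vector Machine (SVM) Classification Results
7.    Comparison of Classification Results with CKIP and Jieba Word Segmentation
8.    Experimental Conclusions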
   Conclusion
1.    Summary
2.    Future Work
   References
   Appendix
1.    Training Classifiers in WEKA
2.    Decision Tree Parameter Settings
3.    Decision Tree ROC Area and PRC Area
4.    Random Forest Parameter Settings
5.    Random Forest ROC and PR Curves
6.    Naïve Bayes Parameter Settings
7.    Naïve Bayes ROC and PR Curves
8.    Support Vector Machine (SVM) Parameter Settings
9.    Support Vector Machine (SVM) ROC and PR Curves


                                

