
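Graduate Student: Rui-Wen Cai (蔡瑞文)
Thesis Title: Interaction Effects of Data Normalization, Discretization, and Data Balancing (A Case Study of Binary-Class Imbalanced Datasets for Breast Cancer Prediction)
Advisor: Chih-Fong Tsai (蔡志豐)
Oral Defense Committee:
Degree: Master
Department: College of Management - Executive Master of Information Management
Year of Publication: 2022
Academic Year of Graduation: 110 (ROC calendar)
Language: Chinese
Number of Pages: 65
Keywords: Normalization, Discretization, Synthetic Minority Over-sampling Technique, Data Pre-processing, Interaction Effects, Machine Learning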
  • As science and technology have advanced, people's diets and lifestyles have changed, and so have the diseases they suffer from. In Taiwan, 18,536 people died of cancer in 1990; by 2020 the number had risen to 50,161, an overall increase of 2.7 times. Over the same period, deaths from breast cancer rose from 619 to 2,655, a 4.29-fold increase that is well above the overall increase in cancer deaths. This situation can be improved, however. The survival rate for breast cancer treated early (stages 0 and 1) can exceed 95%, which shows the importance of early detection and early treatment. If accurate analysis of breast cancer data can be provided for medical staff's reference, they can identify the disease at an early stage and give appropriate treatment, improving the survival rate of breast cancer patients.

    This study proposes a method for breast cancer data analysis and prediction that combines multiple data pre-processing steps with classification algorithms. The data are pre-processed with normalization, discretization, and the Synthetic Minority Over-sampling Technique (SMOTE). Prediction models are then built with support vector machine, k-nearest neighbor, decision tree, and random forest algorithms under five-fold cross-validation, and compared with the models built using the corresponding single pre-processing step, to observe how the interaction of multiple pre-processing steps affects the prediction models.
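
    To make two of the pre-processing steps named above concrete, the following is a minimal Python sketch of min-max normalization followed by SMOTE-style oversampling. It is not the thesis's implementation: the function names, the toy data, the choice of min-max scaling as the normalization, and the parameters `n_new`, `k`, and `seed` are all assumptions made for illustration.

```python
import random

def min_max_normalize(rows):
    """Rescale every feature column to the [0, 1] range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Hypothetical toy data: six majority-class rows and two minority-class rows.
majority = [[1.0, 200.0], [2.0, 220.0], [3.0, 210.0],
            [4.0, 240.0], [5.0, 230.0], [6.0, 250.0]]
minority = [[8.0, 400.0], [9.0, 420.0]]

# Normalize first so that both features contribute comparably to the
# distances used when SMOTE picks neighbours.
features = min_max_normalize(majority + minority)
norm_major, norm_minor = features[:6], features[6:]

# Oversample the minority class until both classes have the same size.
balanced_minor = norm_minor + smote(
    norm_minor, n_new=len(norm_major) - len(norm_minor), k=1
)
```

    The ordering in the sketch reflects one plausible reason the two steps interact: SMOTE chooses neighbours by distance, so unscaled features with large ranges would dominate the interpolation. The abstract does not state how oversampling was combined with the cross-validation folds; a common practice is to oversample the training folds only, so that synthetic samples never appear in the test fold.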

    The experiments use a large X-ray image dataset from KDD and a small fine needle aspiration (FNA) image dataset from UCI. Different pre-processing steps were applied together and combined with each algorithm to build the models. In every prediction model, normalization combined with SMOTE improved AUC more than either pre-processing step applied alone, and the support vector machine showed the largest AUC improvement. The experiments indicate that when a support vector machine is used for prediction on the X-ray image dataset with severe class imbalance, applying normalization and SMOTE first yields a model with better predictive value. For the FNA image dataset with mild class imbalance, normalization plus SMOTE also improved the results, but the difference was not pronounced.
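
    The results above are reported as AUC. As a hedged illustration of why AUC is informative on class-imbalanced data, the sketch below computes AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative case, with ties counted as one half. The function name `auc` and the labels and scores are hypothetical and do not come from the thesis.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores; a tie between a positive and a negative counts as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical imbalanced labels: eight negatives, two positives.
labels = [0] * 8 + [1] * 2

# A model that gives every case the same score separates nothing.
constant_scores = [0.1] * 10

# A model that ranks most negatives below the positives.
ranked_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 0.8, 1.0]

auc_constant = auc(labels, constant_scores)
auc_ranked = auc(labels, ranked_scores)
```

    With these toy values, the constant model would reach 80% accuracy by labelling every case negative, yet its AUC is only 0.5, while the ranking model reaches 0.9375. Accuracy can therefore look acceptable on a heavily imbalanced dataset even when the minority class is not detected at all, which AUC does not reward.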

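    Abstract (Chinese)
    Abstract
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Research Background
        1.2 Research Motivation
        1.3 Research Objectives
        1.4 Thesis Structure
    Chapter 2 Literature Review
        2.1 Breast Cancer Characteristics and Factors
        2.2 Machine Learning Techniques
            2.2.1 Supervised Learning
            2.2.2 Support Vector Machine (SVM)
            2.2.3 K-Nearest Neighbor (K-NN)
            2.2.4 Decision Tree (DT)
            2.2.5 Random Forest (RF)
        2.3 Pre-processing
            2.3.1 Normalization
            2.3.2 Discretization
            2.3.3 Synthetic Minority Over-sampling Technique (SMOTE)
        2.4 Review and Discussion of Related Literature
    Chapter 3 Research Method
        3.1 Research Framework
        3.2 Data Mining Software
        3.3 Experimental Datasets
        3.4 Pre-processing Procedure and Dataset Splitting
        3.5 Prediction Model Evaluation
    Chapter 4 Experimental Results
        4.1 KDD Breast Cancer (2008)
        4.2 Breast Cancer Wisconsin (Diagnostic)
        4.3 Model Performance Evaluation
    Chapter 5 Conclusions and Suggestions
        5.1 Conclusions
        5.2 Future Research Directions and Suggestions
        5.3 Research Limitations
    References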

