
Author: Jia-Rong Zhong (鍾家蓉)
Title: Deep Learning in Missing Value Imputation (深度學習演算法於遺漏值填補之研究)
Advisor: Chih-Fong Tsai (蔡志豐)
Degree: Master
Department: Department of Information Management, College of Management
Year of Publication: 2021
Academic Year of Graduation: 109
Language: Chinese
Pages: 83
Keywords: Data Mining, Deep Learning, Data Discretization, Missing Value, Data Pre-processing
    With the rapid development of information technology, people can quickly collect large amounts of diverse data, and growing computing power has made data mining techniques increasingly mature. However, data collection inevitably produces missing data. Without appropriate pre-processing, such incomplete data often degrade data mining performance and reduce accuracy. Traditional statistical imputation and machine learning imputation methods exist, but the literature has not examined the effectiveness of deep learning for missing value imputation. Moreover, data discretization can reduce the influence of outliers on predictions and improve model stability, yet no existing study investigates how the order in which discretization and imputation are performed affects prediction accuracy on incomplete data. This thesis therefore analyzes the performance of various imputation methods under different models and, in combination with discretization techniques, examines how the order of data discretization and missing value imputation affects model prediction accuracy.
    This thesis proposes two deep learning architectures, the Deep MultiLayer Perceptron (DMLP) and the Deep Belief Network (DBN), for building missing value imputation models, and compares them with existing statistical and machine learning imputation models. In addition, two discretization techniques, the Minimum Description Length Principle (MDLP) and ChiMerge (ChiM), are combined with the deep learning imputation models in the experiments. Finally, SVM classification accuracy is used to measure the effectiveness of each imputation method.
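The model-based imputation idea can be sketched in a few lines: train a model on the complete rows to predict the column containing missing values, then fill the gaps with its predictions. This is a minimal sketch only; a linear least-squares regressor stands in for the thesis's deep networks (DMLP/DBN), and the function name and toy data are illustrative.

```python
import numpy as np

def impute_with_model(X, target_col):
    """Fill NaNs in one column by regressing it on the other columns,
    trained only on the complete rows. A linear least-squares model
    stands in here for the thesis's DMLP/DBN imputers."""
    X = X.astype(float).copy()
    missing = np.isnan(X[:, target_col])
    others = [c for c in range(X.shape[1]) if c != target_col]
    # Design matrix with an intercept term, complete rows only.
    A_train = np.c_[np.ones((~missing).sum()), X[~missing][:, others]]
    y_train = X[~missing, target_col]
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    # Predict the missing entries from the fitted weights.
    A_miss = np.c_[np.ones(missing.sum()), X[missing][:, others]]
    X[missing, target_col] = A_miss @ w
    return X

# Toy data: column 1 is roughly 2 * column 0; one value is missing.
data = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.0], [4.0, np.nan]])
filled = impute_with_model(data, target_col=1)
```

In the thesis the learned imputer is a deep network and the quality of the filled-in values is judged indirectly, by the SVM classification accuracy obtained on the completed dataset.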
    The experimental results show that the deep learning imputation methods perform well across data types, especially on numeric and mixed datasets, where DMLP and DBN outperform the baseline by 14.70% and 15.88%, and by 8.71% and 7.96%, respectively; imputing missing values in incomplete datasets thus increases classification accuracy. When discretization is applied to numeric data, the combinations using MDLP, whether discretizing before imputing or imputing before discretizing, outperform the other combinations. In particular, MDLP discretization followed by DMLP imputation, and MDLP discretization followed by DBN imputation, yield classification accuracies slightly higher than DMLP imputation alone by 0.74% and 0.52%, and higher than the ChiM baseline by 2.94% and 2.72%, respectively, showing that the pairing of discretization technique and deep learning algorithm affects accuracy.
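The two processing orders compared in the experiments (discretize-then-impute vs. impute-then-discretize) can be illustrated with a toy pipeline. Purely for illustration, this sketch substitutes equal-width binning for MDLP/ChiMerge and mean/mode filling for the deep imputation models; it only demonstrates that the order of the two steps can change the final discretized value.

```python
import numpy as np

def equal_width_bins(x, k=3):
    """Stand-in discretizer (equal-width binning instead of the
    thesis's MDLP/ChiMerge): map values to bin indices 0..k-1,
    preserving missingness."""
    lo, hi = np.nanmin(x), np.nanmax(x)
    edges = np.linspace(lo, hi, k + 1)
    binned = np.clip(np.digitize(x, edges[1:-1]), 0, k - 1).astype(float)
    binned[np.isnan(x)] = np.nan
    return binned

x = np.array([1.0, 2.0, np.nan, 9.0, 10.0])

# Order A: impute first (mean as a stand-in), then discretize.
imputed = np.where(np.isnan(x), np.nanmean(x), x)
order_a = equal_width_bins(imputed)

# Order B: discretize first, then impute (mode of the bins).
binned = equal_width_bins(x)
vals, counts = np.unique(binned[~np.isnan(binned)], return_counts=True)
order_b = np.where(np.isnan(binned), vals[np.argmax(counts)], binned)
```

On this toy vector the missing entry lands in the middle bin under order A but in the lowest bin under order B, which is the kind of interaction the thesis's second experiment measures via downstream classification accuracy.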


    With the evolution of information technology, people can easily collect large amounts of diverse data. Consequently, data mining has been widely adopted in many industries. However, it is unavoidable that the collected data usually contain some missing values. If these missing data are not handled appropriately, the data mining results will be affected and the accuracies of learning models may be degraded. In the related literature, missing value imputation based on statistical analysis and machine learning techniques has shown its applicability to incomplete data problems; however, very few studies examine the imputation performance of deep learning techniques. In addition, data discretization may further reduce the influence of outliers and increase the stability of models. Therefore, this thesis aims to compare the performance of various imputation models, including deep neural networks, namely the Deep MultiLayer Perceptron (DMLP) and the Deep Belief Network (DBN). Moreover, this thesis examines the performance of different orders of combining data imputation and discretization. In particular, the Minimum Description Length Principle (MDLP) and ChiMerge (ChiM) are used as the discretizers.
    The experimental results show that the deep neural networks outperform the other imputation methods, especially on numeric and mixed datasets. For numeric datasets, the accuracies of DMLP and DBN exceed the baseline by 14.70% and 15.88%, respectively, and by 8.71% and 7.96% for mixed datasets. Furthermore, when deep neural networks are combined with MDLP discretization, the performance exceeds the other combinations regardless of the combination order. In particular, the classification accuracies of MDLP + DMLP and MDLP + DBN are slightly higher than DMLP imputation alone by 0.74% and 0.52%, respectively, and higher than the ChiM baseline by 2.94% and 2.72%, respectively. The experiments thus show that performance is affected by the chosen combination of discretizer and deep learning algorithm.
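As an illustration of one of the two discretizers, below is a minimal sketch of ChiMerge: start with one interval per distinct value and repeatedly merge the adjacent pair of intervals with the lowest chi-square statistic on their class-count tables. Stopping at a fixed interval count, rather than Kerber's chi-square significance threshold, is a simplification; all names and data here are illustrative.

```python
import numpy as np

def chimerge(x, y, max_intervals=2):
    """Minimal ChiMerge sketch: bottom-up merging of adjacent
    intervals with the lowest chi-square statistic, stopping when
    `max_intervals` remain (a simplification of the original
    threshold-based stopping rule)."""
    classes = np.unique(y)
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    # One interval per distinct value: [lower bound, class-count vector].
    intervals = []
    for v in np.unique(x):
        counts = np.array([(y[x == v] == c).sum() for c in classes], float)
        intervals.append([v, counts])

    def chi2(a, b):
        # Chi-square statistic of the 2 x n_classes contingency table.
        obs = np.vstack([a, b])
        row, col, n = obs.sum(1, keepdims=True), obs.sum(0), obs.sum()
        exp = row * col / n
        exp[exp == 0] = 1e-9          # avoid division by zero
        return ((obs - exp) ** 2 / exp).sum()

    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = int(np.argmin(scores))
        intervals[i][1] = intervals[i][1] + intervals[i + 1][1]
        del intervals[i + 1]
    return [iv[0] for iv in intervals]   # lower bounds of final intervals

# Feature where values below 5 belong to class 0 and above to class 1.
cuts = chimerge([1, 2, 3, 6, 7, 8], [0, 0, 0, 1, 1, 1], max_intervals=2)
```

Because same-class neighbors produce a chi-square of zero, the merging collapses the pure runs first and leaves a cut exactly at the class boundary, which is the behavior that makes ChiMerge a supervised discretizer.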

    Table of Contents
    Abstract (Chinese); Abstract (English); List of Tables; List of Figures; List of Appendix Tables
    1. Introduction
       1-1 Research Background
       1-2 Research Motivation
       1-3 Research Objectives
       1-4 Thesis Organization
    2. Literature Review
       2-1 Missing Data
           2-1-1 Missing Completely at Random (MCAR)
           2-1-2 Missing at Random (MAR)
           2-1-3 Missing Not at Random (MNAR)
       2-2 Missing Value Imputation
           2-2-1 Traditional Statistical Imputation Methods
           2-2-2 Machine Learning Methods for Missing Value Imputation
       2-3 Deep Learning Algorithms
           2-3-1 Deep MultiLayer Perceptron (DMLP)
           2-3-2 Deep Belief Network (DBN)
       2-4 Data Discretization
           2-4-1 Minimum Description Length Principle (MDLP)
           2-4-2 ChiMerge (ChiM)
    3. Experimental Methods and Design
       3-1 Experimental Framework
       3-2 Experimental Environment
           3-2-1 Hardware and Software
           3-2-2 Datasets
       3-3 Parameter Settings
           3-3-1 Traditional Statistical Imputation Methods
           3-3-2 Machine Learning Algorithms
           3-3-3 Deep Learning Algorithms
           3-3-4 Discretization Algorithms
           3-3-5 Classifier and Evaluation Criteria
       3-4 Experimental Procedure
           3-4-1 Experiment 1
           3-4-2 Experiment 2
    4. Experimental Results
       4-1 Experiment 1 Results
           4-1-1 Categorical Data
           4-1-2 Numeric Data
           4-1-3 Mixed Data
           4-1-4 Summary
       4-2 Experiment 2 Results
           4-2-1 Discretize-then-Impute vs. Impute-then-Discretize
           4-2-2 Summary
    5. Conclusions
       5-1 Summary and Discussion
       5-2 Contributions and Future Work
    References
    Appendix 1: Detailed Classification Accuracy Results
       1-1 Categorical Datasets
       1-2 Numeric Datasets
       1-3 Mixed Datasets
    Appendix 2: Detailed Results for Deep Learning Imputation Combined with Discretization
       2-1 Numeric Datasets

