Author: Yu-Ting Dai (戴郁庭)
Thesis title: Missing value imputation for class imbalance data: a dynamic warping approach (以動態時間校正進行類別不平衡資料之遺漏值處理)
Advisor: 蔡志豐
Committee members:
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2019
Academic year of graduation: 107 (ROC calendar)
Language: Chinese
Number of pages: 60
Keywords (translated from Chinese): class imbalance, missing values, imputation methods, dynamic time warping
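  • In a world awash with data, more and more companies want to use that data to sharpen their competitiveness. In real-world data, however, class imbalance and missing values remain important problems. Class imbalance arises regularly in fields such as medical diagnosis and bankruptcy prediction: the majority class has more samples than the minority class, so the data distribution is skewed. A prediction model built with a standard classifier, which aims for high overall classification accuracy, is influenced by this skew and tends to misclassify samples as the majority class. Moreover, when the valuable minority-class data contain missing values, the usable data points become even scarcer.
    This thesis takes the concept of dynamic time warping (DTW) as its core and imputes missing values in a way that differs from earlier approaches, using the characteristics of DTW to address missing data in minority-class samples. The method does not require complete data rows as references for imputation, so the experiments simulate missing rates of 10%, 30%, 50%, 70%, and 90% in the minority-class data.
    The experiments use 17 KEEL datasets with two classifiers (SVM and decision tree) to build classification models, and compare the AUC (area under the curve) obtained with different imputation methods. On the KEEL datasets, a comparison of DTW-based imputation with K-NN imputation shows that DTW-based imputation still performs well at missing rates of 50% to 90%.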


    In a world full of data, more and more companies want to use that data to improve their competitiveness. However, class imbalance and missing values remain important problems in real-world data. Class-imbalanced datasets occur in many fields, such as medical diagnosis and bankruptcy prediction. In such a dataset the majority class has more samples than the minority class, so the class distribution is skewed. A standard classifier trained to maximize overall classification accuracy is biased by this skew and tends to misclassify minority-class samples as the majority class. If the valuable minority-class data also contain missing values, the usable data become even scarcer.
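    The accuracy pitfall described above can be illustrated with a little arithmetic. The sketch below is not taken from the thesis: the 95/5 class split and the function names are hypothetical, chosen only to show why a classifier that always predicts the majority class looks accurate while never detecting the minority class.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the majority class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical dataset: 95 majority samples (label 0), 5 minority samples (label 1).
labels = [0] * 95 + [1] * 5

print(imbalance_ratio(labels))             # 19.0
print(majority_baseline_accuracy(labels))  # 0.95
# The same baseline never predicts label 1, so its minority-class recall is 0.
```

    This is one reason a threshold-independent measure such as AUC is commonly preferred over plain accuracy on imbalanced data.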
    This thesis uses dynamic time warping (DTW) as the core of a missing value imputation method, exploiting the properties of DTW to handle missing data in the minority class, which contains only a small number of samples. The method does not require complete data samples as references for imputation. The experiments therefore simulate missing rates of 10%, 30%, 50%, 70%, and 90% in the minority-class data.
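    The abstract does not spell out the imputation procedure, so the following is only a minimal sketch of how DTW could support imputation without complete reference rows. It makes several assumptions that are not from the thesis: each row's observed values are treated as a sequence, the classic DTW recurrence with an absolute-difference cost is used, and each missing cell is copied from the closest donor row that has that cell observed. The names `dtw_distance`, `observed`, and `impute_row` are hypothetical.

```python
import math

def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def observed(row):
    """Keep only the non-missing values of a row, in order."""
    return [v for v in row if v is not None]

def impute_row(target, donors):
    """Fill None cells in target from donors ranked by DTW distance (assumed strategy).

    Donors may be incomplete themselves: DTW compares sequences of different
    lengths, so only the observed values of each row are compared.
    """
    key = observed(target)
    ranked = sorted(donors, key=lambda d: dtw_distance(key, observed(d)))
    filled = list(target)
    for idx, value in enumerate(filled):
        if value is None:
            for donor in ranked:
                if donor[idx] is not None:
                    filled[idx] = donor[idx]
                    break
    return filled

# Hypothetical minority-class rows; both donors are themselves incomplete.
target = [1.0, None, 3.0, None]
donors = [[1.0, None, 3.0, 4.0], [10.0, 20.0, None, 40.0]]
print(impute_row(target, donors))  # [1.0, 20.0, 3.0, 4.0]
```

    Because DTW aligns sequences of unequal length, incomplete rows can still be ranked as donors, which a fixed-length Euclidean distance would not allow without first discarding or filling the missing cells.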
    The experiments use 17 KEEL datasets and two classifiers (SVM and decision tree), and compare the AUC (area under the curve) obtained with the different imputation methods. The results show that DTW-based imputation performs well at missing rates of 50% to 90%, where it outperforms KNN imputation.
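    The evaluation protocol can be sketched in two pieces: masking minority-class values at a given missing rate, and scoring a classifier by AUC. The code below is an illustration under stated assumptions, not the thesis's experimental code: cells are masked uniformly at random (an MCAR-style mechanism), the missing rate is applied per cell, and AUC is computed from its pairwise-ranking definition. The names `mask_values` and `auc` are hypothetical.

```python
import random

def mask_values(rows, missing_rate, seed=0):
    """Return a copy of rows with a fraction of cells set to None, chosen uniformly at random."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(rows)) for c in range(len(rows[r]))]
    n_missing = round(missing_rate * len(cells))
    masked = [list(row) for row in rows]
    for r, c in rng.sample(cells, n_missing):
        masked[r][c] = None
    return masked

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical minority-class block: 4 rows x 5 features = 20 cells.
rows = [[float(i + j) for j in range(5)] for i in range(4)]
for rate in (0.1, 0.3, 0.5, 0.7, 0.9):
    masked = mask_values(rows, rate)
    n_none = sum(v is None for row in masked for v in row)
    print(rate, n_none)  # 2, 6, 10, 14, 18 missing cells respectively

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

    In a full pipeline, each masked dataset would be imputed, a classifier trained on the result, and the AUC on held-out data compared across imputation methods and missing rates.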

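    Chinese Abstract
    Abstract
    List of Figures
    List of Tables
    1. Introduction
        1-1 Research Background
        1-2 Research Motivation
        1-3 Research Objectives
        1-4 Thesis Organization
    2. Literature Review
        2-1 The Class Imbalance Problem
        2-2 Literature on Solving the Class Imbalance Problem
            2-2-1 Data Level
        2-3 The Missing Value Problem
            2-3-1 Missing Completely at Random (MCAR)
            2-3-2 Missing at Random (MAR)
            2-3-3 Missing Not at Random (MNAR)
        2-4 Missing Value Imputation Methods
            2-4-1 Case Deletion
            2-4-2 Single Imputation
            2-4-3 K-Nearest Neighbor (KNN)
        2-5 Dynamic Time Warping
    3. Research Method
        3-1 Research Framework
        3-2 Experimental Datasets
        3-3 Imputation with the DTW Algorithm
            3-3-1 All samples can be imputed and no exceptional case occurs
            3-3-2 An exceptional case occurs
            3-3-3 The sample to be imputed consists entirely of missing values
    4. Experimental Results
        4-1 Experimental Setup
            4-1-1 Hardware and Software
        4-2 Experimental Results and Summary
            4-2-1 Results: Support Vector Machines
            4-2-2 Summary: Support Vector Machines
            4-2-3 Results: Decision Tree
            4-2-4 Summary: Decision Tree
        4-3 Discussion
            4-3-1 Discussion 1: The Effect of Lacking Complete Data Samples
            4-3-2 Discussion 2: The Effect of the Imbalance Ratio on Imputation
    5. Conclusion
        5-1 Conclusions and Contributions
        5-2 Future Research Directions and Suggestions
    References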

