
Author: Pei-Ting Wang (王珮庭)
Title: Homogeneous and Heterogeneous Ensemble Resampling Approaches for the Class Imbalance Problem (同質性與異質性集成式重採樣方法於類別不平衡問題之研究)
Advisor: Chih-Fong Tsai (蔡志豐)
Committee members:
Degree: Master
Department: Department of Information Management, College of Management
Year of publication: 2023
Graduation academic year: 111 (2022–2023)
Language: Chinese
Pages: 133
Keywords (Chinese): 資料探勘、類別不平衡、集成式學習
Keywords (English): data mining, class imbalance, ensemble learning


    In the field of data mining, collected data often suffer from quality issues, including duplicate values, missing values, outliers, and data inconsistency, all of which make extracting useful information more difficult. Furthermore, class imbalance has become an important issue in data mining because real-world events occur with different probabilities. This problem degrades predictive performance on minority classes in model prediction and classification, negatively affecting the accuracy and reliability of data analysis.
    Therefore, this thesis focuses on the class imbalance problem. Following previous literature, this study uses data-level approaches, flexibly paired with different classification algorithms, to resample class-imbalanced datasets, and examines whether adjusting the minority-to-majority class ratio under different resampling techniques affects classification performance. Moreover, the existing literature has not considered either integrating individual classifiers trained on differently resampled data into multiple classifiers, or merging differently resampled samples for use with a single classifier or an ensemble classifier. This research therefore proposes homogeneous and heterogeneous methods built on ensemble learning and explores which combination handles the class imbalance problem better under different processing flows.
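    The data-level resampling described above can be sketched in pure Python. The helper `resample_to_ratio` is a hypothetical illustration of random under- and over-sampling to a chosen minority:majority ratio; the thesis's actual implementation and the full set of resampling methods it compares (e.g. SMOTE) are not reproduced here:

    ```python
    import random
    from collections import Counter

    def resample_to_ratio(X, y, minority_label, ratio, method="under", seed=42):
        """Resample (X, y) so the minority:majority size ratio equals `ratio`
        (1.0 = fully balanced). method="under" removes majority samples at
        random; method="over" duplicates minority samples at random."""
        rng = random.Random(seed)
        minority = [(x, l) for x, l in zip(X, y) if l == minority_label]
        majority = [(x, l) for x, l in zip(X, y) if l != minority_label]
        if method == "under":
            target_maj = int(len(minority) / ratio)
            majority = rng.sample(majority, min(target_maj, len(majority)))
        else:  # random oversampling with replacement
            target_min = int(len(majority) * ratio)
            extra = [rng.choice(minority) for _ in range(target_min - len(minority))]
            minority = minority + extra
        data = minority + majority
        rng.shuffle(data)
        return [x for x, _ in data], [l for _, l in data]

    # Toy imbalanced dataset: 10 minority (label 1), 90 majority (label 0).
    X = list(range(100))
    y = [1] * 10 + [0] * 90
    Xr, yr = resample_to_ratio(X, y, minority_label=1, ratio=1.0, method="under")
    print(Counter(yr))  # both classes now have 10 samples
    ```

    Varying the `ratio` argument corresponds to the thesis's question of whether the chosen balance between minority and majority class sizes affects classifier performance.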
    Through experimental results, this study demonstrates that resampling class-imbalanced datasets with data-level techniques during preprocessing effectively improves classification performance, and that the balance ratio between the resampled minority and majority classes significantly influences classifier performance. In the comprehensive comparison of homogeneous and heterogeneous methods, no statistically significant difference was found between the multiple-classifier approach and the single or ensemble classifiers built on merged samples. However, heterogeneous methods are better than homogeneous methods at discovering the best pairing for each classification algorithm, improving classification performance as measured by AUC. These results give later researchers directions for extending and improving ensemble classifiers and provide more options and optimization strategies for tackling class imbalance.
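    The homogeneous/heterogeneous distinction can be sketched as majority voting over classifiers trained on differently resampled subsets. The two toy classifiers below are hypothetical stand-ins for the algorithms the thesis evaluates, and plain voting on 1-D data replaces its AUC-based evaluation for brevity; this is a minimal illustration of the combination structure, not the reported pipeline:

    ```python
    import random

    # Hypothetical stand-in classifiers on 1-D data:
    class CentroidClassifier:
        def fit(self, X, y):
            pos = [x for x, l in zip(X, y) if l == 1]
            neg = [x for x, l in zip(X, y) if l == 0]
            self.c1, self.c0 = sum(pos) / len(pos), sum(neg) / len(neg)
            return self
        def predict(self, x):
            return 1 if abs(x - self.c1) <= abs(x - self.c0) else 0

    class OneNN:
        def fit(self, X, y):
            self.data = list(zip(X, y))
            return self
        def predict(self, x):
            return min(self.data, key=lambda p: abs(p[0] - x))[1]

    def majority_vote(models, x):
        votes = [m.predict(x) for m in models]
        return 1 if sum(votes) * 2 >= len(votes) else 0  # ties go to minority

    def undersample(X, y, seed):
        """Randomly undersample the majority class to the minority size."""
        rng = random.Random(seed)
        minority = [(x, l) for x, l in zip(X, y) if l == 1]
        majority = rng.sample([(x, l) for x, l in zip(X, y) if l == 0],
                              len(minority))
        data = minority + majority
        return [x for x, _ in data], [l for _, l in data]

    # Imbalanced toy data: minority near 10, majority near 0.
    X = [10, 11, 12] + [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
    y = [1, 1, 1] + [0] * 10

    # Homogeneous: the same algorithm trained on different resampled subsets.
    homo = [CentroidClassifier().fit(*undersample(X, y, seed=s)) for s in range(5)]
    # Heterogeneous: different algorithms, each on its own resampled subset.
    hetero = [CentroidClassifier().fit(*undersample(X, y, seed=0)),
              OneNN().fit(*undersample(X, y, seed=1))]

    print(majority_vote(homo, 11), majority_vote(hetero, 1))  # → 1 0
    ```

    Swapping the member list between one algorithm and several is the only change needed to move from the homogeneous to the heterogeneous setting, which is what makes the two designs directly comparable.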

    Table of Contents
    Abstract (Chinese)
    Abstract (English)
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
       1-1 Research Background
       1-2 Research Motivation
       1-3 Research Objectives
       1-4 Thesis Organization
    2. Literature Review
       2-1 The Class Imbalance Problem
       2-2 Methods for Addressing the Class Imbalance Problem
           2-2-1 Data Level
           2-2-2 Algorithm Level
           2-2-3 Hybrid Level
       2-3 Comparison of Related Literature
    3. Research Method
       3-1 Experimental Architecture
       3-2 Experimental Procedure
           3-2-1 Experiment 1
           3-2-2 Experiment 2
       3-3 Experimental Environment
       3-4 Experimental Datasets
       3-5 Experimental Environment and Parameter Settings
       3-6 Classification and Evaluation Criteria
    4. Experimental Results
       4-1 Summary of Experiment 1
           4-1-1 Single Classifiers and Ensemble Classifiers
           4-1-2 Multiple Classifiers
           4-1-3 Experiment 1: Discussion
       4-2 Summary of Experiment 2
           4-2-1 Homogeneous Single Classifiers and Homogeneous Ensemble Classifiers
           4-2-2 Heterogeneous Single Classifiers and Heterogeneous Ensemble Classifiers
           4-2-3 Experiment 2: Discussion
           4-2-4 Data Visualization Analysis
    5. Conclusion
       5-1 Conclusions and Contributions
       5-2 Future Research Directions and Suggestions
    Appendix 1: Detailed Results of Experiment 1: Classification Performance (AUC)
    Appendix 2: Detailed Results of Experiment 2: Classification Performance (AUC)

