| Graduate Student: | 王珮庭 (Pei-Ting Wang) |
|---|---|
| Thesis Title: | 同質性與異質性集成式重採樣方法於類別不平衡問題之研究 (Homogeneous and Heterogeneous Ensemble Resampling Approaches for the Class Imbalance Problem) |
| Advisor: | 蔡志豐 (Chih-Fong Tsai) |
| Oral Defense Committee: | |
| Degree: | Master (碩士) |
| Department: | College of Management, Department of Information Management (資訊管理學系) |
| Year of Publication: | 2023 |
| Academic Year of Graduation: | 111 |
| Language: | Chinese |
| Pages: | 133 |
| Chinese Keywords: | 資料探勘、類別不平衡、集成式學習 |
| English Keywords: | data mining, class imbalance, ensemble learning |
在資料探勘領域中,資料的收集往往伴隨著各種資料品質問題,包括:數據含有重複值 (duplicate values)、遺漏值 (missing values)、離群值 (outliers)、資料格式不一 (data inconsistency) 等,這些問題也間接提高了提取有用資訊的困難度。此外,由於現實世界中各類事件發生的機率不同,類別不平衡 (class imbalance) 問題也成為資料探勘中一個很重要的課題。此問題會導致模型在預測和分類時,對少數類別的預測性能下降,並對資料分析的準確性和可靠性產生負面影響。
因此,本論文主要探討類別不平衡問題。根據過往文獻,本研究以資料層級方法,彈性搭配不同的分類演算法,對類別不平衡資料集進行重採樣,探討在不同重採樣方法下,調整大小類別比例是否影響分類性能。另外,現有文獻尚未提出將以不同重採樣方法訓練的單一分類器集成為多重分類器,也未提出將不同重採樣樣本合併後搭配單一分類器或集成式分類器。因此,本研究以集成式方法 (ensemble method) 為基礎,提出同質性 (homogeneous) 與異質性 (heterogeneous) 方法,探討在不同處理流程下,哪種組合方式能更好地處理類別不平衡問題。
本研究透過實驗結果證明,在資料前處理中以資料層級方法對類別不平衡資料集進行重採樣,能有效改善分類表現,且重採樣後的大小類別平衡比例對分類器表現有顯著影響。在同質性與異質性方法的全面比較中,多重分類器與樣本合併方法 (搭配單一分類器或集成式分類器) 在統計上並無顯著差異;但相較於同質性方法,異質性方法更能在不同分類演算法上發掘最佳的搭配方式,提升分類表現 (AUC)。這些實驗結果為後續研究者提供進一步拓展與改進集成式分類器的方向,並為解決類別不平衡問題提供更多選擇與優化策略。
In the field of data mining, data collection often comes with various data quality issues, including duplicate values, missing values, outliers, and data inconsistency, all of which make it harder to extract useful information. Furthermore, class imbalance has become an important issue in data mining because real-world events occur with unequal frequencies. This problem degrades predictive performance on minority classes in model prediction and classification, negatively impacting the accuracy and reliability of data analysis.
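To make the minority-class problem concrete, here is a minimal pure-Python illustration with hypothetical numbers (not taken from the thesis): on a 95:5 test set, a degenerate classifier that always predicts the majority class still scores high accuracy while never detecting the minority class.

```python
# Hypothetical numbers, not from the thesis: a 95:5 imbalanced test set and a
# degenerate classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5   # 0 = majority, 1 = minority
y_pred = [0] * 100            # "always majority"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy)         # 0.95 -- looks excellent
print(minority_recall)  # 0.0  -- the minority class is never detected
```

This is exactly why imbalance-aware metrics such as AUC, rather than plain accuracy, are used to evaluate classifiers on imbalanced data.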
Therefore, this paper focuses on the class imbalance problem. Based on previous literature, this study employs data-level approaches, flexibly paired with different classification algorithms, to resample class-imbalanced datasets, and explores whether adjusting the majority-to-minority class ratio under different resampling techniques affects classification performance. Moreover, the existing literature proposes neither ensembling individual classifiers trained on differently resampled data into a multiple-classifier system, nor merging differently resampled samples for use with a single or ensemble classifier. This research therefore proposes homogeneous and heterogeneous methods built on the ensemble method, and explores which combination handles class imbalance best under different processing flows.
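The homogeneous/heterogeneous distinction described above can be sketched in plain Python. This is an illustrative toy, not the thesis's actual pipeline: `random_oversample`, `random_undersample`, `NearestCentroid`, and `OneNN` are hypothetical stand-ins for the resampling methods and base learners the study actually combines.

```python
import random
from collections import Counter

def random_oversample(X, y, ratio=1.0, rng=None):
    """Duplicate random minority instances until the minority class
    reaches `ratio` times the majority class size (data-level resampling)."""
    rng = rng or random.Random(0)
    counts = Counter(y)
    maj = max(counts, key=counts.get)
    mino = min(counts, key=counts.get)
    pool = [x for x, lab in zip(X, y) if lab == mino]
    need = int(ratio * counts[maj]) - counts[mino]
    extra = [rng.choice(pool) for _ in range(max(0, need))]
    return list(X) + extra, list(y) + [mino] * len(extra)

def random_undersample(X, y, ratio=1.0, rng=None):
    """Drop random majority instances until both classes have equal size
    (for ratio=1.0)."""
    rng = rng or random.Random(0)
    counts = Counter(y)
    maj = max(counts, key=counts.get)
    mino = min(counts, key=counts.get)
    majority = [x for x, lab in zip(X, y) if lab == maj]
    minority = [x for x, lab in zip(X, y) if lab == mino]
    kept = rng.sample(majority, min(int(counts[mino] / ratio), len(majority)))
    return kept + minority, [maj] * len(kept) + [mino] * len(minority)

class NearestCentroid:
    """Tiny stand-in base learner: predicts the class with the closest mean."""
    def fit(self, X, y):
        self.centroids = {
            lab: [sum(c) / len(c)
                  for c in zip(*[x for x, l in zip(X, y) if l == lab])]
            for lab in set(y)}
        return self
    def predict(self, x):
        return min(self.centroids, key=lambda lab: sum(
            (a - b) ** 2 for a, b in zip(x, self.centroids[lab])))

class OneNN:
    """A second stand-in learner so the heterogeneous ensemble mixes types."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
        return self.y[dists.index(min(dists))]

def vote(members, x):
    """Combine ensemble members by majority vote."""
    return Counter(m.predict(x) for m in members).most_common(1)[0][0]

# Toy imbalanced dataset: 12 majority points near (0, 0), 3 minority near (4, 4).
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(12)] \
  + [(data_rng.gauss(4, 1), data_rng.gauss(4, 1)) for _ in range(3)]
y = [0] * 12 + [1] * 3

# Homogeneous: the same learner type, each member fit on a different resample.
homo = [NearestCentroid().fit(*random_oversample(X, y, rng=random.Random(s)))
        for s in range(3)]

# Heterogeneous: different learner types paired with different resamplings.
hetero = [NearestCentroid().fit(*random_oversample(X, y, rng=random.Random(7))),
          OneNN().fit(*random_undersample(X, y, rng=random.Random(7))),
          OneNN().fit(*random_oversample(X, y, rng=random.Random(8)))]

pred_homo = vote(homo, (4.0, 4.0))      # minority region -> label 1
pred_hetero = vote(hetero, (0.0, 0.0))  # majority region -> label 0
print(pred_homo, pred_hetero)
```

The key structural point survives the simplification: a homogeneous ensemble varies only the resampled training data, while a heterogeneous ensemble also varies the base learner, which is what gives it more combinations to search over.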
Through experimental results, this study demonstrates that resampling class-imbalanced datasets with data-level techniques during preprocessing can effectively improve classification performance, and that the balance ratio between the resampled minority and majority classes significantly influences classifier performance. In the comprehensive comparison between homogeneous and heterogeneous methods, there is no statistically significant difference between the multiple-classifier approach and the sample-merging approach with either a single or an ensemble classifier. However, heterogeneous methods are better than homogeneous methods at discovering the best pairings across different classification algorithms, improving classification performance (AUC). These results give later researchers directions for extending and improving ensemble classifiers, and offer additional options and optimization strategies for the class imbalance problem.
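Since the thesis reports its results as AUC, here is a minimal reminder of what that metric measures: the probability that a randomly chosen positive instance is scored above a randomly chosen negative one (the Mann-Whitney U formulation). The labels and classifier scores below are hypothetical, not taken from the thesis.

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs where the positive is scored higher,
    counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores from two classifiers on the same imbalanced test set.
labels   = [1, 1, 0, 0, 0, 0, 0, 0]
scores_a = [0.9, 0.6, 0.7, 0.4, 0.3, 0.2, 0.5, 0.1]
scores_b = [0.8, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

auc_a = auc(labels, scores_a)  # 11/12, one negative outranks a positive
auc_b = auc(labels, scores_b)  # 1.0, perfect ranking
print(auc_a, auc_b)
```

Because AUC depends only on the ranking of scores, it is insensitive to the class prior, which is why it is a standard choice for comparing classifiers on imbalanced data.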