
Graduate student: 陳奕嫻
Yi-Hsien Chen
Thesis title: 結合特徵選取與重採樣技術應用於信用風險預測
Combining Feature Selection and Resampling Techniques for Credit Risk Prediction
Advisor: 蔡志豐
Oral defense committee:
Degree: Master
Department: College of Management - Executive Master of Information Management
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: Chinese
Number of pages: 160
Chinese keywords: 信用風險, 特徵選取, 重採樣, 不平衡資料, 機器學習, 資料探勘
English keywords: Credit Risk, Feature Selection, Resampling, Imbalanced Data, Machine Learning, Data Mining
  • Credit risk management is a core concern for banks, so accurately assessing high-risk loans and building reliable credit scoring models is essential. Traditional machine learning algorithms perform well on balanced data, but when the class distribution is imbalanced they tend to favor the majority class (good credit) and overlook the smaller but more important minority class (bad credit). This bias can cause bad-credit borrowers to be misclassified as good credit, and when those borrowers default, financial institutions may face substantial financial losses.

    To address the imbalance problem, this study combines feature selection with resampling. Five credit risk datasets were collected from public platforms, and three feature selection methods and eight resampling techniques were evaluated in extensive experiments across six classifiers. Through systematic comparison, the study evaluates preprocessing techniques applied individually and in combination, and examines how the order in which they are applied affects the prediction results.

    The study offers an effective combined preprocessing strategy for credit risk management: first resample to balance the dataset, then apply feature selection to pick representative features. Compared with applying either technique alone, this strategy improves predictive performance, particularly on small, highly imbalanced datasets. It helps improve credit scoring models and so supports more accurate identification and handling of high-risk loans.
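
The ordering described in the abstract (resample first, then select features) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: it substitutes plain random oversampling for the SMOTE-family methods, ranks features by information gain on discrete values, and runs on made-up toy data. The function names `random_oversample`, `information_gain`, and `select_top_k` are hypothetical.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (a stand-in for SMOTE)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(X, y, j):
    """Entropy of y minus the entropy left after splitting on discrete feature j."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(row[j], []).append(label)
    remainder = sum(len(g) / len(y) * entropy(g) for g in groups.values())
    return entropy(y) - remainder


def select_top_k(X, y, k):
    """Indices of the k features with the highest information gain."""
    ranked = sorted(range(len(X[0])),
                    key=lambda j: information_gain(X, y, j),
                    reverse=True)
    return sorted(ranked[:k])


# Toy data (invented): 8 good-credit rows (label 0), 2 bad-credit rows (label 1).
# Feature 0 tracks the label, feature 1 is constant, feature 2 just alternates.
X = [[0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0],
     [0, 0, 1], [0, 0, 0], [0, 0, 1], [1, 0, 0], [1, 0, 1]]
y = [0] * 8 + [1] * 2

# Step 1: balance the classes. Step 2: select features on the balanced data.
X_bal, y_bal = random_oversample(X, y)
selected = select_top_k(X_bal, y_bal, k=1)
X_reduced = [[row[j] for j in selected] for row in X_bal]
```

In an actual experiment both steps would normally be fitted on the training split only, so that duplicated or synthetic minority samples do not leak into the test data.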

    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Motivation
      1.3 Research Objectives
      1.4 Research Process
    Chapter 2 Literature Review
      2.1 Related Work on Resampling
      2.2 Related Work on Feature Selection
      2.3 Related Work on Combining Resampling and Feature Selection
      2.4 Summary
    Chapter 3 Research Method
      3.1 Experimental Procedure
      3.2 Data Sources
      3.3 Data Preprocessing
      3.4 Resampling Techniques
        3.4.1 Synthetic Minority Oversampling Technique (SMOTE)
        3.4.2 Borderline-SMOTE
        3.4.3 Adaptive Synthetic Sampling (ADASYN)
        3.4.4 Cluster Centroid
        3.4.5 Edited Nearest Neighbors (ENN)
        3.4.6 Tomek Link
        3.4.7 SMOTE-Tomek
        3.4.8 SMOTE-ENN
      3.5 Feature Selection
        3.5.1 Information Gain (IG)
        3.5.2 Genetic Algorithm (GA)
        3.5.3 Decision Tree (DT)
      3.6 Machine Learning Techniques
        3.6.1 Logistic Regression (LR)
        3.6.2 K-Nearest Neighbors (KNN)
        3.6.3 Support Vector Machine (SVM)
        3.6.4 Random Forest (RF)
        3.6.5 Extreme Gradient Boosting (XGBoost)
        3.6.6 Bootstrap Aggregating (Bagging)
      3.7 Performance Evaluation
    Chapter 4 Experimental Results and Analysis
      4.1 Baseline Model
      4.2 Experiment 1
      4.3 Experiment 2
      4.4 Experiment 3
      4.5 Experiment 4
      4.6 Experimental Results and Evaluation
    Chapter 5 Conclusions and Suggestions
      5.1 Research Conclusions
      5.2 Future Research Directions and Suggestions
    References
    Appendix

