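Graduate student: 黃星瑋 (Hsing-Wei Huang)
Thesis title: 正規化與變數篩選在破產領域的適用性研究 (A Study of the Applicability of Normalization and Feature Selection in the Bankruptcy Domain)
Advisor: 蘇坤良
Oral defense committee:
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar)
Language: Chinese
Number of pages: 88
Keywords (Chinese, translated): Machine Learning, Bankruptcy Analysis, Normalization, Class Imbalance, Feature Selection
Keywords (English): Machine Learning, Bankruptcy Prediction, Normalize, Class Imbalance, Feature Selection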
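  • In bankruptcy analysis, class imbalance is unavoidable, because in the real world bankrupt companies are always fewer than non-bankrupt ones. In the past, decisions on whether to lend money to another company relied on traditional statistical methods or personal intuition, which often exposed the lender to the risk of bankruptcy. Many researchers have therefore turned to machine learning for such problems, hoping to give banks an accurate classification model that decides whether to lend, thereby reducing the likelihood of bankruptcy.
    Many machine learning algorithms apply built-in normalization when building a model, because normalization both shortens classifier training time and makes the data easier to read. Many studies state whether the bankruptcy dataset was normalized, but none has examined whether, in the bankruptcy domain, normalization necessarily improves classification results, or whether the class imbalance ratio of the dataset and the feature selection method affect the applicability of normalization.
    This study takes two real-world datasets, from Taiwan and mainland China, and simulates five class imbalance ratios: 1, 2, 5, 10, and 20. It then compares results before and after normalization to see whether the effect differs across classifiers, and so examines the applicability of normalization at different imbalance ratios in the bankruptcy domain. The study also uses three feature selection methods, GA, CART, and Information Gain, to investigate how feature selection influences the effect of normalization under different imbalance ratios, with the aim of determining whether normalization is really suitable for the bankruptcy domain.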


    In the field of bankruptcy prediction, class imbalance is unavoidable, because in the real world bankrupt companies are fewer than non-bankrupt companies. In the past, lenders relied on traditional statistical methods or personal intuition to decide whether to lend money to other companies, but this often put the lender at risk of bankruptcy. Many researchers have begun to use machine learning to solve such problems, hoping to provide banks with an accurate classification model.
    Many scholars indicate whether their study normalized the bankruptcy data. However, no research has examined whether normalization improves the classification results. In our study, we simulate five class imbalance ratios on two real datasets: 1, 2, 5, 10, and 20. In this way, we examine the relationship between the imbalance ratio and normalization. Furthermore, our study also considers feature selection, in order to learn whether normalization really applies to bankruptcy prediction.
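
    The preprocessing described in the abstract can be illustrated with a minimal Python sketch. The record does not say how the imbalance ratios were produced or which normalization formula was used, so the sketch makes two assumptions: ratios are simulated by randomly undersampling the non-bankrupt class, and normalization is min-max scaling to [0, 1]. The function names and the toy data are hypothetical and are not taken from the thesis.

```python
import random


def subsample_to_ratio(majority, minority, ratio, seed=0):
    """Keep every minority row and draw ratio * len(minority) majority rows.

    Assumption: the imbalance ratio means majority count / minority count.
    """
    n_major = int(ratio * len(minority))
    if n_major > len(majority):
        raise ValueError("not enough majority rows for this ratio")
    rng = random.Random(seed)
    return rng.sample(majority, n_major) + list(minority)


def min_max_normalize(rows):
    """Rescale each column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical toy data: each row holds two financial ratios.
non_bankrupt = [[float(i), float(i * 100)] for i in range(1, 41)]
bankrupt = [[0.5, 10.0], [0.2, 5.0]]

for ratio in (1, 2, 5, 10, 20):
    subset = subsample_to_ratio(non_bankrupt, bankrupt, ratio)
    scaled = min_max_normalize(subset)
    print(ratio, len(subset), len(scaled))
```

    In a cross-validated experiment, the column minima and maxima would normally be computed on the training folds only and then applied to the held-out fold; the sketch scales the whole subset at once for brevity.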

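    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
      1-1 Research Background
      1-2 Research Motivation
      1-3 Research Objectives
      1-4 Research Structure
    2. Literature Review
      2-1 Class Imbalance
      2-2 Addressing the Class Imbalance Problem
        2-2-1 Undersampling the Majority Class
        2-2-2 Oversampling the Minority Class
      2-3 Classifiers
        2-3-1 Naïve Bayes Classifier
        2-3-2 Support Vector Machine (SVM)
        2-3-3 Decision Tree (DT)
        2-3-4 Artificial Neural Network (ANN)
      2-4 Feature Selection (FS)
        2-4-2 Genetic Algorithm (GA)
        2-4-3 Information Gain
        2-4-4 CART Decision Tree (Decision Tree CART, DT)
      2-5 Normalization
      2-6 Evaluation Metrics
        2-6-1 AUC (Area Under ROC Curve)
        2-6-2 Type II Error
      2-7 Related Work
      2-8 Summary and Comparison of the Feature Selection Literature
    3. Methodology
      3-1 Datasets
      3-2 Study 1: Effect of Normalization at Different Imbalance Ratios
        3-2-1 10-Fold Cross-Validation
        3-2-2 Evaluation Criteria
      3-3 Study 2: Feature Selection and Normalization
    4. Experimental Results
      4-1 Effect of Normalization at Different Imbalance Ratios
        4-1-1 Mainland China Data: Comparison and Analysis with Normalization
        4-1-2 Taiwan Data: Comparison and Analysis with Normalization
        4-1-3 Summary: Class Imbalance Ratio and Normalization
      4-2 Feature Selection and Normalization
        4-2-1 Order of Feature Selection and Normalization
        4-2-2 Effect of Feature Selection on Normalization at Different Imbalance Ratios
      4-3 Comparison of Classification Results on Class-Balanced and Original Data
      4-4 Validation of the Best Preprocessing Approach
    5. Conclusion
      5-1 Conclusions and Contributions
      5-2 Future Work
    6. References
    7. Appendix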

