| Graduate Student: | 龔健生 Chien-Shen Kung |
|---|---|
| Thesis Title: | A Study of Classification Techniques for Class Imbalanced Datasets (分類技術於類別不平衡資料集之研究) |
| Advisor: | 蔡志豐 |
| Committee Members: | |
| Degree: | Master |
| Department: | College of Management - Executive Master of Information Management |
| Year of Publication: | 2016 |
| Academic Year of Graduation: | 104 |
| Language: | Chinese |
| Pages: | 57 |
| Keywords (Chinese): | 資料探勘, 類別不平衡問題, 接收者操作特徵曲線, 曲線下面積 |
| Keywords (English): | Data Mining, Class Imbalanced Problem, ROC, AUC |
Most binary classification data generated in real life suffer from the class imbalance problem; examples include bankruptcy records, rare-disease diagnoses, and accident casualties. When training a classifier, traditional binary classification algorithms often produce prediction bias due to class imbalance, which harms classification accuracy, and their results tend to favor the majority-class samples. In recent years, scholars and researchers have proposed many solutions to the class imbalance problem, yet no related study has identified the most suitable baseline classifier.
Through the proposed research framework, this study conducts experiments on 44 binary-class datasets with different imbalance ratios from the KEEL website, in order to identify the baseline classifiers best suited to class imbalance research and to provide a reference for scholars and researchers.
In our daily life, most datasets exhibit the class imbalance problem, in which one class contains a very large number of data samples whereas the other class contains a very small number. Examples include bankruptcy information, rare diseases, and accidental casualties. In the process of training a classifier, traditional binary classification algorithms generate prediction bias because of class imbalanced datasets, and the results also tend to favor the majority class samples. In recent years, a considerable number of scholars have proposed solutions for the class imbalanced problem.
In this study, different from related works that propose novel algorithms to enhance the performance of existing classification techniques, we focus on finding the best baseline classifier for the class imbalance problem domain. The findings of this study provide a guideline for future research to compare novel algorithms against the identified baseline classifier.
The experiments are based on 44 datasets from various domains with different imbalance ratios, on which three popular classifiers, i.e., J48, MLP, and SVM, are constructed and compared. Moreover, classifier ensembles built with the bagging and boosting methods are also developed. The results show that the bagging-based MLP classifier ensembles perform the best in terms of the AUC rate.
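As a minimal sketch of the comparison described above (not the thesis's actual code, which relied on WEKA-style classifiers such as J48), the following Python example uses scikit-learn stand-ins: a single decision tree (in place of J48) versus a bagging-based MLP ensemble, both scored by cross-validated AUC on a synthetic imbalanced dataset standing in for the KEEL data.

```python
# Hedged sketch: comparing a single decision tree with a bagging-based MLP
# ensemble by AUC on an imbalanced binary dataset. All names and parameters
# here are illustrative assumptions, not the thesis's actual configuration.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced dataset (roughly 9:1 majority-to-minority ratio).
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "bagged_mlp": BaggingClassifier(
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
        n_estimators=10,
        random_state=0,
    ),
}

# AUC via stratified 5-fold cross-validation; AUC is preferred over plain
# accuracy on imbalanced data because it is insensitive to the class ratio.
aucs = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
print(aucs)
```

The same loop extends naturally to MLP, SVM, and boosted variants; the key design point mirrored from the study is that every model is compared on the identical folds using the AUC metric rather than raw accuracy.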