| Graduate student: | 許哲彰 Che-Chang Hsu |
|---|---|
| Thesis title: | 不平衡數據的機器學習發展暨可視化辨識模型之應用 Machine learning development of imbalanced data and application of visual recognition model |
| Advisor: | 王國雄 Kuo-Shong Wang |
| Oral defense committee: | |
| Degree: | Doctor |
| Department: | College of Engineering - Department of Mechanical Engineering |
| Year of publication: | 2019 |
| Academic year of graduation: | 107 |
| Language: | Chinese |
| Number of pages: | 89 |
| Chinese keywords: | 重新平衡支持向量機, 可視化辨識模型, 多元尺度變換 |
| Foreign-language keywords: | SVM-rebalancing, visual recognition model, multidimensional scaling |
Imbalanced data sets are a common problem in many machine learning applications. When some classes in the training set have many samples and others have relatively few, traditional classifiers tend to misclassify the minority classes, and correcting this has become a current challenge in machine learning. Working at the algorithm level, this study proposes a new model that combines a Bayesian classifier with a support vector machine (SVM), namely the rebalancing support vector machine (SVM-rebalancing). During learning, the rebalancing parameter (a classification weight parameter) coordinates the classification weights of the classes so that they tend toward balance, and solving the resulting rebalancing programming problem makes minority-class samples effectively identifiable. A secondary aim of this study is to determine whether misclassification stems only from imbalance or whether other factors also contribute. Purely predictive pattern recognition models lack visually interpretable information: black-box methods such as neural networks and support vector machines do not provide an interpretable model, so the root causes of misclassification cannot be traced. This study therefore proposes applying multidimensional scaling to the kernel function as a preprocessing step to construct a visual, low-dimensional representation space for the data. In practice, the visual recognition model shows that overlapping, multimodal, and skewed distributions in the data are additional causes of poor classifier performance. Finally, this study suggests that such a visual recognition model strategy can reveal problems in the data structure and thereby indicate where subsequent improvements should be directed when further gains in classifier performance are sought.
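The idea of applying multidimensional scaling to a kernel function can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a Gaussian (RBF) kernel, classical (eigendecomposition-based) MDS, and a synthetic imbalanced data set. The names `rbf_kernel`, `kernel_mds`, and `gamma` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).

    The kernel choice and gamma are assumptions made for this sketch.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_mds(K, n_components=2):
    """Classical MDS on the distances induced by a kernel matrix K.

    The squared feature-space distance is d_ij^2 = K_ii + K_jj - 2 * K_ij.
    Double-centering -0.5 * d^2 and keeping the leading eigenvectors gives
    low-dimensional coordinates that approximate those distances.
    """
    n = K.shape[0]
    diag = np.diag(K)
    D2 = diag[:, None] + diag[None, :] - 2.0 * K
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ D2 @ J                        # centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.maximum(eigvals[order], 0.0)   # clip tiny negative values
    return eigvecs[:, order] * np.sqrt(top_vals)


# Synthetic imbalanced data (an assumption): 40 majority vs. 5 minority samples.
rng = np.random.default_rng(0)
majority = rng.normal(loc=0.0, scale=1.0, size=(40, 5))
minority = rng.normal(loc=3.0, scale=1.0, size=(5, 5))
X = np.vstack([majority, minority])
y = np.array([0] * 40 + [1] * 5)

embedding = kernel_mds(rbf_kernel(X, gamma=0.1), n_components=2)
print(embedding.shape)  # (45, 2)
```

Plotting the two columns of `embedding` colored by `y` gives a 2-D view in which overlapping, multimodal, or skewed class distributions may become visible.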