| Graduate Student: | 廖珮祺 Pei-Qi Liao |
|---|---|
| Thesis Title: | Instance Selection Methods in Multi-Class Classification Datasets: All versus All, One versus All, and One versus One |
| Advisor: | 蔡志豐 Chih-Fong Tsai |
| Committee Members: | |
| Degree: | Master |
| Department: | College of Management, Department of Information Management |
| Year of Publication: | 2021 |
| Academic Year: | 109 |
| Language: | Chinese |
| Pages: | 102 |
| Keywords: | Data pre-processing, instance selection, feature selection, multi-class dataset, data mining |
The era of big data has arrived. When turning raw data into useful information, a model trained without proper pre-processing can be affected by noise in the data, reducing its predictive ability. Previous research has shown that instance selection methods can effectively extract the representative instances from a dataset, improving model performance and accuracy. However, little of this work discusses whether, when the dataset is multi-class, different processing schemes can improve the effectiveness of instance selection. This thesis therefore investigates the impact on model construction of first applying the multi-class decomposition schemes proposed in this study to a multi-class dataset, and then performing instance selection.

This study proposes three decomposition schemes for multi-class datasets: All versus All (AvA), One versus All (OvA), and One versus One (OvO). Each is paired with three instance selection methods: the instance-based learning algorithm 3 (IB3), the decremental reduction optimization procedure 3 (DROP3), and a genetic algorithm (GA). Support vector machines (SVM) and the k-nearest neighbors (KNN) algorithm serve as classifiers for evaluating which combination yields the best training model. In the second stage of the experiments, a feature selection method is added to examine the impact of combining feature selection with decomposition-based instance selection on the resulting model.

The experiments use 20 multi-class datasets of different types from UCI and KEEL and cover all combinations of decomposition schemes and instance selection methods. The empirical results show that the OvO scheme combined with the DROP3 instance selection algorithm, under the KNN classifier, achieves the best average performance: compared with a KNN baseline built without instance selection, AUC improves by 6.6%.
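The one-versus-one scheme described above can be sketched as follows. This is a minimal illustration, not the thesis code: a k-class dataset is split into k(k-1)/2 binary subsets, an instance selection step is applied to each subset (here a trivial keep-all placeholder stands in for IB3, DROP3, or GA), and a simple 1-NN classifier per pair votes on the final label. The function names (`ovo_subsets`, `ovo_predict`, `keep_all`) are hypothetical, chosen for this sketch.

```python
from itertools import combinations

def keep_all(Xs, ys):
    """Placeholder for an instance selection step (IB3/DROP3/GA in the
    thesis); this stand-in simply keeps every instance."""
    return Xs, ys

def ovo_subsets(X, y, select=keep_all):
    """Split (X, y) into one binary subset per unordered class pair,
    applying instance selection to each subset independently."""
    classes = sorted(set(y))
    subsets = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, label in enumerate(y) if label in (a, b)]
        subsets[(a, b)] = select([X[i] for i in idx], [y[i] for i in idx])
    return subsets

def nearest_label(Xs, ys, q):
    """1-NN by squared Euclidean distance."""
    dists = [sum((xi - qi) ** 2 for xi, qi in zip(x, q)) for x in Xs]
    return ys[dists.index(min(dists))]

def ovo_predict(subsets, q):
    """Each pairwise 1-NN classifier casts one vote; majority wins."""
    votes = {}
    for (a, b), (Xs, ys) in subsets.items():
        lab = nearest_label(Xs, ys, q)
        votes[lab] = votes.get(lab, 0) + 1
    return max(votes, key=votes.get)
```

For a 3-class dataset this produces three binary subsets; swapping `keep_all` for a real reduction algorithm is where the thesis's IB3/DROP3/GA comparison takes place.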