Graduate student: 朱啟源
Chi-yuan Chu
Thesis title: 資料前處理之研究:以基因演算法為例
Feature and Instance Selection Using Genetic Algorithms: An Empirical Study
Advisors: 蔡志豐
Chih-fong Tsai
李俊賢
Chun-shien Li
Oral defense committee:
Degree: Master
Department: College of Management - Department of Information Management
Academic year of graduation: 99
Language: Chinese
Number of pages: 62
Chinese keywords: 資料探勘, 特徵選取, 基因演算法, 樣本選取
English keywords: data mining, feature selection, instance selection, genetic algorithms
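  • Feature selection and instance selection are two important data preprocessing techniques in data mining. Given a dataset, feature selection removes irrelevant or redundant features, while instance selection eliminates duplicate and erroneous data. Genetic algorithms have been the most widely applied algorithms for these preprocessing tasks. However, the two tasks have usually been studied separately, so it remains unclear how performing feature selection and instance selection together differs, in efficiency and in results, from performing each one alone. This study therefore uses genetic algorithms for both feature selection and instance selection and examines how the order of the two preprocessing steps affects classification performance on datasets from different domains. The experimental results are based on the performance of classifiers (e.g., support vector machines and k-nearest neighbor) on four large and four small datasets from different domains. The eight datasets differ in their numbers of features and samples, so that the approach can be applied not only across domains but also to datasets that differ widely from one another. Beyond identifying different preprocessing schemes, this study further analyzes dataset characteristics in terms of both accuracy and time efficiency, to explore which kinds of datasets suit which preprocessing method. The goal is to derive rules and guidelines that let datasets from different domains achieve better classification performance or shorter experiment time.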


    Feature selection and instance selection are two important data preprocessing steps in data mining: the former aims to remove irrelevant and/or redundant features from a given dataset, and the latter to discard faulty data. In particular, genetic algorithms have been widely used for these tasks in related studies. However, the two tasks are generally considered separately in the literature, so it is unknown how performing both feature and instance selection compares with performing either one individually. Therefore, the aim of this study is to perform feature selection and instance selection based on genetic algorithms, applied in different orders, and to examine the classification performance over datasets from different domains. Experimental results based on four small-scale and four large-scale datasets containing various numbers of features and data samples show that performing both feature and instance selection usually makes the classifiers (support vector machines and k-nearest neighbor) perform slightly worse than performing feature selection or instance selection individually. However, while there is no significant difference in classification accuracy between these data preprocessing methods, combining feature and instance selection substantially reduces the computational effort of training the classifiers compared with performing either one individually. Considering both classification effectiveness and efficiency, performing both feature and instance selection is the optimal solution for data preprocessing in data mining.
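
To make the idea in the abstract concrete, the following is a minimal sketch of genetic-algorithm-based selection, run as feature selection first and instance selection second (one of the possible orderings). Everything in it is an illustrative assumption rather than the thesis's actual setup: the synthetic data, the 1-nearest-neighbor fitness on a hold-out split, truncation selection, single-point crossover, and the parameter values. The helper names `knn_accuracy` and `evolve_mask` are hypothetical.

```python
import random

def knn_accuracy(train, test, feats):
    """Fraction of `test` rows whose nearest `train` row (1-NN) has the same label."""
    if not train or not feats:
        return 0.0  # an empty selection cannot classify anything
    hits = 0
    for x, y in test:
        nearest = min(train, key=lambda r: sum((r[0][j] - x[j]) ** 2 for j in feats))
        hits += nearest[1] == y
    return hits / len(test)

def evolve_mask(length, fitness, rng, pop_size=20, generations=15, p_mut=0.05):
    """Evolve a 0/1 mask of `length` bits that maximizes `fitness` (assumed GA design)."""
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: keep the better half unchanged (elitism).
        elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, length)          # single-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per bit.
            children.append([bit ^ int(rng.random() < p_mut) for bit in child])
        pop = elite + children
    return max(pop, key=fitness)

# Synthetic two-class data (assumption): feature 0 is informative, features 1-3 are noise.
rng = random.Random(42)
rows = []
for i in range(60):
    label = i % 2
    x = [label + rng.gauss(0, 0.3)] + [rng.gauss(0, 1) for _ in range(3)]
    rows.append((x, label))
train, valid = rows[:30], rows[30:]

# Step 1: feature selection. Each bit says whether a feature column is kept.
feat_mask = evolve_mask(
    4, lambda m: knn_accuracy(train, valid, [j for j, b in enumerate(m) if b]), rng
)
feats = [j for j, bit in enumerate(feat_mask) if bit]

# Step 2: instance selection on the reduced feature set. Each bit keeps a training row.
inst_mask = evolve_mask(
    len(train),
    lambda m: knn_accuracy([r for r, b in zip(train, m) if b], valid, feats),
    rng,
)
reduced = [r for r, bit in zip(train, inst_mask) if bit]

accuracy = knn_accuracy(reduced, valid, feats)
print("kept features:", feats)
print("kept instances:", len(reduced), "of", len(train))
print("hold-out accuracy:", accuracy)
```

Swapping the two steps gives the other ordering. Because the same hold-out split is used both as the fitness signal and for the reported accuracy, the printed figure is optimistic; a separate test split would be needed for a fair comparison.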

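    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    Chapter 1 Introduction
        1.1 Research Background
        1.2 Research Motivation
        1.3 Research Objectives
        1.4 Research Procedure
    Chapter 2 Literature Review
        2.1 Feature Selection
        2.2 Instance Selection
        2.3 Genetic Algorithms
    Chapter 3 Research Method
        3.1 Datasets
        3.2 Data Preprocessing: Feature Selection as an Example
        3.3 Experimental Procedure
        3.4 Parameter Settings of the Genetic Algorithm
        3.5 Classifier Design
    Chapter 4 Experimental Results
        4.1 Results on the Small Datasets
        4.2 Results on the Large Datasets
        4.3 Comparison of Experimental Costs
        4.4 Recommendations Based on the Experimental Results
    Chapter 5 Conclusions and Suggestions
        5.1 Conclusions
        5.2 Future Work and Suggestions
    References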

