| Graduate Student: | 張櫻馨 Ying-Hsin Chang |
|---|---|
| Thesis Title: | A Comparison of Single and Hybrid Feature Selection Methods (基於單一與混合特徵選取方法之比較) |
| Advisor: | 蔡志豐 |
| Oral Defense Committee: | |
| Degree: | Master |
| Department: | College of Management - Department of Information Management |
| Year of Publication: | 2017 |
| Graduation Academic Year: | 105 (ROC calendar) |
| Language: | Chinese |
| Pages: | 89 |
| Keywords (Chinese): | data mining, machine learning, information fusion, feature selection, support vector machines |
| Keywords (English): | KDD, Machine Learning, Information Fusion, Feature Selection, Support Vector Machines |
In our daily lives we face the problem of Big Data, and we must also take the timeliness of data into account. To perform data mining under limited resources and time and to discover interesting patterns, the first consideration is data pre-processing: applying feature-selected data to a classifier improves the model's prediction accuracy and thus helps users make decisions.

This study examines feature selection as a data pre-processing step that removes irrelevant and redundant features (attributes of the data). In other words, a feature selection algorithm is applied to the original dataset to extract useful features, or values that sufficiently represent the whole dataset; these features are then reassembled into a new dataset and fed into a support vector machine (SVM) classifier, with the aim of improving the model's accuracy and execution efficiency.

Most current approaches use single (competitive) feature selection. This study introduces the concept of information fusion: 28 complete datasets are obtained from the UCI repository and other public sources, single (competitive) feature selection is compared with hybrid feature selection, and the effect of data of different dimensionalities and types on each approach is examined, in order to determine whether hybrid feature selection based on information fusion can help handle datasets of various types and substantially improve the prediction model's accuracy.
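The pre-processing pipeline described above (select features, rebuild the dataset, then train an SVM) can be sketched as follows. This is a minimal illustration using scikit-learn, not the thesis's actual experimental code; the dataset, the filter-style selector, and the choice of k = 10 features are all assumptions made for the example:

```python
# Hypothetical sketch: filter-style feature selection feeding an SVM,
# illustrating the pre-processing pipeline described in the abstract.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A UCI-style public dataset bundled with scikit-learn (30 features).
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Keep the 10 most relevant features (k=10 is an arbitrary choice here).
selector = SelectKBest(score_func=f_classif, k=10).fit(X_tr, y_tr)

# Train the classifier on the reduced dataset only.
clf = SVC(kernel="rbf").fit(selector.transform(X_tr), y_tr)
print("accuracy:", clf.score(selector.transform(X_te), y_te))
```

In a real experiment the selector and classifier hyperparameters would be tuned per dataset, typically under cross-validation.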
In today's world we face not only the problem of Big Data, but also the need to account for the timeliness of information. Under limited resources and time, it is important to know how to perform data mining to find interesting patterns. The first consideration is data pre-processing via feature selection: the selected data are used to construct a classifier, which can improve the classification accuracy of the model and help users make decisions.
In this thesis, we examine feature selection as a pre-processing step that removes irrelevant and redundant features (attributes of the data) from a given dataset. In other words, a feature selection algorithm is used to identify useful or representative attributes
of the entire dataset. We reassemble these attributes into a new dataset and then train a support vector machine classifier, aiming to improve the accuracy and efficiency of the model.
Since most related studies focus only on single (competitive) feature selection, this thesis applies the concept of information fusion to combine multiple feature selection results. The experiments are based on 28 public datasets, mainly from the UCI repository. The purpose of this thesis is to
combine multiple feature selection methods and, across different dimensionalities and data types, to understand whether combining different feature selection results can outperform any single result in terms of classification performance.
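The information-fusion idea of combining multiple feature selection results can be sketched as set operations over the feature indices each method selects: a union keeps any feature chosen at least once, while an intersection keeps only features every method agrees on. The three index sets below are hypothetical selector outputs, used purely for illustration:

```python
# Hypothetical sketch: fusing feature subsets chosen by different methods.
def fuse(selections, mode="union"):
    """Combine several collections of selected feature indices.

    mode="union":        keep features chosen by at least one method.
    mode="intersection": keep features chosen by every method.
    """
    sets = [set(s) for s in selections]
    if mode == "union":
        result = set().union(*sets)
    elif mode == "intersection":
        result = set.intersection(*sets)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(result)

# e.g. indices chosen by three hypothetical selectors:
picks = [{0, 2, 5}, {2, 5, 7}, {1, 2, 5}]
print(fuse(picks, "union"))         # -> [0, 1, 2, 5, 7]
print(fuse(picks, "intersection"))  # -> [2, 5]
```

The fused index set would then be used to rebuild the dataset before training the classifier, exactly as in the single-selector case.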