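衡量資料相似度於遺漏值填補之研究 (A Study on Measuring Data Similarity for Missing Value Imputation) | National Central University Electronic Theses and Dissertations System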

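Graduate student:	李妙翎 Miao-Ling Li
Thesis title:	衡量資料相似度於遺漏值填補之研究 (A Study on Measuring Data Similarity for Missing Value Imputation)
Advisor:	蔡志豐 Chih-Fong Tsai
Oral defense committee:
Degree:	Master
Department:	College of Management - Department of Information Management
Year of publication:	2017
Academic year of graduation:	105 (ROC calendar)
Language:	Chinese
Number of pages:	130
Chinese keywords:	資料前處理、遺漏值、補值方法、資料相似性
English keywords:	Data Preprocessing, Missing Value, Imputation Method, Data Similarity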

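Data mining techniques are increasingly applied across many fields, but missing values can make data impossible to analyze or bias the results, so that mining fails to extract useful information. In recent years, researchers have continued to propose new methods, adopt machine learning algorithms, or improve the workflow of existing imputation methods in order to fill in missing values. Their aim is to find imputation methods suited to different domains or data types, or to raise imputation accuracy and reduce the error between predicted values and the original data.
This study proposes an imputation algorithm that measures the similarity between data relative to a data center: Class Center based Missing Value Imputation for Incomplete dataset (CCMVI). The algorithm is grounded in statistical methods, takes into account the class each record belongs to and the similarity between records, and adjusts the imputed values according to the dispersion of the data. In Study 1 and Study 2, datasets of different types and from different domains are selected, and missing values are imputed with CCMVI, a statistical method, the K-nearest neighbors (KNN) algorithm, and the support vector machine (SVM) algorithm. Finally, classification accuracy, error, and execution time are used to evaluate the imputation methods.
Study 1 shows that CCMVI achieves higher classification accuracy than the machine learning algorithms, is slightly slower than the statistical method, and yields error values that differ little from those of the SVM. Overall, CCMVI is suitable for numerical and mixed data. However, the numerical data used in Study 2, which are datasets from the software engineering domain, are not suited to CCMVI. The cause was therefore investigated further, and the distribution of the data was found to affect the choice of imputation method.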
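
The class-center idea described above can be illustrated with a minimal Python sketch. It is not the thesis's CCMVI algorithm, whose exact steps are not given in this record: it only fills each missing value with the observed mean of that feature within the record's class, and measures the Euclidean distance from a record to its class center. The dispersion-based adjustment of imputed values mentioned in the abstract is omitted. The function names and the toy data are assumptions made for illustration.

```python
import math


def class_centers(rows, labels):
    """Per-class mean of each feature, using only observed (non-None) values."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        s = sums.setdefault(label, [0.0] * len(row))
        c = counts.setdefault(label, [0] * len(row))
        for j, value in enumerate(row):
            if value is not None:
                s[j] += value
                c[j] += 1
    # A feature with no observed value in a class has no center (None).
    return {
        label: [total / n if n else None
                for total, n in zip(sums[label], counts[label])]
        for label in sums
    }


def impute_with_centers(rows, labels):
    """Replace each missing value with the center value of the record's class."""
    centers = class_centers(rows, labels)
    return [
        [centers[label][j] if value is None else value
         for j, value in enumerate(row)]
        for row, label in zip(rows, labels)
    ]


def euclidean(a, b):
    """Euclidean distance between two complete numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy data: two numeric features, two classes, None marks a missing value.
rows = [[1.0, 2.0], [3.0, None], [10.0, 20.0], [None, 24.0]]
labels = ["a", "a", "b", "b"]

centers = class_centers(rows, labels)
filled = impute_with_centers(rows, labels)
distance = euclidean(filled[1], centers["a"])
```

In this sketch the distance to the class center is only computed, not acted on; a method that adjusts imputed values by dispersion would need an additional rule, which this record does not specify.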

Data mining technology has been widely applied to problems in many domains. However, problems arise when the collected data contain missing values: analyses of incomplete data are likely to produce biased results, and most data mining algorithms cannot handle such data directly. Recently, many researchers have proposed new imputation methods, applied machine learning techniques to imputation, or modified the imputation process. They aim to reduce error rates, to achieve high classification accuracy, or to identify which methods suit particular kinds of data.
In this thesis, I propose an imputation method that measures data similarity relative to class centers, called Class Center based Missing Value Imputation for Incomplete dataset (CCMVI). In Study 1 and Study 2, CCMVI, a statistical method (mean/mode imputation), KNN, and SVM are used to impute incomplete datasets of different data types and from different domains. To avoid inconsistent results from a single split into 90% training data and 10% testing data, the evaluation is repeated using 10-fold cross-validation. Finally, this thesis uses classification accuracy, error rates, and time efficiency to evaluate the imputation methods.
The results of Study 1 show that CCMVI achieves higher classification accuracy than the machine learning methods, SVM and KNN, while its time efficiency is slightly lower than that of the statistical method. Overall, the proposed CCMVI method is suitable for both numerical and mixed datasets. However, the results of Study 2 show that CCMVI is not suitable for the numerical datasets from the software engineering field. An investigation of the cause found that the distribution of the data influences the results.
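
The error measures named in the thesis outline, RMSE and MAPE, can be sketched as follows, assuming their standard textbook definitions; this record does not state the exact formulas the thesis uses. The function names and sample values are illustrative assumptions.

```python
import math


def rmse(actual, imputed):
    """Root mean square error between original and imputed values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, imputed)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, imputed):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, imputed)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical values: the original entries and the values an imputer filled in.
actual = [2.0, 4.0, 5.0]
imputed = [2.0, 5.0, 4.0]

rmse_value = rmse(actual, imputed)
mape_value = mape(actual, imputed)
```

For these sample values, RMSE is about 0.816 and MAPE is about 15%. MAPE is undefined when an actual value is zero, so the sketch assumes nonzero actual values.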

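Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
List of Appendix Tables
1. Introduction
1-1    Research Background
1-2    Research Motivation
1-3    Research Objectives
1-4    Thesis Structure
2. Literature Review
2-1    Introduction to Missing Values
2-1-1    Missing Completely at Random (MCAR)
2-1-2    Missing at Random (MAR)
2-1-3    Not Missing at Random (NMAR)
2-2    Missing Value Imputation Methods
2-2-1    Single Imputation
2-2-2    Multiple Imputation
2-3    Measuring Data Similarity
2-3-1    Euclidean Distance
2-3-2    Manhattan Distance
2-3-3    Cosine Angle Distance
3. Research Method and Design
3-1    Experimental Framework
3-2    Experimental Datasets
3-2-1    Study 1: CCMVI and Other Imputation Methods Applied to UCI Open Datasets from Various Domains
3-2-2    Study 2: CCMVI and Other Imputation Methods Applied to Software Defect Prediction Datasets in Software Engineering
3-3    Study 1: CCMVI and Other Imputation Methods Applied to UCI Open Datasets from Various Domains
3-3-1    CCMVI Algorithm
3-3-2    Baseline
3-3-3    Support Vector Machine (SVM) Imputation
3-3-4    K-Nearest Neighbors (KNN) Imputation
3-4    Study 2: CCMVI and Other Imputation Methods Applied to Software Defect Prediction Datasets in Software Engineering
3-5    Experimental Validation
3-5-1    Classification Accuracy
3-5-2    Time Efficiency
3-5-3    Root Mean Square Error (RMSE)
3-5-4    Mean Absolute Percentage Error (MAPE)
3-5-5    t-Test: Paired-Samples Test of the Difference Between Population Means
4. Experimental Results
4-1    Experimental Setup
4-1-1    Hardware
4-1-2    Software
4-2    Study 1 Results
4-2-1    Classification Accuracy Analysis
4-2-2    Time Efficiency Analysis
4-2-3    Root Mean Square Error (RMSE) Analysis
4-2-4    Mean Absolute Percentage Error (MAPE) Analysis
4-2-5    t-Test: Paired-Samples Test of the Difference Between Population Means
4-2-6    Summary of Study 1
4-3    Study 2 Results
4-3-1    Classification Accuracy Analysis
4-3-2    Time Efficiency Analysis
4-3-3    Root Mean Square Error (RMSE) Analysis
4-3-4    Mean Absolute Percentage Error (MAPE) Analysis
4-3-5    t-Test: Paired-Samples Test of the Difference Between Population Means
4-3-6    Summary of Study 2
5. Conclusion
5-1    Summary and Discussion
5-2    Contributions and Future Research Directions
References
Appendix 1. Detailed Data for Study 1
1-1    Classification Accuracy
1-2    Root Mean Square Error (RMSE)
1-3    Mean Absolute Percentage Error (MAPE)
Appendix 2. Detailed Data for Study 2
2-1    Classification Accuracy
2-2    Root Mean Square Error (RMSE)
2-3    Mean Absolute Percentage Error (MAPE)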

                                

