
Author: 黃靖雅 (Jing-Ya Huang)
Thesis title: Evaluation of missing value imputation methods for the helpfulness of online reviews (遺漏值填補於網路評論有益性資料集之研究)
Advisor: 蔡志豐 (Chih-Fong Tsai)
Oral examination committee:
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar)
Language: Chinese
Number of pages: 61
Keywords (Chinese): 資料前處理, 遺漏值, 補值方法, 網路評論
Keywords (English): data preprocessing, missing value, imputation, online review
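  • Nowadays almost everything can be publicly reviewed, including the newspapers, magazines, and books you have read. Online reviews have come to be regarded as trustworthy, and users can provide them in different forms, such as star ratings, text, images, and videos. Most users also check online reviews before purchasing goods or experiences, and when the amount of information online becomes too large, information overload results. We therefore want to apply data mining to these review data, using machine learning methods to process and filter this large volume of information.
    This study uses an online review helpfulness dataset. During the data cleaning stage, we found that missing data are very common in such real-world data. Moreover, the existing literature offers no comparison of how supervised learning algorithms perform at predicting and imputing missing values on real-world data. We therefore designed two experiments. Experiment 1 applies missing value imputation methods to the reviewer information in the online review helpfulness dataset, which contains missing values, so that a good prediction model can be built to help travelers and hotel operators find the most helpful reviews. Experiment 2 examines other missingness patterns that may arise in the real world: we programmatically simulate 10% to 50% missing data and, besides comparing the performance of the different imputation methods, identify the best imputation method for the online review domain.
    The experiments use three types of techniques: traditional case deletion; imputation with the mean/mode method, KNN, and the support vector machine widely used in academia; and the decision tree method, which is less sensitive to missing values and handles the incomplete data directly without imputation. The results show that using the decision tree to handle the incomplete data directly yields the best classification accuracy. We believe this contribution can help future users handle missing values more appropriately and efficiently, so that they can move on to the data analysis stage sooner.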
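    To make the compared techniques concrete, the following is a minimal Python sketch of case deletion, mean/mode imputation, and a simple KNN imputation. It is not the thesis's implementation: the toy records, the field names (level, age, sex), the function names, and the choice of k = 2 are all assumptions made for illustration, and SVM-based imputation and the decision tree are not covered.

```python
import math
from statistics import mean, mode

# Toy reviewer records (hypothetical fields and values); None marks a missing value.
rows = [
    {"level": 3, "age": 25, "sex": "F"},
    {"level": None, "age": 31, "sex": "M"},
    {"level": 5, "age": None, "sex": "F"},
    {"level": 4, "age": 40, "sex": None},
    {"level": 2, "age": 23, "sex": "F"},
]
NUMERIC = ["level", "age"]
CATEGORICAL = ["sex"]


def case_deletion(data):
    """Keep only the records that have no missing values."""
    return [r for r in data if all(v is not None for v in r.values())]


def mean_mode_impute(data):
    """Fill numeric gaps with the column mean and categorical gaps with the mode."""
    fill = {}
    for col in NUMERIC:
        fill[col] = mean(r[col] for r in data if r[col] is not None)
    for col in CATEGORICAL:
        fill[col] = mode(r[col] for r in data if r[col] is not None)
    return [{c: (fill[c] if v is None else v) for c, v in r.items()} for r in data]


def knn_impute_numeric(data, k=2):
    """Fill each numeric gap with the column mean over the k nearest records.

    Distance is Euclidean over the numeric columns observed in both records;
    records that share no observed numeric column are treated as infinitely far.
    """
    def dist(a, b):
        shared = [c for c in NUMERIC if a[c] is not None and b[c] is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((a[c] - b[c]) ** 2 for c in shared))

    out = []
    for r in data:
        filled = dict(r)
        for col in NUMERIC:
            if r[col] is None:
                donors = [d for d in data if d is not r and d[col] is not None]
                donors.sort(key=lambda d: dist(r, d))
                filled[col] = mean(d[col] for d in donors[:k])
        out.append(filled)
    return out


print(len(case_deletion(rows)), "complete records kept by case deletion")
print(mean_mode_impute(rows)[1])
print(knn_impute_numeric(rows)[1])
```

    Case deletion shrinks the toy dataset from five records to two, which illustrates why it becomes costly as the missing rate grows, whereas the two imputation functions keep every record.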


    In today's world, almost anything can be publicly reviewed, including the newspapers, magazines, and books you have read. Online reviews are considered trustworthy. Users can provide online reviews in several ways, such as star ratings, text, images, and videos. Most users also browse reviews on websites before purchasing goods or experiences. Because the Internet contains so much information, users face constant information overload; data mining techniques can be employed to address this problem.
    This thesis studies the helpfulness of online hotel reviews. During data preprocessing, we found that real-world review datasets very commonly contain missing attribute values. In the literature, no study has focused on comparing the performance of different types of techniques for handling incomplete online review datasets.
    The experiments consist of two studies. In the first study, the dataset is collected from TripAdvisor, where some reviewer-related information is missing, such as reviewer level, age, and sex. Three types of techniques are compared: case deletion; imputation methods, including mean/mode, KNN, and SVM; and handling the incomplete dataset directly, without imputation, using C5.0. In the second study, missing values are simulated in the dataset at missing rates of 10% to 50%. The experimental results of the two studies show that the C5.0 decision tree algorithm is the better choice for dealing with missing values in online review datasets.
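    The simulation of 10% to 50% missing rates could look roughly like the Python sketch below. The abstract does not say which missingness mechanism or which attributes were used, so masking cells uniformly at random (an MCAR-style assumption), the function names, and the toy columns are all hypothetical.

```python
import random


def inject_missing(rows, columns, rate, seed=0):
    """Return a copy of rows with a fraction `rate` of the cells in `columns` set to None.

    Cells are drawn uniformly at random without replacement, which is an
    assumption of this sketch (MCAR-style), not a detail taken from the thesis.
    """
    rng = random.Random(seed)
    cells = [(i, c) for i in range(len(rows)) for c in columns]
    n_missing = round(rate * len(cells))
    masked = [dict(r) for r in rows]  # copy so the complete data stay untouched
    for i, c in rng.sample(cells, n_missing):
        masked[i][c] = None
    return masked


def missing_fraction(rows, columns):
    """Fraction of the cells in `columns` that are None."""
    total = len(rows) * len(columns)
    return sum(r[c] is None for r in rows for c in columns) / total


# Hypothetical complete dataset with two reviewer attributes.
complete = [{"level": i % 6, "age": 20 + i % 40} for i in range(100)]

for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    masked = inject_missing(complete, ["level", "age"], rate)
    observed = missing_fraction(masked, ["level", "age"])
    print(f"target {rate:.0%} -> observed {observed:.0%}")
```

    Fixing the seed keeps each masked dataset reproducible, so that different imputation methods can be compared on identical missing cells.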

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
    1-1 Research Background
    1-2 Research Motivation
    1-3 Research Objectives
    1-4 Research Structure
    2. Literature Review
    2-1 Online Reviews and Helpfulness
    2-2 Introduction to Missing Values
    2-2-1 Missing Completely at Random (MCAR)
    2-2-2 Missing at Random (MAR)
    2-2-3 Missing Not at Random (MNAR)
    2-3 Missing Value Imputation Methods
    2-3-1 Single Imputation
    2-3-2 Multiple Imputation
    3. Research Methodology
    3-1 Experimental Design
    3-2 Experimental Framework
    3-3 Experiment 1
    3-4 Experiment 2
    4. Experimental Results
    4-1 Experiment 1 Results
    4-1-1 Classification Accuracy
    4-1-2 Sensitivity Analysis
    4-1-3 Specificity Analysis
    4-1-4 Summary of Experiment 1
    4-2 Experiment 2 Results
    4-2-1 Experiment 2 (I)
    4-2-1-1 Classification Accuracy
    4-2-1-2 Sensitivity Analysis
    4-2-1-3 Specificity Analysis
    4-2-2 Experiment 2 (II)
    4-2-2-1 Classification Accuracy
    4-2-2-2 Sensitivity Analysis
    4-2-2-3 Specificity Analysis
    5. Conclusions
    5-1 Research Findings
    5-2 Research Contributions and Future Directions
    References
    Appendix 1
    Appendix 2
    Appendix 3

