Graduate student: 藺禹筑 (Yu-Zhu Lin)
Thesis title: A Compression-Based Partitioning Estimate Classifier
Advisors: 陳春樹 (Chun-Shu Chen), 張明中 (Ming-Chung Chang)
Oral defense committee:
Degree: Master
Department: College of Science - Graduate Institute of Statistics
Year of publication: 2022
Academic year of graduation: 110 (ROC calendar)
Language: English
Number of pages: 72
Chinese keywords: 資料壓縮 (data compression), 集群分析演算法 (clustering algorithm), 分割估計法 (partitioning estimate)
English keywords: data compression, k-means algorithm, partitioning estimate
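  • Chinese abstract (translated): Classification problems are ubiquitous in the financial, e-commerce, and medical industries. For example, the financial industry predicts a depositor's credit rating from age, annual income, education, and repayment history, and these credit ratings are categorical variables. The rapid development of deep learning models also reflects the importance of classification problems. On the other hand, with limited computing resources and rapidly growing data volumes, a variety of data reduction methods continue to be proposed. In this thesis, we use the concept of data reduction to develop a predictive model for classification problems, and we demonstrate the proposed method through simulations and real cases.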


    Classification problems are ubiquitous in the financial, telecom, and medical industries. For example, the financial industry predicts a depositor's credit rating from input variables such as age, annual income, education, and repayment history, where the response is qualitative. A growing number of deep learning models are being developed for such purposes, reflecting the importance of classification problems. On the other hand, as data sizes grow rapidly while computing resources remain limited, various data reduction methods have been proposed. In this thesis, we use the concept of data reduction to develop a classification predictor. We illustrate the proposed method through simulations and real examples.

    Contents
    Chinese Abstract
    Abstract
    Contents
    List of Figures
    List of Tables
    1 Introduction
    2 Literature Review
    3 Methodology
        3.1 Supercompress
        3.2 PEC
    4 Simulation
        4.1 Supercompress vs. SRS
        4.2 Predictive Efficiency under Five Models
        4.3 Other Criteria
        4.4 PEC vs. KNN with Different k Value
    5 Real Applications
        5.1 Small Data
        5.2 Big Data
    6 Conclusion
    References

