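Graduate student: 張哲瑋
Che-wei Chang
Thesis title: 針對文字分類的支援向量導向樣本選取
Support Vector Oriented Instance Selection for Text Classification
Advisors: 李俊賢
Chun-shien Li
蔡志豐
Chih-Fong Tsai
Oral examination committee:
Degree: Master
Department: College of Management - Department of Information Management
Academic year of graduation: 99
Language: Chinese
Number of pages: 56
Keywords (translated from Chinese): machine learning, support vector machines, text classification, data reduction, instance selection
Keywords (English): support vector machines, machine learning, text classification, data reduction, instance selection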
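  • Instance selection is a technique in the field of data mining, yet despite today's continually growing data volumes, few studies focus on it. This study proposes an instance selection algorithm, called SVOIS, that is developed from the concept of the Support Vector Machine (SVM).
    SVOIS targets instance selection for text classification and is compared with several well-known instance selection algorithms: ENN, IB3, ICF, and DROP3. The choice of classifiers also differs from those methods: this thesis uses not only k-NN as a classifier but also SVM, a binary classifier, as a basis for comparison. Because SVM training often takes a long time, and that time grows with the number of instances, we expect SVOIS not only to help SVM but possibly also to help k-NN more than the other instance selection algorithms do.
    Finally, experiments are conducted on binary-class text datasets, and each of the other algorithms is also implemented for comparison, in order to verify that SVOIS performs better than they do. The experimental results show that, on text datasets, SVOIS yields higher accuracy after instance selection than the other algorithms and also reduces the amount of data.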


    Because the volume of online information is increasing rapidly, instance selection has become one of the major techniques for managing text data. In this paper, a novel instance selection method, Support Vector Oriented Instance Selection (SVOIS), is proposed for text classification.
    SVOIS attempts to find the support vectors in the original feature space through a linear regression plane, where the instances to be selected as support vectors need to satisfy two criteria. The first criterion is that the distance between an original instance and its class center must be smaller than a pre-defined value. The instances fulfilling this criterion are regarded as the regression data and are used to identify a regression plane. The second criterion is based on the distances between the regression data and the regression plane, which play a role similar to the margin of SVM. In particular, these distances need to be larger than a pre-defined value, and the regression data fulfilling this criterion are called support vectors and are used for classifier training and classification. More specifically, these two pre-defined distance values should be neither so large that all instances are selected nor so small that very few support vectors remain.
    In particular, this paper compares SVOIS with four state-of-the-art algorithms: ENN, IB3, ICF, and DROP3. The experimental results on the TechTC-100 dataset show that SVOIS allows SVM and k-NN to provide similar or better classification accuracy than the baseline without instance selection, and that it also outperforms the state-of-the-art algorithms in terms of effectiveness and efficiency.
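
    The two selection criteria described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the regression plane is fitted, so this sketch assumes a least-squares fit of +1/-1 class labels on the features for a two-class problem. The function name `svois_sketch` and the parameters `center_radius` and `plane_margin` (standing in for the two pre-defined values) are hypothetical.

```python
import numpy as np


def svois_sketch(X, y, center_radius, plane_margin):
    """Return indices of instances kept by the two criteria (illustrative only)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Criterion 1: keep instances whose distance to their own class
    # center is smaller than center_radius (the "regression data").
    near = np.zeros(len(X), dtype=bool)
    for label in np.unique(y):
        mask = y == label
        center = X[mask].mean(axis=0)
        dist = np.linalg.norm(X[mask] - center, axis=1)
        near[mask] = dist < center_radius
    reg_idx = np.flatnonzero(near)
    if reg_idx.size == 0:
        return reg_idx

    # ASSUMPTION: the regression plane w.x + b = 0 is obtained by
    # least squares on -1/+1 targets (two classes only).
    first_label = np.unique(y)[0]
    targets = np.where(y[reg_idx] == first_label, -1.0, 1.0)
    A = np.hstack([X[reg_idx], np.ones((reg_idx.size, 1))])
    coef, *_ = np.linalg.lstsq(A, targets, rcond=None)
    w, b = coef[:-1], coef[-1]
    norm = np.linalg.norm(w)
    if norm == 0.0:
        # Degenerate plane: fall back to the regression data.
        return reg_idx

    # Criterion 2: keep regression data whose distance to the plane
    # is larger than plane_margin (the selected "support vectors").
    plane_dist = np.abs(X[reg_idx] @ w + b) / norm
    return reg_idx[plane_dist > plane_margin]
```

    The returned indices would then be used to subset the training set before fitting SVM or k-NN. As the abstract notes, both thresholds need tuning: a large `center_radius` admits every instance as regression data, while a large `plane_margin` leaves very few support vectors.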

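    Abstract (Chinese) i
    Abstract ii
    Table of Contents iii
    List of Figures v
    List of Tables vi
    Chapter 1 Introduction 1
      1.1 Research Background 1
      1.2 Research Motivation and Objectives 1
      1.3 Research Scope 2
      1.4 Research Contributions 2
      1.5 Thesis Organization 3
    Chapter 2 Literature Review 5
      2.1 Instance Selection 5
      2.2 Edited Nearest Neighbor 5
      2.3 Instance-Based Learning 6
      2.4 Iterative Case Filtering 7
      2.5 Decremental Reduction Optimization Procedure 8
      2.6 Discussion 10
    Chapter 3 Support Vector Oriented Instance Selection 11
      3.1 Stage 1 11
      3.2 Stage 2 12
      3.3 Stage 3 13
      3.4 Stage 4 14
      3.5 Detailed SVOIS Algorithm 15
      3.6 SVOIS in the Multi-class Case 16
    Chapter 4 Experimental Results 17
      4.1 Experimental Design 17
      4.2 Results 18
    Chapter 5 Conclusion 30
      5.1 Research Contributions 30
      5.2 Future Work 31
    References 32
    Appendix 1 35
    Appendix 2 38
    Appendix 3 41
    Appendix 4 44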

