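Graduate student: 楊鎧謙 (Kai-Qian Yang)
Thesis title: On Large-Scale Multi-Label Classification for POI Tagging
Advisor: 張嘉惠 (Chia-Hui Chang)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: Chinese
Number of pages: 30
Keywords (Chinese): machine learning, multi-class classification, imbalanced data, point of interest
Keywords (English): Machine Learning, Multi-Label Classification, Unbalanced Data, Point of Interest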
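  • In recent years smart handheld devices have spread rapidly, to the point that almost everyone owns one. Advances in transportation have also greatly increased how often people travel, so they find themselves in unfamiliar places more often. Finding points of interest in an unfamiliar environment is not easy, so an electronic map system is needed for lookup. Name search alone is not enough, because users may not know the exact names of these points; they may only want to find points of a particular type. A good electronic map therefore needs to provide category search.
    To provide category search, we need to classify every point in the system. Because the system holds many records, each with one or more categories, this is a large-scale multi-label classification problem. Map data can usually be categorized in many ways; we use the scheme of 中華黃頁 (the Chinese Yellow Pages). The categories have two levels: 29 level-1 categories and 1,287 level-2 categories. With this many categories and records, the usual approach requires training many classifiers, which greatly increases training and testing time. We reduce the dimensionality of the category space to speed up training and testing.
    Experiments show that the hybrid KDE+SVM model is almost twice as fast as standard SVM classification in both training and testing. It reaches a Micro-F1 of 0.813 on the 29 level-1 categories and 0.718 on the level-2 categories, only slightly below the SVM's Micro-F1 of 0.842 on level 1 and 0.783 on level 2. Because the data are imbalanced, we also compared reweighting and downsampling in an attempt to improve performance, but the results show that these two methods have little visible effect on large-scale data.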
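    The abstract names reweighting and downsampling as remedies for imbalanced data but does not say how either was configured. The sketch below shows one common form of each for a single binary (one-vs-rest) label: inverse-frequency class weights, and random downsampling of every class to the size of the smallest one. The function names, the weight formula, and the toy data are assumptions for illustration, not the thesis's actual procedure.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive larger weights. This particular formula is an
    assumption; the abstract does not state which weighting was used.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def downsample(samples, labels, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):
            balanced.append((x, y))
    return balanced


# Hypothetical toy data: 2 positive and 8 negative examples for one label.
labels = [1] * 2 + [0] * 8
samples = list(range(10))

weights = inverse_frequency_weights(labels)
balanced = downsample(samples, labels)
```

    With this toy split the positive class gets weight 2.5 and the negative class 0.625, and downsampling keeps two examples of each. Downsampling discards most of the majority class, which is one reason its benefit can shrink when plenty of data is already available.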


    In recent years, mobile devices have become increasingly popular, and convenient transportation makes people more likely to visit unfamiliar places. Finding a point of interest in an unfamiliar place is not easy, so users need an electronic map system. Name search alone is not enough, because users may not know the exact name of a point; they may only want to find points of a specific category. A good electronic map system therefore needs to provide a category search service.
    To provide category search, we need to classify all the points in the system. Because the system contains many points and each point has one or more categories, this is a large-scale multi-label classification problem. There are many ways to define categories; we follow those defined by the Chinese Yellow Pages. The categories are organized in two levels, with 29 categories in level 1 and 1,287 in level 2. Because the numbers of points and categories are large, training classifiers and testing data take a long time. We reduce the dimension of the category space to speed up training and testing.
    In our experiments, the training and testing times of our method are shorter than those of general SVM classification. Its Micro-F1 is 0.813 on level 1 and 0.718 on level 2, both slightly lower than the SVM's Micro-F1 of 0.842 on level 1 and 0.783 on level 2. We also tried reweighting and downsampling to improve performance, but neither worked well on large-scale data.
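    The results above are reported as Micro-F1, which pools true positives, false positives, and false negatives over all points and all labels before computing a single F1 score. A minimal sketch of that computation for multi-label predictions follows; the function name and the example category names are hypothetical and do not come from the thesis data.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over per-item label sets.

    Counts are pooled over all items and labels, so frequent labels
    contribute more to the score than rare ones.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not true
        fn += len(truth - pred)   # true labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical POI labels: three points, each with one or more categories.
y_true = [{"restaurant", "cafe"}, {"hotel"}, {"bank"}]
y_pred = [{"restaurant"}, {"hotel", "cafe"}, {"bank"}]

score = micro_f1(y_true, y_pred)  # tp=3, fp=1, fn=1 -> 6 / 8 = 0.75
```

    Because the counts are pooled, a classifier can score well on Micro-F1 while still missing many rare level-2 categories, which is worth keeping in mind when reading scores on imbalanced data.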

    Chinese Abstract
    Abstract
    List of Figures
    List of Tables
    1. Introduction
    1.1 Research Motivation and Objectives
    1.2 Multi-Label Classification
    1.3 Chapter Overview
    2. Related Work
    3. System Architecture
    3.1 Data Preprocessing
    3.2 KDE-based Classification
    3.3 Data Testing
    4. Experimental Results
    4.1 Dataset Description
    4.2 Evaluation Method
    4.3 Experimental Analysis and Discussion
    4.3.1 Effect of the β Value on KDE Results
    4.3.2 Comparison of Training Tools
    4.3.3 Training and Testing Time Results
    4.3.4 Performance of Each Method
    4.3.5 Imbalanced-Data Improvement Experiments for SVM
    4.3.6 Effect of Additional Features
    5. Conclusion and Future Work
    6. References

