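Dynamic Hierarchical Clustering Based on Taxonomy (基於字詞關係動態建立階層分群) | National Central University Electronic Theses and Dissertations System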


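Graduate student: Hsin-fu Chen (陳信夫)
Thesis title: Dynamic Hierarchical Clustering Based on Taxonomy (基於字詞關係動態建立階層分群)
Advisor: Shi-jen Lin (林熙禎)
Committee members:
Degree: Master
Department: College of Management - Department of Information Management
Academic year of graduation: 99 (ROC calendar, i.e., 2010-2011)
Language: Chinese
Number of pages: 58
Keywords (translated from Chinese): Hierarchical clustering algorithm, dynamic clustering algorithm, taxonomy, document clustering
Keywords (English): Dynamic clustering algorithm, Hierarchical clustering, Taxonomy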

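With the arrival of the information explosion, more and more users search the Internet for relevant material to read. This study aims to apply hierarchical clustering to large document collections and to build, from term relationships, a taxonomy with parent-child containment relations that serves as the labels of the hierarchical clusters. In practice, this lets users quickly see which topics a document collection covers and select the documents on the topic they need. The proposed system architecture effectively improves on five issues in hierarchical clustering research: high-dimensional vectors, dynamic feature selection and document clustering, document processing order, cross-domain document clustering, and the relationships among cluster labels.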

With the growing popularity of the Internet, the World Wide Web holds an enormous amount of information, and finding relevant information in a large number of texts has become a challenge for users. Hierarchical clustering is one way to address this problem, because it lets users browse topics level by level and find the documents most relevant to their interests. However, several challenges in hierarchical clustering remain to be addressed: the high dimensionality of the data, dynamic data sets, sensitivity to input order, documents that cover several concepts, and the relationship between clusters and tags.
In this thesis, we propose a dynamic hierarchical clustering approach based on taxonomy to address these challenges. The experimental results show that our method is not only suitable for building hierarchical clusters over dynamic data sets, but also offers a structure that is easier to browse than those of the traditional algorithms BKM and UPGMA. In addition, the clusters are labeled with meaningful tags linked by containment relationships, which lets users quickly grasp the overall concept of each cluster.

Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1 Research Motivation
2 Research Objectives
3 Research Method
4 Thesis Organization
Chapter 2 Literature Review
1 Feature Selection
1.1 Term Frequency (TF)
1.2 Term Frequency-Inverse Document Frequency (TF-IDF)
1.3 Frequent Itemset
1.4 Mutual Information
1.5 Normalized Google Distance (NGD)
2 Clustering Algorithms
2.1 Partitional Clustering Algorithms
2.2 Agglomerative Hierarchical Clustering
2.3 Divisive Hierarchical Clustering
3 Taxonomy
3.1 Lexico-syntactic Patterns
3.2 Machine-readable Dictionaries
3.3 Information Theory
4 Summary
Chapter 3 System Design and Architecture
1 System Architecture
2 Data Preprocessing
2.1 Part-of-speech and word combination
2.2 The length of the word
2.3 The number of Google search results
2.4 NGD Calculate
2.5 Ranking and Filtering
3 Document Concept Clustering
3.1 Updated Beta-similarity Graph
3.2 Updated Max-S Graph
3.3 Updated Star Cover
4 Taxonomy Construction
4.1 NGD Calculate
4.2 Conditional Probability Calculate
4.3 BTRank
5 Hierarchical Document Clustering
Chapter 4 Experimental Results and Discussion
1 Datasets
1.1 Wikipedia
1.2 MeSH (Medical Subject Headings)
1.3 Painters and Paintings
1.4 Mapping Between Datasets and Experiments
2 Evaluation Methods
2.1 F1 score
2.2 Fβ score
2.3 FCubed
3 Data Preprocessing Results
4 Taxonomy Construction Results
5 Document Concept Clustering and Hierarchical Document Clustering Results
6 Hierarchical Structure Analysis
7 System Performance Analysis
7.1 Time Complexity
7.2 Overall System Time Analysis
Chapter 5 Conclusions and Future Work
1 Conclusions
2 Future Work
References
Chinese-language sources
English-language sources
Web sources
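
The outline above lists Normalized Google Distance (NGD) in the literature review and again in the preprocessing and taxonomy-construction steps, but this record does not show how the thesis computes it. As a minimal sketch, assuming the standard definition published by Cilibrasi and Vitányi (2007), NGD can be computed from search hit counts roughly as follows. The function name `ngd`, its parameters, the treatment of zero co-occurrence, and the hit counts in the usage lines are all hypothetical and are not taken from the thesis.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance from hit counts (standard definition).

    fx, fy: number of pages containing term x / term y
    fxy:    number of pages containing both terms
    n:      assumed total number of indexed pages (a normalizing constant)
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term needs a positive hit count")
    if n <= max(fx, fy):
        raise ValueError("n must exceed both single-term hit counts")
    if fxy <= 0:
        # Assumption of this sketch: terms that never co-occur are
        # treated as infinitely far apart.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical hit counts, for illustration only.
print(ngd(1000, 100, 10, 10**6))     # terms that co-occur only sometimes
print(ngd(1000, 1000, 1000, 10**6))  # terms that always co-occur
```

Smaller values indicate more closely related terms. The ratio does not depend on the logarithm base, so natural logarithms are used here for convenience.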

                                

