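| Graduate student: | 詹欣逸 Hsin-Yi Chan |
|---|---|
| Thesis title: | 利用WordNet判斷字詞包含關係─應用於動態階層文件分群 (Using WordNet to Infer Containment Relationship: Applied to Dynamic Hierarchical Clustering) |
| Advisor: | 林熙禎 Shi-Jen Lin |
| Oral examination committee: | |
| Degree: | 碩士 Master |
| Department: | 管理學院 - 資訊管理學系 Department of Information Management |
| Year of publication: | 2013 |
| Academic year of graduation: | 101 |
| Language: | Chinese |
| Number of pages: | 73 |
| Chinese keywords: | 字詞包含關係 (containment relationship), 動態分群 (dynamic clustering), 分類學 (taxonomy), 階層分群 (hierarchical clustering) |
| Foreign-language keywords: | Containment relationship |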
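In the current and future era of information explosion, both enterprises and individuals need methods to organize information effectively. This study aims to cluster documents dynamically into a hierarchy so that users can organize, browse, and search the massive amount of information that accumulates over time. The approach combines clustering, classification, and taxonomy. We build on the dynamic clustering framework of Dynamic Hierarchical Clustering Based on Taxonomy (DHCT) and improve its internal methods. First, we change the document similarity computation to reduce clustering complexity. Second, for the taxonomy, we propose a method that judges containment relationships from the path between two terms in WordNet. Experiments on the MeSH_2011 dataset show that this method builds taxonomies effectively, and that the taxonomy-based approach successfully produces hierarchical cluster labels with parent-child containment, resolving the bottleneck that hierarchical clustering cannot generate meaningful labels. In addition, we improve the Conditional Probability (CP) computation that DHCT proposed for inferring containment relationships between terms, and find that adding an appropriate label does help CP correctly infer containment relationships for polysemous terms. Combining WordNet and Google to build the taxonomy not only raises overall accuracy but also overcomes the limited vocabulary of WordNet itself. DHCT has already solved many problems reported in the literature and has been shown to outperform the traditional UPGMA and BKM hierarchical clustering methods. With our improvements, clustering complexity drops from O(nm²) to O(nm), and experiments on a Wikipedia dataset show an improvement of about 20% in F1 score over DHCT, yielding a hierarchical clustering structure that is more accurate and more helpful for browsing.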
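The abstract does not spell out how a WordNet path is turned into a containment decision, so the following is only a minimal sketch under one assumed rule: a term contains another if it appears as a proper ancestor on that term's hypernym path. The `HYPERNYM` map is a hypothetical toy stand-in for WordNet, and the function names are illustrative, not taken from the thesis.

```python
# Toy hypernym map (child -> parent); a hypothetical stand-in for WordNet.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def hypernym_path(term):
    """Return the chain from `term` up to the root, with `term` first."""
    path = [term]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def contains(broader, narrower):
    """Assumed rule: `broader` contains `narrower` if it is a proper ancestor."""
    return broader != narrower and broader in hypernym_path(narrower)


if __name__ == "__main__":
    print(hypernym_path("dog"))
    print(contains("mammal", "dog"), contains("cat", "dog"))
```

Under this assumed rule, sibling terms such as "cat" and "dog" share ancestors but neither contains the other, which is what keeps them on separate branches of the taxonomy.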
The number of text documents managed on business and personal computers continues to grow at an amazing speed. We need an efficient way to organize, manage, access, search, and browse such large document repositories. One popular technique is dynamic hierarchical clustering, which is the focus of this study. Our framework is based mainly on Dynamic Hierarchical Clustering Based on Taxonomy (DHCT), which combines clustering, classification, and taxonomy techniques, and we improve its taxonomy method. To reduce time complexity, we use Ward's minimum variance and the Normalized Google Distance (NGD) to calculate document similarity. We also propose two methods for inferring containment relationships between terms when building a taxonomy. The first, CR, makes use of term paths in WordNet. The second, CP+Label, improves the Conditional Probability (CP) measure proposed by DHCT by adding an appropriate label when a term is polysemous. The resulting taxonomies are then used as cluster labels to make browsing and searching easier for users. DHCT has been shown to outperform the traditional UPGMA and BKM methods, and our experimental results on MeSH_2011 show that both proposed methods produce a meaningful taxonomy and outperform DHCT as well. Moreover, by merging the taxonomies constructed from WordNet and Google, our method not only improves the overall F1 score on a Wikipedia text collection by about 20% but also overcomes the limits of using WordNet alone.
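For readers unfamiliar with NGD, the sketch below computes it from raw hit counts using the standard definition by Cilibrasi and Vitanyi. How the thesis combines NGD with Ward's minimum variance is not described in the abstract, so only the distance itself is shown; the parameter names (`fx`, `fy`, `fxy`, `n`) and the choice to return infinity when a count is zero are assumptions made for illustration.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy: number of pages containing term x and term y respectively.
    fxy:    number of pages containing both terms.
    n:      total number of indexed pages (normalizing constant).
    """
    # Assumption: terms that never occur, or never co-occur, are
    # treated as infinitely far apart.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formula.
    print(ngd(1000, 100, 10, 1_000_000))
```

A value near 0 means the two terms almost always appear together, while larger values indicate weaker association.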