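| Graduate student: | Jiun-Hung Tung (童俊宏) |
|---|---|
| Thesis title: | MINT: Mining Frequent Rooted Induced Unordered Tree without Candidate Generation (無候選型樣產生之頻繁樹狀結構探勘) |
| Advisor: | Chia-Hui Chang (張嘉惠) |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering |
| Graduation academic year: | 94 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 36 |
| Chinese keywords: | 子樹 (subtree), 標準型式 (canonical form), 支持度 (support), 頻繁 (frequent), 型樣 (pattern) |
| Foreign-language keywords: | canonical form, subtree, pattern, frequent, support |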
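In the field of data mining, tree mining is an important problem, with applications in web log analysis, bioinformatics, and semi-structured documents. However, previous work in this area first generates candidate patterns and then tests whether each one is frequent, discarding those that are not. This approach spends a great deal of time and space on generating and testing candidates. In this thesis, we therefore use the concept of local frequency to design an algorithm, which we call MINT, that mines rooted, induced, unordered tree structures without generating candidates. We compare MINT with HybridTreeMiner on synthetic datasets produced by a data generator and on real web log data. The experimental results show that, even for a data type as complex as trees, searching for locally frequent items still achieves good performance.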
Tree pattern mining is an important problem in data mining, with many emerging applications including web log analysis, bioinformatics, and semi-structured documents. However, most previous work follows a candidate-generation-and-testing approach: candidate patterns are enumerated from shorter frequent patterns, in the Apriori style, and then tested for frequency. Because this approach costs a lot of time and space in candidate generation and testing, in this thesis we adopt the idea of pattern growth to mine frequent rooted induced unordered trees without candidate generation. In the performance study, we compare our algorithm with HybridTreeMiner on synthetic datasets and real-world application datasets. The experiments show that our algorithm is efficient and cost-effective.
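To make the terms in the abstract concrete, the following is a minimal Python sketch of two ideas it relies on: an order-independent canonical string for a rooted unordered labeled tree, and a single pattern-growth step that extends a one-node pattern only with child labels that are locally frequent under its occurrences. It is not the MINT algorithm itself. The tree encoding, the toy database `DB`, and the function names `canonical`, `nodes`, and `locally_frequent_extensions` are all assumptions made for illustration.

```python
from collections import defaultdict

# A tree is a (label, children) pair, where children is a list of trees.
# Hypothetical toy database, not data from the thesis.
DB = [
    ("A", [("B", []), ("C", [("D", [])])]),
    ("A", [("C", [("D", [])]), ("B", [])]),  # tree 0 with its siblings swapped
    ("A", [("B", []), ("E", [])]),
]

def canonical(tree):
    """Return an order-independent string by sorting the children's strings."""
    label, children = tree
    return label + "(" + ",".join(sorted(canonical(c) for c in children)) + ")"

def nodes(tree):
    """Yield every node of a tree, each represented as its (label, children) pair."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def locally_frequent_extensions(db, root_label, min_support):
    """Count child labels directly under nodes labeled root_label.

    Support is the number of distinct trees containing the parent-child edge.
    Only labels that actually occur under the pattern are counted, so no
    candidate is generated and then rejected for never occurring.
    """
    support = defaultdict(set)  # child label -> ids of trees where it occurs
    for tid, tree in enumerate(db):
        for label, children in nodes(tree):
            if label == root_label:
                for child_label, _ in children:
                    support[child_label].add(tid)
    return {lab: len(tids) for lab, tids in support.items()
            if len(tids) >= min_support}

if __name__ == "__main__":
    print(canonical(DB[0]) == canonical(DB[1]))
    print(locally_frequent_extensions(DB, "A", 2))
```

In this toy database the first two trees differ only in sibling order, so they share one canonical string, and with a minimum support of 2 the one-node pattern `A` can be grown with child `B` or child `C` but not `E`. A full miner would repeat the extension step recursively on each grown pattern.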
[1] R. Agrawal and R. Srikant, Fast algorithms for mining association rules. In proceedings of the 1994 International Conference on Very Large Data Bases (VLDB’94), Sept. 1994, 487-499.
[2] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa, Efficient Substructure Discovery from Large Semi-structured Data. In proceedings of the 2nd SIAM International Conference on Data Mining, April 2002.
[3] T. Asai, H. Arimura, T. Uno, and S. Nakano, Discovering Frequent Substructures in Large Unordered Trees. In proceedings of the 6th International Conference on Discovery Science, October 2003.
[4] Y. Chi, Y. Yang, and R. R. Muntz, Indexing and Mining Free Trees. In proceedings of the 3rd IEEE International Conference on Data Mining (ICDM’03), November 2003.
[5] Y. Chi, Y. Yang, and R. R. Muntz, HybridTreeMiner: An Efficient Algorithm for Mining Frequent Rooted Trees and Free Trees Using Canonical Forms. In proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM’04), June 2004.
[6] Y. Chi, Y. Yang, and R. R. Muntz, Canonical Forms for Labeled Trees and Their Applications in Frequent Subtree Mining. Journal of Knowledge and Information Systems (KAIS), August 2005, 203-234.
[7] Y. Chi, Y. Yang, Y. Xia, and R. R. Muntz, CMTreeMiner: Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees. IEEE Transactions on Knowledge and Data Engineering, 17(2), February 2005.
[8] J. Han, J. Pei, Y. Yin, and R. Mao, Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Journal of Data Mining and Knowledge Discovery, 8(1), 53-87, 2004.
[9] K. Y. Huang, C. H. Chang and K. Z. Lin, PROWL: An efficient frequent continuity mining algorithm on event sequences. In proceedings of 6th International Conference on Data Warehousing and Knowledge Discovery (DaWak), 2004.
[10] S. Nijssen and J. N. Kok, Efficient Discovery of Frequent Unordered Trees. 1st International Workshop on Mining Graphs, Trees and Sequences, 2003.
[11] H. Tan, T. S. Dillon, F. Hadzic, E. Chang, and L. Feng, IMB3-Miner: Mining Induced/Embedded Subtrees by Constraining the Level of Embedding. In proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2006), 450-461, April 9-12, 2006.
[12] C. Wang, M. Hong, J. Pei, H. Zhou, W. Wang, and B. Shi, Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining. In proceedings of PAKDD, 2004.
[13] Y. Xiao, J. F. Yao, Z. Li, and M. H. Dunham, Efficient Data Mining for Maximal Frequent Subtrees. In proceedings of the 3rd IEEE International Conference on Data Mining, 2003.
[14] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. In proceedings of International Conference on Data Mining (ICDM), 2001.
[15] J. Pei, J. Han, B. M. Asl, and H. Pinto, PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In proceedings of 17th International Conference on Data Engineering (ICDE), 2001.
[16] J. Punin, M. Krishnamoorthy, M. Zaki, LOGML: Log markup language for web usage mining. In WEBKDD Workshop (with SIGKDD), August 2001.
[17] Y. Xiao, J. F. Yao, and G. Yang, Discovering Frequent Embedded Subtree Patterns from Large Databases of Unordered Labeled Trees. International Journal of Data Warehousing and Mining (IJDWM), 1(2), 44-66, April-June 2005.
[18] M. J. Zaki, C. C. Aggarwal, XRules: An Effective Structural Classifier for XML Data. In proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2003.
[19] M. J. Zaki, Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications. IEEE Transactions on Knowledge and Data Engineering, 17(8), 1021-1035, August 2005.
[20] M. J. Zaki, Efficiently Mining Frequent Embedded Unordered Trees. Fundamenta Informaticae, 2005.