| Graduate Student: | 謝孟錫 Meng-Sei Hsieh |
|---|---|
| Thesis Title: | 分徑指標在建立決策樹的比較 (A Comparison of Splitting Indices in Decision Tree Construction) |
| Advisor: | 王丕承 Pe-Cheng Wang |
| Committee Members: | |
| Degree: | 碩士 Master |
| Department: | 管理學院 College of Management - 工業管理研究所 Graduate Institute of Industrial Management |
| Graduation Academic Year: | 90 (ROC calendar, i.e., 2001) |
| Language: | Chinese |
| Pages: | 67 |
| Keywords (Chinese): | 資料探勘 (data mining), 分類 (classification), 決策樹 (decision tree), 分徑指標 (splitting index) |
| Keywords (English): | splitting index, data mining, classification, decision tree |
Abstract

Data mining has attracted great attention in recent years. Simply put, it refers to methods for extracting important or interesting information from very large volumes of data, and its goals fall broadly into two categories: classification and prediction, and clustering. Among classification methods, the decision tree is one of the most widely accepted. It exploits algorithms and the convenience of computing to display, in tree form, how the data are influenced by each variable, thereby achieving classification and overcoming the inability of traditional statistical analysis to examine very large data sets in full.

However, the existing literature concentrates almost exclusively on the algorithms for building decision trees. In fact, once an algorithm has been chosen, how the splitting attribute and splitting point are selected is the key issue. In view of this, this study adopts the BOAT algorithm, which builds trees efficiently, and applies methods for mining significant association rules to the splitting indices used in decision tree construction. It compares the prediction accuracy of several indices, including the Gini index, Entropy, λ, Rule interest, and Laplace, and uses differences in the number of nodes, maximum depth, and average depth to examine how the resulting trees differ in form.
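Of the five splitting indices compared above, the Gini index and Entropy are the two most widely used in decision tree construction. A minimal Python sketch of how a candidate binary split might be scored with them (the label lists, split, and function names below are hypothetical illustrations, not data or code from the thesis):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_score(left, right, impurity):
    """Size-weighted impurity of a binary split; a lower score is a better split."""
    n = len(left) + len(right)
    return (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)

# Hypothetical class labels falling on each side of a candidate split point
left = ["yes", "yes", "yes", "no"]
right = ["no", "no", "yes"]

print(round(split_score(left, right, gini), 4))     # 0.4048
print(round(split_score(left, right, entropy), 4))  # 0.8571
```

In practice, an algorithm such as BOAT would evaluate a score like this for every candidate splitting attribute and splitting point, and keep the split with the lowest weighted impurity.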
References
BICKEL, P., RITOV, Y. and STOKER, T. (2001). Tailor-made tests for goodness of fit for semiparametric hypotheses. Unpublished manuscript.
BREIMAN, L. (2001). Statistical modeling: The two cultures. Statist. Sci. 16(3) 199-231.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R. and STONE, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
BRIN, S., MOTWANI, R. and SILVERSTEIN, C. (1997). Beyond market baskets: Generalizing association rules to correlations. In Proceedings of ACM SIGMOD Conference on Management of Data 265-276. Tucson, Arizona.
BRIN, S., MOTWANI, R., ULLMAN, J. and TSUR, S. (1997). Dynamic itemset counting and implication rules for market basket data. In Proceedings of ACM SIGMOD Conference on Management of Data 255-264. Tucson, Arizona.
CATLETT, J. (1991). Megainduction: Machine Learning on Very Large Databases. PhD thesis, Sydney Univ.
CHAN, P. and STOLFO, S. (1993). Experiments on multistrategy learning by meta-learning. In Proceedings of the Second International Conference on Information and Knowledge Management 314-323. Washington, DC.
CLARK, P. and BOSWELL, R. (1991). Rule induction with CN2: Some recent improvements. In Proceedings of the Fifth European Working Session on Learning 151-163. Springer, Berlin.
GEHRKE, J., GANTI, V., RAMAKRISHNAN, R. and LOH, W. Y. (1998). RainForest - A framework for fast decision tree construction of large datasets. In Proceedings of the 1998 VLDB Conference 416-427. New York.
GEHRKE, J., GANTI, V., RAMAKRISHNAN, R. and LOH, W. Y. (1999). BOAT - Optimistic decision tree construction. In Proceedings of the 1999 SIGMOD Conference 169-180. Philadelphia, Pennsylvania.
GOODMAN, L. and KRUSKAL, W. (1959). Measures of association for cross classifications. II: Further discussion and references. J. Amer. Statist. Assoc. 54 123-163.
MEHTA, M., AGRAWAL, R. and RISSANEN, J. (1996). SLIQ: A fast scalable classifier for data mining. In Proceedings of the Fifth EDBT Conference 18-32. Avignon, France.
MEHTA, M., AGRAWAL, R. and SHAFER, J. (1996). SPRINT: A scalable parallel classifier for data mining. In Proceedings of the 1996 VLDB Conference 544-555. Mumbai (Bombay), India.
PIATETSKY-SHAPIRO, G. (1991). Discovery, analysis and presentation of strong rules. In Knowledge Discovery in Databases 229-248. AAAI/MIT Press, Menlo Park, California.
QUINLAN, J. (1979). Induction over large data bases. Technical Report 79-14, Dept. of Computer Science, Stanford Univ.
QUINLAN, J. (1986). Induction of decision trees. Machine Learning 1 81-106.
STONE, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B 36 111-147.
TAN, P. and KUMAR, V. (2000). Interestingness measures for association patterns: A perspective. Technical Report 00-36, Dept. of Computer Science, Univ. of Minnesota.
WEISS, S. and KULIKOWSKI, C. (1991). Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann.