| Graduate student: | 陳詩姍 Shi-Shan Chen |
|---|---|
| Thesis title: | 基於圖形的平行化最小生成樹分群演算法 (A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters) |
| Advisor: | 王尉任 Wei-Jen Wang |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering & Computer Science, Department of Computer Science & Information Engineering |
| Graduation academic year: | 100 (2011) |
| Language: | Chinese |
| Pages: | 73 |
| Chinese keywords: | MPI, parallel computing, clustering algorithm, minimum spanning tree |
| Foreign keywords: | MPI, Minimum Spanning Tree, Clustering Algorithm, Parallel Computing |
With the rapid development of technology, everyday life constantly generates large volumes of data that await processing and consolidation, and the computation time and storage space required to handle such volumes grow dramatically. To obtain the genuinely needed information within reasonable time and space, this study proposes graph-based parallel minimal-spanning-tree clustering algorithms that automatically detect the best number of clusters and, by distributing the data in parallel, effectively improve execution time and storage space on large datasets. On the premise of spreading the computation evenly over the data, this study proposes two computing architectures implemented with the Message Passing Interface (MPI) library; they build the minimum spanning tree in parallel across processes and apply the graph-based clustering method (GBC method) to cluster the data. The two methods are the Parallel Clustering based on Partitions of Local Minimal-spanning-trees (Para-CPLM) and the Parallel Clustering based on a Global Minimal-spanning-tree (Para-CGM).
Para-CPLM distributes all points of the graph evenly among multiple processes: it cuts the data domain into blocks along each dimension, builds a local minimal spanning tree in each block, connects the local trees by their shortest inter-block distances, and applies the GBC method for a first clustering. It then examines the distances between the resulting clusters and re-establishes the shortest inter-cluster links that satisfy the MST conditions. Para-CGM instead uses a preset threshold to filter out data that need not be stored, and distributes the filtered data to the processes in parallel, thereby reducing both computation time and storage space.
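The threshold filtering behind Para-CGM can be illustrated with a small sequential sketch. The function names are hypothetical, a single loop stands in for the MPI processes that would each handle a share of the points, and Kruskal's algorithm is used here to build the MST over the surviving edges:

```python
import math

def filtered_edges(points, threshold):
    """Para-CGM-style filtering: keep only pairwise edges no longer than
    the preset threshold, so long edges are never stored or sorted.
    A single loop here stands in for the per-process work in MPI."""
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d <= threshold:          # drop long edges up front
                edges.append((d, i, j))
    return edges

def mst_from_edges(n, edges):
    """Kruskal's algorithm with union-find over the surviving edges."""
    root = list(range(n))
    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]    # path compression
            x = root[x]
        return x
    tree = []
    for d, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:                   # edge joins two components
            root[ri] = rj
            tree.append((d, i, j))
    return tree
```

If the threshold is set too low the filtered graph can disconnect, so the threshold must be large enough to retain every true MST edge; choosing it well is exactly what makes the filtering safe in practice.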
The experimental results show that, compared with the GBC method, Para-CPLM improves execution time very significantly, and its clustering results are mostly identical to those of the GBC method. Given sufficient computing resources, Para-CGM does not accelerate with the number of computing resources (e.g., the number of CPUs) as effectively as Para-CPLM, but it still runs faster than the GBC method and produces exactly the same clustering results.
Recently, the efficiency of clustering algorithms for large-scale data analysis has become a major issue in many application domains. First, the computation time and storage space increase dramatically when processing a large amount of data. Second, it is hard to determine the number of clusters automatically. In order to cluster datasets within reasonable computation time and storage space, we propose a parallel computing strategy that uses the concepts of graph-based clustering and granular computing. The proposed strategy automatically determines the best number of clusters of a large dataset, and effectively reduces the computation time and storage space requirements for a large amount of data. Based on the proposed strategy, we devise two clustering algorithms and implement them with the Message Passing Interface (MPI). Both algorithms utilize the minimum-spanning-tree structure to determine the best number of clusters automatically. The first algorithm is called Para-CPLM (Parallel Clustering based on Partitions of Local Minimal-spanning-trees); the second is called Para-CGM (Parallel Clustering based on a Global Minimal-spanning-tree).
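The idea shared by both algorithms, cutting unusually long edges out of a minimum spanning tree so that the number of clusters falls out of the data itself, can be sketched in a few lines. The `mean + c·std` cut-off below is a common edge-inconsistency heuristic used purely for illustration, not necessarily the exact criterion of the GBC method:

```python
import math

def prim_mst(points):
    """Build an MST over points (coordinate tuples) with Prim's
    algorithm; returns edges as (length, i, j) index triples."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n          # shortest distance from tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def cluster_by_mst(points, c=2.0):
    """Cut MST edges much longer than typical (mean + c*std); connected
    components of the remaining forest are the clusters, so the number
    of clusters is detected rather than supplied."""
    edges = prim_mst(points)
    lengths = [e[0] for e in edges]
    mean = sum(lengths) / len(lengths)
    std = (sum((l - mean) ** 2 for l in lengths) / len(lengths)) ** 0.5
    keep = [e for e in edges if e[0] <= mean + c * std]
    root = list(range(len(points)))     # union-find over kept edges
    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x
    for _, i, j in keep:
        root[find(i)] = find(j)
    return [find(i) for i in range(len(points))]
```

Applied to two well-separated groups of points, the single long bridging edge of the MST is removed and two components remain, with no cluster count given in advance.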
Para-CPLM partitions the data domain into several blocks (hyper-rectangles) according to the dimensions of the dataset, distributes the data points to the blocks in parallel, and then builds a local minimal-spanning-tree in each block. Once the local trees are built, it joins them along the smallest Euclidean distances and applies the GBC method to cluster the combined tree. After this first clustering, it checks the distances between clusters and finds the smallest Euclidean distance among them that conforms to the rules of a minimal-spanning-tree. Para-CGM constructs a global minimal-spanning-tree in parallel and applies the GBC method to find the best number of clusters; it requires a preset threshold in order to process the data efficiently.
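The Para-CPLM construction described above can be sketched sequentially as follows. The per-block MSTs are the work the thesis distributes to MPI processes; the split along a single coordinate and the function names are simplifications for illustration only:

```python
import math

def local_mst(idx, points):
    """Prim's algorithm restricted to the vertex subset idx; returns
    (i, j) index pairs into the global point list."""
    in_tree = {idx[0]}
    edges = []
    while len(in_tree) < len(idx):
        _, i, j = min(
            (math.dist(points[u], points[v]), u, v)
            for u in in_tree for v in idx if v not in in_tree
        )
        in_tree.add(j)
        edges.append((i, j))
    return edges

def para_cplm_tree(points, n_blocks=2):
    """Sketch of the Para-CPLM construction, run sequentially:
    1. cut the data domain into blocks (here along the first coordinate),
    2. build a local MST inside each block (the step the thesis
       distributes across MPI processes),
    3. join neighbouring local trees with the single shortest
       inter-block edge."""
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    size = -(-len(points) // n_blocks)            # ceiling division
    blocks = [order[k:k + size] for k in range(0, len(order), size)]
    edges = []
    for b in blocks:
        edges += local_mst(b, points)
    for a, b in zip(blocks, blocks[1:]):          # connect adjacent blocks
        edges.append(min(
            ((u, v) for u in a for v in b),
            key=lambda e: math.dist(points[e[0]], points[e[1]]),
        ))
    return edges                                  # spanning tree: n - 1 edges
```

Joining blocks by their single closest edge keeps the result a spanning tree; the thesis's subsequent re-check of inter-cluster distances then restores any edge the block-wise construction got wrong relative to the true global MST.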
Our experimental results show that Para-CPLM has a significantly shorter execution time than the GBC method while producing almost the same clustering results. In contrast, Para-CGM improves execution time less than Para-CPLM does when enough computing resources are available to the latter; however, it still outperforms the GBC method and produces exactly the same clustering results.