| Graduate student: | 陳詩姍 Shi-Shan Chen |
|---|---|
| Thesis title: | 基於圖形的平行化最小生成樹分群演算法 (A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters) |
| Advisor: | 王尉任 Wei-Jen Wang |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering & Computer Science, Department of Computer Science & Information Engineering |
| Graduation academic year: | 100 (2011) |
| Language: | Chinese |
| Pages: | 73 |
| Chinese keywords: | MPI, parallel computing, clustering algorithm, minimum spanning tree |
| Foreign keywords: | MPI, Minimum Spanning Tree, Clustering Algorithm, Parallel Computing |
With the rapid development of technology, everyday life constantly generates large volumes of data that await processing and consolidation, and the computation time and storage space required to handle such volumes grow dramatically. To obtain the genuinely needed information within reasonable time and space, this study proposes graph-based parallel minimal-spanning-tree clustering algorithms that automatically detect the best number of clusters and, by distributing the data in parallel, effectively improve execution time and storage space on large datasets. On the premise of spreading the computation evenly over the data, this study proposes two computing architectures implemented with the Message Passing Interface (MPI) library; they build the minimum spanning tree in parallel across processes and apply the graph-based clustering method (GBC method) to cluster the data. The two methods are the Parallel Clustering based on Partitions of Local Minimal-spanning-trees (Para-CPLM) and the Parallel Clustering based on a Global Minimal-spanning-tree (Para-CGM).
Para-CPLM distributes all points of the graph evenly among multiple processes: it cuts the data domain into blocks along each dimension, builds a local minimal spanning tree in each block, connects the local trees by their shortest inter-block distances, and applies the GBC method for a first clustering. It then examines the distances between the resulting clusters and re-establishes the shortest inter-cluster links that satisfy the MST conditions. Para-CGM instead uses a preset threshold to filter out data that need not be stored, and distributes the filtered data to the processes in parallel, thereby reducing both computation time and storage space.
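The threshold filtering behind Para-CGM can be illustrated with a small sequential sketch. The function names are hypothetical, a single loop stands in for the MPI processes that would each handle a share of the points, and Kruskal's algorithm is used here to build the MST over the surviving edges:

```python
import math

def filtered_edges(points, threshold):
    """Para-CGM-style filtering: keep only pairwise edges no longer than
    the preset threshold, so long edges are never stored or sorted.
    A single loop here stands in for the per-process work in MPI."""
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d <= threshold:          # drop long edges up front
                edges.append((d, i, j))
    return edges

def mst_from_edges(n, edges):
    """Kruskal's algorithm with union-find over the surviving edges."""
    root = list(range(n))
    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]    # path compression
            x = root[x]
        return x
    tree = []
    for d, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:                   # edge joins two components
            root[ri] = rj
            tree.append((d, i, j))
    return tree
```

If the threshold is set too low the filtered graph can disconnect, so the threshold must be large enough to retain every true MST edge; choosing it well is exactly what makes the filtering safe in practice.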
The experimental results show that, compared with the GBC method, Para-CPLM improves execution time very significantly, and its clustering results are mostly identical to those of the GBC method. Given sufficient computing resources, Para-CGM does not accelerate with the number of computing resources (e.g., the number of CPUs) as effectively as Para-CPLM, but it still runs faster than the GBC method and produces exactly the same clustering results.
Recently, the efficiency of clustering algorithms for large-scale data analysis has become a major issue in many application domains. First, the computation time and storage space increase dramatically when processing a large amount of data. Second, it is hard to determine the number of clusters automatically. In order to cluster datasets within reasonable computation time and storage space, we propose a parallel computing strategy that uses the concepts of graph-based clustering and granular computing. The proposed strategy automatically determines the best number of clusters of a large dataset, and effectively reduces the computation time and storage space requirements for a large amount of data. Based on the proposed strategy, we devise two clustering algorithms and implement them with the Message Passing Interface (MPI). Both algorithms utilize the minimum-spanning-tree structure to determine the best number of clusters automatically. The first algorithm is called Para-CPLM (Parallel Clustering based on Partitions of Local Minimal-spanning-trees); the second is called Para-CGM (Parallel Clustering based on a Global Minimal-spanning-tree).
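The idea shared by both algorithms, cutting unusually long edges out of a minimum spanning tree so that the number of clusters falls out of the data itself, can be sketched in a few lines. The `mean + c·std` cut-off below is a common edge-inconsistency heuristic used purely for illustration, not necessarily the exact criterion of the GBC method:

```python
import math

def prim_mst(points):
    """Build an MST over points (coordinate tuples) with Prim's
    algorithm; returns edges as (length, i, j) index triples."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n          # shortest distance from tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def cluster_by_mst(points, c=2.0):
    """Cut MST edges much longer than typical (mean + c*std); connected
    components of the remaining forest are the clusters, so the number
    of clusters is detected rather than supplied."""
    edges = prim_mst(points)
    lengths = [e[0] for e in edges]
    mean = sum(lengths) / len(lengths)
    std = (sum((l - mean) ** 2 for l in lengths) / len(lengths)) ** 0.5
    keep = [e for e in edges if e[0] <= mean + c * std]
    root = list(range(len(points)))     # union-find over kept edges
    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x
    for _, i, j in keep:
        root[find(i)] = find(j)
    return [find(i) for i in range(len(points))]
```

Applied to two well-separated groups of points, the single long bridging edge of the MST is removed and two components remain, with no cluster count given in advance.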
Para-CPLM partitions the data domain into several blocks (hyper-rectangles) according to the dimensions of the dataset, distributes the data points to the blocks in parallel, and then builds a local minimal-spanning-tree in each block. Once the local trees are built, it joins them along the smallest Euclidean distances and applies the GBC method to cluster the combined tree. After this first clustering, it checks the distances between clusters and finds the smallest Euclidean distance among them that conforms to the rules of a minimal-spanning-tree. Para-CGM constructs a global minimal-spanning-tree in parallel and applies the GBC method to find the best number of clusters; it requires a preset threshold in order to process the data efficiently.
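The Para-CPLM construction described above can be sketched sequentially as follows. The per-block MSTs are the work the thesis distributes to MPI processes; the split along a single coordinate and the function names are simplifications for illustration only:

```python
import math

def local_mst(idx, points):
    """Prim's algorithm restricted to the vertex subset idx; returns
    (i, j) index pairs into the global point list."""
    in_tree = {idx[0]}
    edges = []
    while len(in_tree) < len(idx):
        _, i, j = min(
            (math.dist(points[u], points[v]), u, v)
            for u in in_tree for v in idx if v not in in_tree
        )
        in_tree.add(j)
        edges.append((i, j))
    return edges

def para_cplm_tree(points, n_blocks=2):
    """Sketch of the Para-CPLM construction, run sequentially:
    1. cut the data domain into blocks (here along the first coordinate),
    2. build a local MST inside each block (the step the thesis
       distributes across MPI processes),
    3. join neighbouring local trees with the single shortest
       inter-block edge."""
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    size = -(-len(points) // n_blocks)            # ceiling division
    blocks = [order[k:k + size] for k in range(0, len(order), size)]
    edges = []
    for b in blocks:
        edges += local_mst(b, points)
    for a, b in zip(blocks, blocks[1:]):          # connect adjacent blocks
        edges.append(min(
            ((u, v) for u in a for v in b),
            key=lambda e: math.dist(points[e[0]], points[e[1]]),
        ))
    return edges                                  # spanning tree: n - 1 edges
```

Joining blocks by their single closest edge keeps the result a spanning tree; the thesis's subsequent re-check of inter-cluster distances then restores any edge the block-wise construction got wrong relative to the true global MST.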
Our experimental results show that Para-CPLM has a significantly shorter execution time than the GBC method while producing almost the same clustering results. In contrast, Para-CGM improves execution time less than Para-CPLM does when enough computing resources are available to the latter; however, it still outperforms the GBC method and produces exactly the same clustering results.