Big Data Mining with Parallel Computing: A Comparison of Distributed and MapReduce Methodologies

Graduate student:	葉貞麟 Chen-lin Yeh
Thesis title:	Big Data Mining with Parallel Computing: A Comparison of Distributed and MapReduce Methodologies
Advisor:	蔡志豐 Chih-fong Tsai
Oral defense committee:
Degree:	Master
Department:	College of Management - Department of Information Management
Year of publication:	2015
Academic year of graduation:	103 (ROC calendar)
Language:	Chinese
Number of pages:	109
Keywords:	Big Data, Data Mining, Distributed Computing, Cloud Computing, Instance Selection

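The big data wave has arrived. Information is now within easy reach in daily life, and data is growing faster than Moore's Law. Harnessing big data has therefore become a major challenge, and the central question is which techniques make the best use of it. Big data is in essence a concept extended from data mining: the aim is to extract the value hidden in massive amounts of data in a timely manner. The spread of the Internet and the development of cloud computing have removed limitations of traditional data mining and can shorten the computation time of big data mining. The data scientist John Rauser defines big data as any amount of data that exceeds the processing capacity of a single computer. On current stand-alone hardware, computation is too slow and storage capacity is too small, so this study improves on the traditional data mining environment and workflow. The study analyzes two computing approaches, a distributed architecture and a cloud MapReduce architecture, which pool computing resources to classify large datasets, thereby expanding storage capacity, adding computing power, and speeding up mining. It also applies instance selection to filter noisy data and reduce data volume, in order to examine whether data preprocessing is a necessary step for big data. Finally, it identifies the architecture and workflow that give the shortest execution time without sacrificing accuracy. The experimental results show that a cloud architecture built on a single large host, with 1 to 20 machines, using an SVM classifier directly without data preprocessing, handles large datasets most efficiently. Four large datasets of up to 500,000 records, taken from the UCI repository and the KDD Cup, are used to demonstrate the effectiveness of the proposed architecture and workflow.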
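As a rough illustration of how instance selection can filter noisy records before classification, the sketch below applies an edited nearest neighbour rule: an instance is kept only if the majority of its k nearest neighbours share its label. This is an assumption made for illustration, not the procedure used in the thesis, whose table of contents lists IB3, DROP3 and GA as the methods studied. The function name, the toy data and the choice k=3 are hypothetical.

```python
from collections import Counter
from math import dist


def edited_nearest_neighbor(points, labels, k=3):
    """Return indices of instances whose label agrees with the majority
    label of their k nearest neighbours (a simple noise filter)."""
    kept = []
    for i, p in enumerate(points):
        # Rank every other instance by its distance to instance i.
        neighbours = sorted(
            (dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label == labels[i]:
            kept.append(i)
    return kept


# Hypothetical toy data: two clusters plus one mislabelled point (index 3).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels = ["a", "a", "a", "b", "b", "b", "b"]

kept = edited_nearest_neighbor(points, labels, k=3)
reduced_points = [points[i] for i in kept]
reduced_labels = [labels[i] for i in kept]
```

On this toy data the mislabelled point inside the first cluster is discarded and the other six instances are retained. The pairwise distance computation is quadratic in the number of instances, which hints at why preprocessing of this kind can itself become costly on large datasets.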

Dataset sizes are growing faster than Moore's Law, and the big data frenzy is sweeping through daily life. The challenge of managing massive amounts of data lies in the techniques that make the data accessible and usable. The big data concept is general and encapsulates much of the essence of data mining: discovering the most important and relevant knowledge in the data and turning it into valuable information. Advances in Internet technology and the popularity of cloud computing can overcome the time-efficiency limitations of traditional data mining methods on very large datasets. Big data mining technology should make it possible to mine massive amounts of data efficiently, with the aim of producing useful information in real time. The data scientist John Rauser defines big data as “any amount of data that’s too big to be handled by one computer.” A standalone machine has neither enough memory nor enough storage capacity to handle big data efficiently. Big data mining can instead be performed efficiently with conventional distributed and MapReduce methodologies. This raises an important research question: do the distributed and MapReduce methodologies perform differently in mining accuracy and efficiency on large-scale datasets? A second question follows: does big data mining need data preprocessing? The experimental results on four large-scale datasets show that MapReduce without data preprocessing requires the least processing time and allows the classifier to achieve the highest classification accuracy regardless of how many computer nodes are used, except on a class-imbalanced dataset.
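The partition, train and combine pattern behind both methodologies can be sketched in a few lines. The example below is a single-process simulation under stated assumptions, not the thesis's implementation: the thesis reports using an SVM classifier on distributed and MapReduce platforms, whereas this sketch swaps in a nearest-centroid classifier so that it runs with the standard library alone. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from math import dist


def split_into_partitions(rows, n_parts):
    """Deal rows round-robin into n_parts partitions (one per node)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def train_centroids(partition):
    """'Map' step: fit a nearest-centroid model on one partition."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in partition:
        s = sums[label]
        s[0] += x
        s[1] += y
        s[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(model, point):
    """Return the label of the centroid closest to point."""
    return min(model, key=lambda label: dist(model[label], point))


def vote(models, point):
    """'Reduce' step: majority vote over the per-partition predictions."""
    return Counter(predict(m, point) for m in models).most_common(1)[0][0]


# Hypothetical toy dataset: two well-separated classes in the plane.
rows = [((0, 0), "neg"), ((0, 1), "neg"), ((1, 0), "neg"), ((1, 1), "neg"),
        ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos"), ((6, 6), "pos")]

parts = split_into_partitions(rows, n_parts=3)   # three simulated nodes
models = [train_centroids(p) for p in parts]     # trained independently
low = vote(models, (0.5, 0.5))
high = vote(models, (5.5, 5.5))
```

Each simulated node sees only its own partition, so training cost per node shrinks as nodes are added, while the vote combines the partial models into one prediction. In a real deployment the map and reduce steps would run on separate machines and the combination strategy would depend on the classifier in use.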

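Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1    Research Background
2    Research Motivation
3    Research Objectives
4    Thesis Organization
Chapter 2 Literature Review
1    Big Data
2    Distributed Computing
2.1    Introduction to Distributed Computing
2.2    Distributed Data Mining
3    Cloud Computing
3.1    Introduction to Cloud Computing
3.2    System Architecture
3.3    Cloud Data Mining
4    Data Classification
5    Data Preprocessing
5.1    IB3
5.2    DROP3
5.3    GA
Chapter 3 Experimental Methods
1    Experiment 1
1.1    Baseline
1.2    Distributed Architecture
1.3    Cloud Architecture
1.3.1    Single-Machine Cloud
1.3.2    Cluster Cloud
2    Experiment 2
2.1    Single-Machine Instance Selection
2.2    Distributed Architecture
2.3    Cloud Architecture
2.3.1    Single-Machine Cloud
2.3.2    Cluster Cloud
Chapter 4 Experimental Results
1    Experimental Setup
1.1    Datasets
1.2    Computing Environment
1.3    Model Validation Criteria
2    Experimental Results
2.1    Results of Experiment 1
2.2    Results of Experiment 2
3    Discussion and Recommendations
Chapter 5 Conclusion
1    Results and Contributions
2    Research Limitations and Directions for Future Work
References
Appendix 1
Appendix 2
Appendix 3
Appendix 4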