| Author: | 葉貞麟 Chen-lin Yeh |
|---|---|
| Thesis title: | 平行運算架構下之巨量資料探勘:分散式與雲端方法之比較 Big Data Mining with Parallel Computing: A Comparison of Distributed and MapReduce Methodologies |
| Advisor: | 蔡志豐 Chih-fong Tsai |
| Oral defense committee: | |
| Degree: | 碩士 Master |
| Department: | 管理學院 - 資訊管理學系 Department of Information Management |
| Year of publication: | 2015 |
| Academic year of graduation: | 103 |
| Language: | Chinese |
| Number of pages: | 109 |
| Chinese keywords: | 巨量資料, 資料探勘, 分散式運算, 雲端運算, 樣本選取 |
| Foreign-language keywords: | Big Data, Data Mining, Distributed Computing, Cloud Computing, Instance Selection |
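
The big data wave has arrived: readily available information is now part of daily life, and data is growing faster than Moore's Law. This poses a major challenge to our ability to manage big data, the greatest being which techniques can make better use of it. Big data is in essence a concept extended from data mining: extracting value from massive data in real time. The spread of the Internet and the development of cloud computing have broken the limits of traditional data mining, so that big data mining can be carried out with much shorter computation times. The data scientist John Rauser defines big data as any amount of data too large for a single computer to handle. On current single-machine hardware, computation is too slow and storage capacity too small, so this study improves on the traditional data mining environment and process. The aim of this study is to analyze two computing techniques, a distributed architecture and a cloud MapReduce architecture, which pool computing resources to classify large datasets, expanding storage capacity and, with greater computing power, speeding up mining. In addition, instance selection filters out noisy data and thereby reduces data volume; the study examines whether such data preprocessing is a necessary step for big data. Finally, it determines which architecture and process yield the shortest execution time without sacrificing accuracy. The experimental results show that a cloud architecture built on a single large host, with 1 to 20 machines and with classification performed directly by an SVM classifier without data preprocessing, handles large datasets more efficiently. Four large datasets of up to 500,000 records, drawn from the UCI repository and the KDD Cup, are used to demonstrate the effectiveness of the proposed architecture and process.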

Dataset sizes are growing faster than Moore's Law, and the big data frenzy is sweeping through our daily lives. The challenge of managing massive amounts of data lies in the techniques that make the data accessible. The big data concept is general and encapsulates much of the essence of data mining techniques, which can discover the most important and relevant knowledge and turn it into valuable information. Advances in Internet technology and the popularity of cloud computing can break through the time-efficiency limitations of traditional data mining methods on very large scale dataset problems. Big data mining technology should create the conditions for efficiently mining massive amounts of data, with the aim of producing useful information in real time. The data scientist John Rauser defines big data as “any amount of data that’s too big to be handled by one computer.” A standalone device has neither enough memory nor enough storage capacity to handle big data efficiently. Therefore, big data mining can be performed efficiently via the conventional distributed and MapReduce methodologies. This raises an important research question: do the distributed and MapReduce methodologies perform differently in mining accuracy and efficiency over large scale datasets? A second question follows: does big data mining need data preprocessing? The experimental results, based on four large scale datasets, show that using MapReduce without data preprocessing requires the least processing time and allows the classifier to provide the highest classification rate regardless of how many computer nodes are used, except for a class-imbalanced dataset.
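
The MapReduce pattern referred to in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: a nearest-centroid classifier stands in for the SVM, the worker nodes are simulated sequentially in one process, and every name (`map_partition`, `reduce_partials`, `train`, `predict`) and the toy data are hypothetical.

```python
from functools import reduce


def map_partition(partition):
    """Map step: emit per-class (feature sums, count) for one data partition."""
    partial = {}
    for features, label in partition:
        sums, count = partial.get(label, ([0.0] * len(features), 0))
        sums = [s + x for s, x in zip(sums, features)]
        partial[label] = (sums, count + 1)
    return partial


def reduce_partials(left, right):
    """Reduce step: merge two per-class partial results key by key."""
    merged = dict(left)
    for label, (sums, count) in right.items():
        if label in merged:
            old_sums, old_count = merged[label]
            merged[label] = ([a + b for a, b in zip(old_sums, sums)], old_count + count)
        else:
            merged[label] = (sums, count)
    return merged


def train(records, n_nodes):
    """Split records across n_nodes simulated workers, then map and reduce."""
    partitions = [records[i::n_nodes] for i in range(n_nodes)]
    # On a real cluster each map call would run on a separate node.
    partials = [map_partition(p) for p in partitions]
    totals = reduce(reduce_partials, partials, {})
    return {label: [s / count for s in sums] for label, (sums, count) in totals.items()}


def predict(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((c - x) ** 2 for c, x in zip(centroids[label], features)),
    )


# Toy data, invented for illustration only.
records = [([1.0, 1.0], "a"), ([1.0, 2.0], "a"), ([8.0, 8.0], "b"), ([9.0, 8.0], "b")]
model = train(records, n_nodes=2)
print(predict(model, [2.0, 1.0]))
```

Because the map step emits only per-class sums and counts, the reduced model in this sketch does not depend on how the records are partitioned. An SVM does not decompose this simply, so this property should not be read as a claim about the thesis's classifier.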