| Graduate student: | 莊尚豪 Shan-Hao Chuang |
|---|---|
| Thesis title: | aTCA工業電腦架構下之高可用性虛擬機器容錯系統 Using Fault-Tolerant Virtual Machines for High System Availability under Advanced Telecommunication Computing Architecture |
| Advisor: | 王尉任 |
| Committee members: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering |
| Year of publication: | 2014 |
| Academic year of graduation: | 102 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 64 |
| Keywords (Chinese): | ATCA, fault-tolerant system, high availability, virtual machine |
| Keywords (foreign): | Failover |
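In the cloud era, virtualization technology is widely used to partition a physical server logically into several virtual machines that provide different types of services. However, virtualized services can be interrupted by many kinds of faults. For example, a physical machine failure affects the virtual machines running on it, which lowers their availability and in turn disrupts users of the services hosted on them. The faults that a general-purpose computer architecture can detect, and the ways it can detect them, are limited. Under the ATCA (Advanced Telecommunications Computing Architecture) industrial computer architecture, whose hardware supports IPMI (Intelligent Platform Management Interface), we can use IPMI to detect the hardware status quickly and resolve problems quickly. In this study we integrate ATCA industrial computers with KVM virtualization and propose a symmetric fault-tolerant system. The system uses the ATCA hardware to speed up server fault detection, quickly classifies each detected fault, and looks up the corresponding recovery mechanism. The fault-tolerant system then recovers the virtual machines of the faulty server on the backup server, which reduces the impact of a single point of failure on the virtual machines. Finally, we test the fault-tolerance performance of this system against other similar virtualization technologies on the same hardware and find that it has a significant advantage in reducing service downtime, that is, in improving availability.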
Virtualization technology is widely used in today’s cloud computing datacenters. With virtualization, each physical machine in a datacenter can be logically divided into several virtual machines, on which different types of software services can be hosted. However, many kinds of failures can reduce the availability of the whole system. For example, a failed physical machine brings down every virtual machine running on it, and consequently every software service on those virtual machines. Failures are difficult to detect efficiently in a general-purpose computer architecture because the hardware does not provide enough information for fast failure detection. In contrast, ATCA (Advanced Telecommunications Computing Architecture) physical machines provide high hardware availability and support IPMI (Intelligent Platform Management Interface), which can quickly report the hardware status. To provide a solution for high system availability, in this study we develop a novel failure model and, based on it, design a symmetric fault-tolerant mechanism using ATCA physical machines and KVM. The proposed mechanism divides ATCA physical machines into pairs, such that the two machines of a pair provide fault tolerance for each other. Once a failure is detected in the physical machine layer or the virtualization layer, the failed virtual machines are recovered on the other physical machine of the pair. We compared the proposed mechanism with a prior VM-based fault-tolerance tool. The results show that the proposed mechanism significantly reduces service downtime; that is, it provides better system availability for software services running on the virtual machines.
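The pairing scheme described in the abstract can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the thesis's implementation: the names `Host`, `pair_hosts`, and `fail_over` are hypothetical, and the `healthy` flag stands in for whatever IPMI-based or hypervisor-level detection the real system performs. No IPMI, KVM, or libvirt calls are made.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Host:
    """A physical machine (hypothetical model).

    `healthy` is a placeholder for the result of a real health check,
    for example one based on IPMI sensor data.
    """
    name: str
    healthy: bool = True
    vms: List[str] = field(default_factory=list)


def pair_hosts(hosts: List[Host]) -> List[Tuple[Host, Host]]:
    """Group hosts into consecutive pairs; an odd host out stays unpaired."""
    return [(hosts[i], hosts[i + 1]) for i in range(0, len(hosts) - 1, 2)]


def fail_over(pair: Tuple[Host, Host]) -> List[str]:
    """Recover the VMs of a failed host on its healthy partner.

    The pair is symmetric: either member can act as the backup for the
    other. Returns the names of the VMs that were moved.
    """
    first, second = pair
    for failed, backup in ((first, second), (second, first)):
        if not failed.healthy and backup.healthy:
            moved = list(failed.vms)
            backup.vms.extend(moved)  # "recover" the VMs on the partner
            failed.vms.clear()
            return moved
    return []  # nothing to do, or both hosts are down


# Usage: two blades back each other up; blade-1 fails.
blade1 = Host("blade-1", vms=["vm-a", "vm-b"])
blade2 = Host("blade-2", vms=["vm-c"])
pairs = pair_hosts([blade1, blade2])
blade1.healthy = False
moved_vms = fail_over(pairs[0])
```

In this toy model, recovery is just moving names between lists; in a real deployment that step would involve restoring or restarting each virtual machine on the partner host, which is where the service downtime measured in the thesis arises.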