| Graduate student: | 阿斐奇 Afiqie Fadhihansah |
|---|---|
| Thesis title: | A Practical Log and Replay Strategy for VM Fault Tolerance |
| Advisor: | 梁德榮 Deron Liang, M. Aziz Muslim, Muhammad Aswin |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering |
| Year of publication: | 2017 |
| Academic year of graduation: | 105 |
| Language: | English |
| Number of pages: | 52 |
| Keywords (Chinese): | 日誌和重放、容錯、虛擬機 |
| Keywords (English): | Log-and-replay, fault tolerance, virtual machine |
Virtualization is a computer architecture technology by which multiple virtual machines (VMs) are multiplexed on the same hardware machine. The purpose of a virtual machine is to enhance resource sharing among many users and to improve computer performance in terms of resource utilization and application flexibility. Hardware resources (CPU, memory, I/O devices, etc.) or software resources (operating system and software libraries) can be virtualized in various functional layers. Virtualization technology has been revitalized as the demand for distributed and cloud computing has increased sharply in recent years.
Fault tolerance is not just a property of individual machines; it may also characterize the rules by which they interact. For example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication in a packet-switched network, even when communication links are imperfect or overloaded. It does this by requiring the endpoints of the communication to expect packet loss, duplication, reordering, and corruption, so that these conditions do not damage data integrity and only reduce throughput by a proportional amount.
The most important requirement in designing a fault-tolerant virtual machine is ensuring that it actually meets its reliability requirements. Our solution to this problem takes the form of virtual machine logging and replay. By logging enough information about the execution of the system, we are able to replay the execution at a later time, repeating all non-deterministic events exactly as they occurred in the original execution. We have integrated the logging and replay mechanisms into the Kernel-based Virtual Machine (KVM), an open-source full-system virtualization package for Linux.
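The logging and replay idea can be illustrated with a minimal Python sketch. It is not the KVM integration described in the thesis: the names `EventLog`, `Recorder`, `Replayer`, and `run_guest` are hypothetical, and the "guest" is a toy computation whose state depends only on the events it reads. The sketch shows only the principle that feeding back the logged non-deterministic events in their original order reproduces the original execution.

```python
import random
from typing import Callable, List


class EventLog:
    """Append-only list of non-deterministic event values, in arrival order."""

    def __init__(self) -> None:
        self.events: List[int] = []


class Recorder:
    """Wraps a live event source and logs every value it hands out."""

    def __init__(self, source: Callable[[], int], log: EventLog) -> None:
        self._source = source
        self._log = log

    def next_event(self) -> int:
        value = self._source()
        self._log.events.append(value)
        return value


class Replayer:
    """Feeds back logged values in the original order instead of live input."""

    def __init__(self, log: EventLog) -> None:
        self._events = iter(list(log.events))

    def next_event(self) -> int:
        return next(self._events)


def run_guest(events, steps: int) -> int:
    """Toy 'guest' (hypothetical): its state depends only on the events read."""
    state = 0
    for _ in range(steps):
        state = (state * 31 + events.next_event()) % 1_000_003
    return state


log = EventLog()
# Original run: events come from a live, non-deterministic source and are logged.
original = run_guest(Recorder(lambda: random.randrange(256), log), steps=5)
# Replay run: the same events are read back from the log.
replayed = run_guest(Replayer(log), steps=5)
print(original == replayed)  # prints True
```

The design choice worth noting is that the guest code is identical in both runs; only the event source is swapped, which is what makes the replay faithful.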
Finally, the result of this research on a practical log and replay strategy for VM fault tolerance is the following ordering rule. When an output needs to be executed, the primary should first transfer the data events to the backup; only then is the primary allowed to execute the output. After the output has been performed, the primary should notify the backup. The backup does not perform the output if it has received the notification, and performs the output if it has not.
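The ordering rule above can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not the thesis implementation: the classes `Primary` and `Backup`, the `fail_after_transfer` flag, and the string payloads are all hypothetical, and network transfer is modeled as a direct method call.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Backup:
    """Hypothetical backup VM: holds logged events and completion notices."""
    pending: Dict[int, str] = field(default_factory=dict)  # output id -> data
    notified: Set[int] = field(default_factory=set)
    performed: List[str] = field(default_factory=list)

    def receive_events(self, output_id: int, data: str) -> None:
        self.pending[output_id] = data

    def receive_notification(self, output_id: int) -> None:
        self.notified.add(output_id)

    def take_over(self) -> None:
        """On primary failure, perform only the outputs never confirmed."""
        for output_id, data in sorted(self.pending.items()):
            if output_id not in self.notified:
                self.performed.append(data)


@dataclass
class Primary:
    """Hypothetical primary VM following the three-step order in the text."""
    backup: Backup
    performed: List[str] = field(default_factory=list)

    def emit(self, output_id: int, data: str,
             fail_after_transfer: bool = False) -> None:
        self.backup.receive_events(output_id, data)   # 1. transfer events first
        if fail_after_transfer:
            # Simulated crash (assumption): primary dies before the output.
            raise RuntimeError("primary failed before performing the output")
        self.performed.append(data)                   # 2. perform the output
        self.backup.receive_notification(output_id)   # 3. notify the backup


backup = Backup()
primary = Primary(backup)
primary.emit(1, "packet-A")
try:
    primary.emit(2, "packet-B", fail_after_transfer=True)
except RuntimeError:
    backup.take_over()

print(primary.performed, backup.performed)
```

In this run the primary performs `packet-A` and notifies the backup, then fails after transferring the events for `packet-B`; on takeover the backup performs only `packet-B`. One consequence of the rule as stated is that if the primary failed after performing an output but before the notification arrived, the backup would perform that output a second time, so the receiver would need to tolerate duplicates.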