Graduate student: 王建文
Jiann-Wen Wang
Thesis title: 基於libvirt與QEMU-KVM虛擬機器之記憶體層級同步容錯系統
An Adaptive Continuous Checkpointing Fault-Tolerant Virtual Machine System based on QEMU-KVM with libvirt
Advisors: 梁德容
Deron Liang
王尉任
Wei-Jen Wang
Oral examination committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Year of publication: 2020
Academic year of graduation: 108
Language: English
Number of pages: 57
Chinese keywords: QEMU-KVM, Libvirt, 虛擬機器, 容錯系統, 持續同步
English keywords: QEMU-KVM, Libvirt, Virtual Machine, Fault Tolerance, Continuous Checkpointing
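  • With the rapid development of cloud computing and virtualization, the IT industry can use these technologies to raise the utilization of physical machines and allocate resources elastically. However, consolidating multiple servers onto one physical machine also means that a single host hardware failure can bring down multiple services. When the host hardware fails, a virtualization-based fault-tolerant system can protect the running state of the virtual machines that host critical services, together with the soft real-time programs they run, and so further improve service availability.
    Based on QEMU 3.0.0, libvirt 5.7.0, and a continuous checkpointing architecture, this study implements a fault-tolerant system that can be controlled through an external management interface. The continuous checkpointing architecture meets the basic requirements of a fault-tolerant system by continuously synchronizing the state of the primary virtual machine with that of the backup virtual machine and by guaranteeing the consistency of external output. This study also helps system administrators improve the performance of services running under the fault-tolerant system by introducing compression tools to reduce the bandwidth needed for synchronization and by sensing the virtual machine's workload to set parameters accordingly.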


    The IT industry has widely adopted cloud computing and virtualization, which make resource management more efficient and elastic. However, as more servers are consolidated onto one physical server, availability is threatened by the hardware failure of a single physical host. A virtualization-based fault-tolerant system can protect mission-critical virtual machines running soft real-time applications from such hardware failures, thus improving the availability of their services.
    Based on QEMU 3.0.0, libvirt 5.7.0, and continuous checkpointing, this study implements a virtualization-based fault-tolerant system with a management interface. Continuous checkpointing keeps replicating the internal state of the VM on the primary host to the backup host to meet the requirements of fault tolerance, and outputs are buffered to ensure consistency. This study also designs and implements two methods to reduce the performance degradation that the system imposes on guest applications: automatically adjusting the checkpointing parameter, and using compression tools on demand to speed up dirty-page transfer. With these methods, system administrators can set up the system without searching for a suitable parameter for every application, and they gain more flexibility in deployment.
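
    The checkpoint-and-release cycle described in the abstract can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not the thesis's QEMU code: the `Primary` and `Backup` classes, the 4-byte page-number framing, and the 4096-byte page size are all hypothetical, and Python's standard `zlib` module stands in for whatever compression tool the system actually uses. The sketch shows two ideas: outgoing packets are held until the backup has applied the checkpoint that covers them, and the dirty pages of an epoch can optionally be compressed before transfer.

```python
import zlib
from dataclasses import dataclass, field

PAGE_SIZE = 4096          # assumed guest page size (illustrative)
RECORD = 4 + PAGE_SIZE    # hypothetical framing: 4-byte page number + page


@dataclass
class Backup:
    """Backup host: holds the last checkpointed copy of guest memory."""
    pages: dict = field(default_factory=dict)   # page number -> bytes

    def apply_checkpoint(self, payload: bytes, compressed: bool) -> None:
        raw = zlib.decompress(payload) if compressed else payload
        for off in range(0, len(raw), RECORD):
            page_no = int.from_bytes(raw[off:off + 4], "big")
            self.pages[page_no] = raw[off + 4:off + RECORD]


@dataclass
class Primary:
    """Primary host: tracks dirty pages and holds output per epoch."""
    pages: dict = field(default_factory=dict)        # page number -> bytes
    dirty: set = field(default_factory=set)          # pages written this epoch
    held_output: list = field(default_factory=list)  # packets not yet visible
    released: list = field(default_factory=list)     # packets sent to clients

    def write_page(self, page_no: int, data: bytes) -> None:
        # Pad or truncate to exactly one page and mark the page dirty.
        self.pages[page_no] = data.ljust(PAGE_SIZE, b"\0")[:PAGE_SIZE]
        self.dirty.add(page_no)

    def send_packet(self, packet: bytes) -> None:
        # Output is buffered, not sent, until the next checkpoint completes.
        self.held_output.append(packet)

    def checkpoint(self, backup: Backup, compress: bool) -> int:
        """End the epoch; return the number of bytes transferred."""
        raw = b"".join(
            n.to_bytes(4, "big") + self.pages[n] for n in sorted(self.dirty)
        )
        payload = zlib.compress(raw) if compress else raw
        # In this toy model a returning call plays the role of the backup's ack.
        backup.apply_checkpoint(payload, compress)
        self.dirty.clear()
        # Only now is it safe to expose the output produced during the epoch.
        self.released.extend(self.held_output)
        self.held_output.clear()
        return len(payload)


primary, backup = Primary(), Backup()
primary.write_page(7, b"balance=100")
primary.send_packet(b"HTTP 200 OK")
held_before = list(primary.released)          # nothing released yet
sent = primary.checkpoint(backup, compress=True)
```

    In this model a client can never observe output that depends on state the backup lacks, which is the consistency property the abstract refers to. The epoch length (how often `checkpoint` is called) is the kind of parameter the thesis adjusts automatically; that adjustment logic is not modeled here.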

    Chinese Abstract
    Abstract
    Contents
    List of Figures
    List of Tables
    I. Introduction
        1.1 Research Background
        1.2 Motivation and Contributions
        1.3 Outline
    II. Background Knowledge
        2.1 QEMU and Kernel-based Virtual Machine
        2.2 Libvirt
        2.3 Types of VM Fault Tolerance Systems
            2.3.1 Lock-Stepping
            2.3.2 Continuous Checkpointing
            2.3.3 Hybrid
        2.4 Live Migration with Compression Techniques
    III. System Design
        3.1 Overall Architecture
            3.1.1 Checkpointing and Messaging
            3.1.2 Watchdog
            3.1.3 Export
            3.1.4 Autopilot
        3.2 System Initialization
        3.3 Checkpointing Process and Network Output Correctness
        3.4 Fault Model and Fault Handling
            3.4.1 Fault Model Overview
            3.4.2 Correctness
        3.5 Libvirt Integration
        3.6 Additional Modification to QEMU
    IV. Performance Improvements
        4.1 Experiment Environment
            4.1.1 Environment Overview and Configuration
            4.1.2 Applications for Performance Evaluation
        4.2 Adjusting Epoch Time Adaptively
            4.2.1 Finding Optimal Epoch Time with Manual Experiments
            4.2.2 Probing the Moving Average Online
        4.3 Utilizing Compression Techniques
            4.3.1 Implementation of Compressing Checkpoints
            4.3.2 Performance Evaluation on Compressing Checkpoints
    V. Evaluation
        5.1 Experiment Environment
        5.2 Experiment Results
            5.2.1 TPC-C OLTP Database Benchmark
            5.2.2 Acme Air in NodeJS
            5.2.3 Kernel Compilation
            5.2.4 Network Latency of Idle Guest
            5.2.5 Network Throughput
    VI. Related Work
        6.1 Virtual Machine Fault Tolerance
            6.1.1 Continuous Checkpointing Implementations
        6.2 Live Migration with Lossless Compression Algorithms
            6.2.1 XOR-Based Zero Run Length Encoding (XBZRLE)
            6.2.2 LZ4 Lossless Compression
    VII. Conclusion and Future Work
    References

