
Author: 鄭博謙 (Po-Chien Cheng)
Thesis title: Distributed workflow processing for multisource data streams based on Kubernetes and OpenFaaS
Original title: 基於 Kubernetes 與 OpenFaaS 之分散式多源數據工作流處理系統
Advisor: 王尉任 (Wei-Jen Wang)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science & Information Engineering
Year of publication: 2022
Academic year of graduation: 110 (ROC calendar)
Language: Chinese
Number of pages: 54
Keywords: Microservice, Kafka, ETL, Kubernetes, Function as a Service
    In today's big data era, collecting diverse data has become an important part of enterprise resources. ETL is a common approach to acquiring, processing, and analyzing data: data collected from different sources passes through a series of preprocessing, aggregation, and filtering steps and is finally stored for later analysis. Because data sources differ, each data flow must be built in a specific programming language, and as the scope and scale of the system grow, the maintainability and manageability of the data flows decline. Containerization technology has developed rapidly in recent years, and many software services are now deployed and run as microservices. Container orchestration tools provide automatic deployment and scaling across host clusters and ensure that services run on available nodes, but microservices still occupy system resources when they are not in use, which has gradually given rise to the newer serverless concept. This research therefore designs a distributed workflow processing mechanism for multi-source data, using Kafka as a staging platform for the various input and output data sources. A workflow manager defines the conditions and steps required to process a data flow. Each step is packaged as a serverless function (FaaS) and deployed in Kubernetes so that the workflow manager can invoke it. By monitoring traffic at the FaaS Gateway, the system automatically scales services out to handle traffic peaks and scales them in when they are not in use to reduce system resource usage. Finally, the processed data is stored in an external data warehouse.
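The workflow mechanism summarized in the abstract (a workflow manager that defines conditions and steps, with each step packaged as a function the manager invokes) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The names `Step`, `Workflow`, `run_workflow`, `invoke_local`, and the three sample functions are hypothetical, and the functions run in-process as stand-ins for deployed FaaS functions. In a real deployment the `invoke` callable would presumably send the record over HTTP to the FaaS gateway instead.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]


@dataclass
class Step:
    """One workflow step: a function name plus an optional run condition."""
    function: str  # hypothetical name of a deployed function
    condition: Optional[Callable[[Record], bool]] = None


@dataclass
class Workflow:
    """An ordered list of steps applied to each incoming record."""
    name: str
    steps: List[Step] = field(default_factory=list)


def run_workflow(workflow: Workflow, record: Record,
                 invoke: Callable[[str, Record], Record]) -> Record:
    """Run each step in order, skipping steps whose condition is false."""
    for step in workflow.steps:
        if step.condition is None or step.condition(record):
            record = invoke(step.function, record)
    return record


# In-process stand-ins for serverless functions (assumption: illustrative only).
FUNCTIONS: Dict[str, Callable[[Record], Record]] = {
    "parse_temperature": lambda r: {**r, "temp_c": float(r["raw"])},
    "to_fahrenheit": lambda r: {**r, "temp_f": r["temp_c"] * 9 / 5 + 32},
    "flag_hot": lambda r: {**r, "alert": True},
}


def invoke_local(name: str, record: Record) -> Record:
    """Look up a function by name and call it locally (no network involved)."""
    return FUNCTIONS[name](record)


WORKFLOW = Workflow(
    name="temperature-etl",
    steps=[
        Step("parse_temperature"),
        Step("to_fahrenheit"),
        # Conditional step: only runs for readings above 30 degrees Celsius.
        Step("flag_hot", condition=lambda r: r["temp_c"] > 30),
    ],
)

if __name__ == "__main__":
    print(run_workflow(WORKFLOW, {"raw": "35.0"}, invoke_local))
    print(run_workflow(WORKFLOW, {"raw": "20"}, invoke_local))
```

Passing `invoke` as a parameter keeps the workflow definition independent of how functions are reached, so the same workflow could be exercised locally or against a gateway without changing its steps.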

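    Table of Contents
    Abstract (Chinese) i
    Abstract ii
    Table of Contents iii
    List of Tables v
    List of Figures v
    1. Introduction 1
        1-1 Research Background 1
        1-2 Research Motivation and Objectives 2
        1-3 Thesis Organization 2
    2. Background Knowledge 3
        2-1 Kubernetes 3
        2-2 Confluent 5
            2-2-1 Apache Kafka 6
            2-2-2 Kafka Connect 8
            2-2-3 Schema Registry 9
            2-2-4 REST Proxy 9
        2-3 OpenFaaS 10
        2-4 Lightflow 11
        2-5 Argo Event 12
    3. Related Work and Discussion 14
        3-1 Research on Big Data Platforms Handling Heterogeneous Data Sources 14
        3-2 Research on Serverless ETL Pipeline Applications 14
        3-3 Research on Workflow Management Systems 15
    4. System Design 16
        4-1 System Architecture 16
        4-2 Source & Sink Component 19
        4-3 Modularized functions into OpenFaaS 20
        4-4 Workflow Management System 21
        4-5 Configs Service Component 23
        4-6 Runner Service Component 25
        4-7 REST API 27
    5. Case Study 33
        5-1 ETL: Extract, Transform, Load 33
        5-2 Analysis of Problems Facing ETL 34
        5-3 Analysis of the Integrated System 35
            5-3-1 Introducing Kafka Connect 35
            5-3-2 Introducing OpenFaaS 36
            5-3-3 Introducing the Workflow System 37
            5-3-4 Discussion of System Maintainability and Manageability 38
    6. Conclusions and Future Research Directions 39
    References 40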

