| Graduate Student: | 林彥豪 (Yen-Hao Lin) |
|---|---|
| Thesis Title: | Presto auto-scaling mechanism based on SQL query workload estimation |
| Advisor: | 王尉任 (Wei-Jen Wang) |
| Committee Members: | |
| Degree: | Master |
| Department: | Executive Master Program of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2025 |
| Graduation Academic Year: | 113 |
| Language: | Chinese |
| Number of Pages: | 69 |
| Keywords: | Presto, Auto Scaling, Machine Learning, Resource Management, Cloud Cost |
With the rapid growth of data volumes in the big-data era, the demand for efficient data processing technologies has become increasingly urgent. Massively Parallel Processing (MPP) is a key architecture for large-scale data queries, and among MPP solutions the open-source Presto SQL query engine has drawn wide attention for its high performance. However, Presto's native static resource allocation strategy struggles to adapt to dynamically changing query workloads, leading to low resource utilization or unnecessary waste, which in turn hurts both query performance and cost-efficiency. To address this problem, this study proposes a Presto auto-scaling mechanism based on workload estimation from SQL query statements. The mechanism applies TF-IDF vectorization to extract features from SQL text, uses an XGBoost classification model to predict the required resource range, and combines Infrastructure as Code (IaC) practices with Apache Airflow workflows to automatically adjust the number of Presto worker nodes.

The proposed mechanism was validated on Oracle Cloud Infrastructure using the TPC-DS benchmark. Experimental results show that, compared to a static configuration of one coordinator and three workers, the auto-scaling mechanism reduced the total execution time of the 99 TPC-DS queries by 14%, with 92% of the queries showing lower latency; in particular, the latency of aggregation queries decreased by 20%. Thanks to the shorter execution times, overall cost-efficiency improved by nearly 5%, although total cloud computing cost increased by 12%, mainly because the average node count rose from 4 to 5.32. Despite the performance gains, 69% of the queries showed higher memory usage, with an average increase of 15–20%, which kept execution times from reaching their expected lows; a preliminary analysis attributes this to the added data exchange among worker nodes after scaling out. These results indicate that balancing query performance against cost control in dynamic resource allocation remains a challenging task.
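The pipeline the abstract describes — TF-IDF features extracted from raw SQL text, then a classifier that maps each query to a resource tier — can be illustrated with a stdlib-only toy. This is a sketch, not the thesis implementation: the thesis uses an XGBoost classifier, for which a cosine-similarity comparison stands in here, and the sample queries and tier semantics below are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize_sql(query):
    """Split a SQL statement into lowercase keyword/identifier tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", query.lower())

def tfidf_vectors(queries):
    """Return one sparse TF-IDF vector (dict: token -> weight) per query."""
    docs = [tokenize_sql(q) for q in queries]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per token
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical mini-corpus: two aggregation queries and one point lookup.
queries = [
    "SELECT s_store_name, SUM(ss_net_paid) FROM store_sales GROUP BY s_store_name",
    "SELECT i_category, SUM(ss_net_profit) FROM store_sales GROUP BY i_category",
    "SELECT c_first_name FROM customer WHERE c_customer_sk = 42",
]
v_agg1, v_agg2, v_lookup = tfidf_vectors(queries)
# The two aggregation queries share SUM / GROUP BY / store_sales tokens, so
# their TF-IDF vectors lie much closer to each other than to the point lookup,
# which is what lets a downstream classifier assign them a larger worker tier.
```

In the thesis itself, the predicted class from the TF-IDF + XGBoost model drives the Terraform/Airflow scaling workflow; this sketch only shows why lexically similar workloads (e.g. heavy aggregations) cluster together in TF-IDF space and are therefore separable by such a classifier.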