| Graduate Student: | 唐崇祐 (Chung-Yu Tang) |
|---|---|
| Thesis Title: | 應用多模態語言模型於模擬機器人自主任務規劃與執行之研究 (A Study on the Application of Multimodal Large Language Model for Autonomous Task Planning and Execution in Simulated Robots) |
| Advisor: | 蘇木春 (Mu-Chun Su) |
| Committee Members: | |
| Degree: | Master |
| Department: | 資訊電機學院 資訊工程學系 (College of Electrical Engineering and Computer Science, Department of Computer Science & Information Engineering) |
| Year of Publication: | 2025 |
| Academic Year: | 113 |
| Language: | Chinese |
| Pages: | 132 |
| Keywords (Chinese): | 多模態語言模型、任務規劃、自然語言指令、原子動作序列、語意推理 |
| Keywords (English): | Multimodal Language Model, Task Planning, Natural Language Instruction, Atomic Action Sequence, Semantic Reasoning |
With the recent breakthroughs of multimodal large language models (MLLMs) in semantic understanding and reasoning, applying them to the task planning and execution pipelines of real robots has become a key topic in intelligent robotics research. Although systems such as GR00T demonstrate strong cross-modal integration and manipulation capabilities, their high computational cost and hardware requirements make them difficult for most research institutions to replicate and deploy. This study therefore proposes a general-purpose task planning framework built on lightweight language models, integrating commonsense knowledge graph reasoning, a semantic understanding module, and the Phi-4-mini-reasoning model to structurally transform natural language instructions and, ultimately, generate executable atomic action sequences that a simulated robot carries out.
The system accepts high-level commands phrased in abstract language and, drawing on structured semantics and information about operable objects, infers logically coherent action sequences. Our experiments cover a range of representative task scenarios, spanning classification, stacking, demand reasoning, and visual recognition, to validate the language model's capabilities in high-level semantics, spatial reasoning, and action translation. The results show that with well-designed prompts and semantic conventions, the system not only completes tasks reliably but also matches the decomposition ability of GPT-4; in particular, for mid-tier models such as GPT-3.5 and Phi-3.5-mini, Chain-of-Thought and few-shot strategies significantly improve performance. Overall, the results demonstrate that the method effectively supports a language-driven autonomous task execution pipeline even under limited resources.
With recent breakthroughs in semantic understanding and reasoning by Multimodal Large Language Models (MLLMs), how to effectively apply them to real-world robot task planning and execution has become a key challenge in the field of intelligent robotics. Although systems like GR00T have demonstrated strong multimodal integration and manipulation capabilities, their high computational cost and hardware requirements make them difficult to replicate and deploy for most research institutions. To address this, we propose a general-purpose task planning framework based on lightweight language models, integrating commonsense knowledge graph reasoning, a semantic understanding module, and the Phi-4-mini-reasoning model to structurally translate natural language instructions into executable atomic action sequences for simulated robot execution.
The proposed system supports issuing high-level commands in abstract language, and through structured semantics and operable object information, it infers and generates logically coherent action sequences. We design experiments covering various representative tasks—such as classification, stacking, navigation, and visual recognition—to validate the language model's capabilities in high-level semantic understanding, spatial reasoning, and instruction-to-action translation. Experimental results show that with appropriate prompt design and semantic constraints, the system can not only execute tasks reliably but also match the decomposition performance of GPT-4. Notably, for mid-sized models such as GPT-3.5 and Phi-3.5-mini, strategies like Chain-of-Thought and Few-shot prompting significantly improve task success rates. Overall, our findings demonstrate that this framework effectively supports language-driven autonomous task execution even under limited-resource conditions.
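The core step the abstract describes, converting a high-level instruction into a validated sequence of atomic actions, can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual implementation: the atomic-action vocabulary, the few-shot prompt format, and the `plan(...)`-style call syntax are all hypothetical assumptions chosen for the example. In the real system, the few-shot prompt would be sent to a language model (e.g. Phi-4-mini-reasoning) and the model's reply parsed; here the model output is stubbed with a fixed string.

```python
# Illustrative sketch of language-driven task decomposition.
# ATOMIC_ACTIONS, the prompt format, and the call syntax are assumptions,
# not the vocabulary actually used in the thesis.

ATOMIC_ACTIONS = {"move_to", "pick", "place", "open_gripper", "close_gripper"}

# A few-shot prompt: one worked example, then the new instruction.
FEW_SHOT_PROMPT = """Instruction: put the apple in the bowl
Plan:
move_to(apple)
pick(apple)
move_to(bowl)
place(apple, bowl)

Instruction: {instruction}
Plan:"""

def parse_action_sequence(raw: str) -> list[tuple[str, list[str]]]:
    """Parse lines like 'pick(red_block)' into (action, args) tuples,
    rejecting any action name outside the atomic vocabulary."""
    plan = []
    for line in raw.strip().splitlines():
        name, _, rest = line.strip().partition("(")
        args = [a.strip() for a in rest.rstrip(")").split(",") if a.strip()]
        if name not in ATOMIC_ACTIONS:
            raise ValueError(f"non-atomic action: {name}")
        plan.append((name, args))
    return plan

# Stand-in for a model reply to the prompt
# FEW_SHOT_PROMPT.format(instruction="put the red block on the blue block"):
raw_output = """move_to(red_block)
pick(red_block)
move_to(blue_block)
place(red_block, blue_block)"""

plan = parse_action_sequence(raw_output)
```

Validating the parsed plan against a fixed action vocabulary is one way to enforce the "semantic constraints" the abstract mentions: any hallucinated or malformed action is rejected before it ever reaches the simulated robot.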