| Graduate Student: | 唐崇祐 (Chung-Yu Tang) |
|---|---|
| Thesis Title: | 應用多模態語言模型於模擬機器人自主任務規劃與執行之研究 (A Study on the Application of Multimodal Large Language Model for Autonomous Task Planning and Execution in Simulated Robots) |
| Advisor: | 蘇木春 (Mu-Chun Su) |
| Committee Members: | |
| Degree: | Master |
| Department: | 資訊電機學院 資訊工程學系 (College of Electrical Engineering and Computer Science, Department of Computer Science & Information Engineering) |
| Year of Publication: | 2025 |
| Academic Year: | 113 |
| Language: | Chinese |
| Pages: | 132 |
| Keywords (Chinese): | 多模態語言模型、任務規劃、自然語言指令、原子動作序列、語意推理 |
| Keywords (English): | Multimodal Language Model, Task Planning, Natural Language Instruction, Atomic Action Sequence, Semantic Reasoning |
With the recent breakthroughs of multimodal large language models (MLLMs) in semantic understanding and reasoning, applying them to the task planning and execution pipelines of real robots has become a key topic in intelligent robotics research. Although systems such as GR00T demonstrate strong cross-modal integration and manipulation capabilities, their high computational cost and hardware requirements make them difficult for most research institutions to replicate and deploy. This study therefore proposes a general-purpose task planning framework built on lightweight language models, integrating commonsense knowledge graph reasoning, a semantic understanding module, and the Phi-4-mini-reasoning model to structurally transform natural language instructions and, ultimately, generate executable atomic action sequences that a simulated robot carries out.
The system accepts high-level commands phrased in abstract language and, drawing on structured semantics and information about operable objects, infers logically coherent action sequences. Our experiments cover a range of representative task scenarios, spanning classification, stacking, demand reasoning, and visual recognition, to validate the language model's capabilities in high-level semantics, spatial reasoning, and action translation. The results show that with well-designed prompts and semantic conventions, the system not only completes tasks reliably but also matches the decomposition ability of GPT-4; in particular, for mid-tier models such as GPT-3.5 and Phi-3.5-mini, Chain-of-Thought and few-shot strategies significantly improve performance. Overall, the results demonstrate that the method effectively supports a language-driven autonomous task execution pipeline even under limited resources.
With recent breakthroughs in semantic understanding and reasoning by Multimodal Large Language Models (MLLMs), how to effectively apply them to real-world robot task planning and execution has become a key challenge in the field of intelligent robotics. Although systems like GR00T have demonstrated strong multimodal integration and manipulation capabilities, their high computational cost and hardware requirements make them difficult to replicate and deploy for most research institutions. To address this, we propose a general-purpose task planning framework based on lightweight language models, integrating commonsense knowledge graph reasoning, a semantic understanding module, and the Phi-4-mini-reasoning model to structurally translate natural language instructions into executable atomic action sequences for simulated robot execution.
The proposed system supports issuing high-level commands in abstract language, and through structured semantics and operable object information, it infers and generates logically coherent action sequences. We design experiments covering various representative tasks—such as classification, stacking, navigation, and visual recognition—to validate the language model's capabilities in high-level semantic understanding, spatial reasoning, and instruction-to-action translation. Experimental results show that with appropriate prompt design and semantic constraints, the system can not only execute tasks reliably but also match the decomposition performance of GPT-4. Notably, for mid-sized models such as GPT-3.5 and Phi-3.5-mini, strategies like Chain-of-Thought and Few-shot prompting significantly improve task success rates. Overall, our findings demonstrate that this framework effectively supports language-driven autonomous task execution even under limited-resource conditions.
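The core step the abstract describes, converting a high-level instruction into a validated sequence of atomic actions, can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual implementation: the atomic-action vocabulary, the few-shot prompt format, and the `plan(...)`-style call syntax are all hypothetical assumptions chosen for the example. In the real system, the few-shot prompt would be sent to a language model (e.g. Phi-4-mini-reasoning) and the model's reply parsed; here the model output is stubbed with a fixed string.

```python
# Illustrative sketch of language-driven task decomposition.
# ATOMIC_ACTIONS, the prompt format, and the call syntax are assumptions,
# not the vocabulary actually used in the thesis.

ATOMIC_ACTIONS = {"move_to", "pick", "place", "open_gripper", "close_gripper"}

# A few-shot prompt: one worked example, then the new instruction.
FEW_SHOT_PROMPT = """Instruction: put the apple in the bowl
Plan:
move_to(apple)
pick(apple)
move_to(bowl)
place(apple, bowl)

Instruction: {instruction}
Plan:"""

def parse_action_sequence(raw: str) -> list[tuple[str, list[str]]]:
    """Parse lines like 'pick(red_block)' into (action, args) tuples,
    rejecting any action name outside the atomic vocabulary."""
    plan = []
    for line in raw.strip().splitlines():
        name, _, rest = line.strip().partition("(")
        args = [a.strip() for a in rest.rstrip(")").split(",") if a.strip()]
        if name not in ATOMIC_ACTIONS:
            raise ValueError(f"non-atomic action: {name}")
        plan.append((name, args))
    return plan

# Stand-in for a model reply to the prompt
# FEW_SHOT_PROMPT.format(instruction="put the red block on the blue block"):
raw_output = """move_to(red_block)
pick(red_block)
move_to(blue_block)
place(red_block, blue_block)"""

plan = parse_action_sequence(raw_output)
```

Validating the parsed plan against a fixed action vocabulary is one way to enforce the "semantic constraints" the abstract mentions: any hallucinated or malformed action is rejected before it ever reaches the simulated robot.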