| Author: | 蔡政育 (Cheng-Yu Tsai) |
|---|---|
| Thesis Title: | 應用視覺語意圖於模擬環境中操作機器人日常生活任務 (The Application of the Visual Semantics to Operate Robot Daily Tasks in a Simulated Environment) |
| Advisor: | 蘇木春 (Mu-Chun Su) |
| Oral Defense Committee: | |
| Degree: | Master (碩士) |
| Department: | College of Electrical Engineering & Computer Science — Department of Computer Science & Information Engineering |
| Year of Publication: | 2021 |
| Graduation Academic Year: | 109 |
| Language: | Chinese |
| Pages: | 68 |
| Keywords (Chinese): | 人工智慧、智慧型機器人、深度學習、圖神經網路 |
| Keywords (English): | Artificial Intelligence, Smart Robots, Deep Learning, Graph Neural Networks |
In recent years, deep learning has been widely applied in robotics, where the interaction between a robot's vision and language remains a particularly challenging research problem. Many studies in this area use ALFRED (Action Learning From Realistic Environments and Directives) as a benchmark; in this environment, a robot must carry out everyday indoor household tasks by following natural-language instructions. This thesis argues that giving the robot an understanding of both visual semantics and language semantics improves its inference ability. We propose a novel method, VSGM (Visual Semantic Graph Memory), which uses a semantic-graph representation to obtain better visual features and improve the robot's visual understanding. Using prior knowledge and a scene-graph generation network, the objects detected in an image and the relations among them and their attributes are converted into a graph representation and supplied to the robot; the detected objects are also projected onto a top-down egocentric map. Finally, a graph neural network extracts the object features that are important for the current task. The proposed method is validated in the ALFRED environment, where adding VSGM to the model improves the task success rate by 6–10%.
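To make the graph-neural-network step above concrete, the following is a minimal sketch of one graph-convolution layer over a tiny "semantic graph" of household objects, in the spirit of the GCN of Kipf and Welling [52]. The object names, edges, features, and weights here are invented for illustration; this is not the thesis's actual VSGM architecture.

```python
# Sketch: one GCN layer H = ReLU(D^-1/2 (A+I) D^-1/2 X W) over a toy
# semantic graph. All names and values below are illustrative assumptions,
# not the VSGM implementation from the thesis.
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer with symmetric normalization."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^-1/2
    H = D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W
    return np.maximum(H, 0.0)               # ReLU

# Toy graph: nodes 0=knife, 1=apple, 2=table; knife-apple and
# apple-table are related (e.g. "knife can cut apple").
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)                 # one-hot node features (stand-ins for word vectors)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # layer weights (learned in practice; random here)

H = gcn_layer(A, X, W)
print(H.shape)  # (3, 4): one 4-d embedding per object node
```

After this layer, each object's embedding mixes in information from its neighbors, which is how a graph representation lets the agent weight objects relevant to the current task.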
[1] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “Alfred: A benchmark for interpreting grounded instructions for everyday tasks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10740-10749, 2020.
[2] D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi, “Mapping instructions to actions in 3d environments with visual goal prediction,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018.
[3] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A platform for embodied ai research,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[4] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” International Conference on 3D Vision, 2017.
[5] E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra, “Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames,” in International Conference on Learning Representations, 2020.
[6] D. Gordon, A. Kadian, D. Parikh, J. Hoffman, and D. Batra, “Splitnet: Sim2sim and task2task transfer for embodied visual navigation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1022-1031, 2019.
[7] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. V. D. Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674-3683, 2018.
[8] A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, and D. Batra, “Improving vision-and-language navigation with image-text pairs from the web,” in European Conference on Computer Vision, pp. 259-274. Springer, 2020.
[9] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell, “Speaker-follower models for vision-and-language navigation,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[10] X. Wang, V. Jain, E. Ie, W. Y. Wang, Z. Kozareva, and S. Ravi, “Environment-agnostic multitask learning for natural language grounded navigation,” in European Conference on Computer Vision, 2020.
[11] X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y. Wang, W. Y. Wang, and L. Zhang, “Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6629-6638, 2019.
[12] H. Tan, L. Yu, and M. Bansal, “Learning to navigate unseen environments: Back translation with environmental dropout,” NAACL, 2019.
[13] J. Thomason, M. Murray, M. Cakmak, and L. Zettlemoyer, “Vision-and-dialog navigation,” in Conference on Robot Learning, pp. 394-406. PMLR, 2020.
[14] J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee, “Beyond the nav-graph: Vision-and-language navigation in continuous environments,” in European Conference on Computer Vision, pp. 104-120. Springer, 2020.
[15] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi, “Ai2-thor: An interactive 3d environment for visual ai,” arXiv preprint arXiv:1712.05474, 2017.
[16] Y. Lv, N. Xie, Y. Shi, Z. Wang, and H. T. Shen, “Improving target-driven visual navigation with attention on 3d spatial relationships,” arXiv preprint arXiv:2005.02153, 2020.
[17] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, “Visual semantic navigation using scene priors,” arXiv preprint arXiv:1810.06543, 2018.
[18] T. Nguyen, D. Nguyen, and T. Le, “Reinforcement learning based navigation with semantic knowledge of indoor environments,” in 2019 11th International Conference on Knowledge and Systems Engineering (KSE), pp. 1-7. IEEE, 2019.
[19] Y. Qiu, A. Pal, and H. I. Christensen, “Learning hierarchical relationships for object-goal navigation,” in Conference on Robot Learning, 2020.
[20] H. Du, X. Yu, and L. Zheng, “Learning object relation graph and tentative policy for visual navigation,” in European Conference on Computer Vision, pp. 19-34. Springer, 2020.
[21] M. K. Moghaddam, Q. Wu, E. Abbasnejad, and J. Q. Shi, “Optimistic agent: Accurate graph-based value estimation for more successful visual navigation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3733-3742, 2021.
[22] M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi, “Learning to learn how to learn: Self-adaptive visual navigation using meta-learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6750-6759, 2019.
[23] C. Gan, Y. Zhang, J. Wu, B. Gong, and J. B. Tenenbaum, “Look, listen, and act: Towards audio-visual embodied navigation,” in 2020 IEEE International Conference on Robotics and Automation, pp. 9701-9707. IEEE, 2020.
[24] U. Jain, L. Weihs, E. Kolve, A. Farhadi, S. Lazebnik, A. Kembhavi, and A. Schwing, “A cordial sync: Going beyond marginal policies for multi-agent embodied tasks,” in European Conference on Computer Vision, pp. 471-490. Springer, 2020.
[25] M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. Hausknecht, “Alfworld: Aligning text and embodied environments for interactive learning,” in International Conference on Learning Representations, 2021.
[26] K. P. Singh, S. Bhambri, B. Kim, R. Mottaghi, and J. Choi, “Moca: A modular object-centric approach for interactive instruction following,” arXiv preprint arXiv:2012.03208, 2020.
[27] H. Saha, F. Fotouhif, Q. Liu, and S. Sarkar, “A modular vision language navigation and manipulation framework for long horizon compositional tasks in indoor environment,” arXiv preprint arXiv:2101.07891, 2021.
[28] S. Storks, Q. Gao, G. Thattai, and G. Tur, “Are we there yet? learning to localize in embodied instruction following,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[29] D. Xu, R. Martín-Martín, D. Huang, Y. Zhu, S. Savarese, and L. Fei-Fei, “Regression planning networks,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
[30] Y. Zhu, J. Tremblay, S. Birchfield, and Y. Zhu, “Hierarchical planning for long-horizon manipulation with geometric and symbolic scene graphs,” in 2021 IEEE international conference on robotics and automation, 2021.
[31] Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi, “Visual semantic planning using deep successor representations,” in Proceedings of the IEEE international conference on computer vision, pp. 483-492, 2017.
[32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, 9(8), pp. 1735-1780, 1997.
[33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016.
[34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779-788, 2016.
[35] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, 28, pp. 91-99, 2015.
[36] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Factorizable net: an efficient subgraph-based framework for scene graph generation,” in European Conference on Computer Vision, pp. 335-351, 2018.
[37] K. Tang, Y. Niu, J. Huang, J. Shi, and H. Zhang, “Unbiased scene graph generation from biased training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3716-3725, 2020.
[38] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for scene graph generation,” in European Conference on Computer Vision, pp. 670-685, 2018.
[39] I. Armeni, Z. He, J. Y. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese, “3d scene graph: A structure for unified semantics, 3d space, and camera,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5664-5673, 2019.
[40] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans,” arXiv preprint arXiv:2002.06289, 2020.
[41] Z. Liao, Y. Zhang, J. Luo, and W. Yuan, “Tsm: Topological scene map for representation in indoor environment understanding,” IEEE Access, 8, pp. 185870-185884, 2020.
[42] U. Kim, J. Park, T. Song, and J. Kim, “3-d scene graph: A sparse and semantic representation of physical environments for intelligent agents,” IEEE transactions on cybernetics, 50(12), pp. 4921-4933, 2019.
[43] E. Beeching, J. Dibangoye, O. Simonin, and C. Wolf, “Learning to plan with uncertain topological maps,” arXiv preprint arXiv:2007.05270, 2020.
[44] D. S. Chaplot, R. Salakhutdinov, A. Gupta, and S. Gupta, “Neural topological slam for visual navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12875-12884, 2020.
[45] K. Chen, J. K. Chen, J. Chuang, M. Vázquez, and S. Savarese, “Topological planning with transformers for vision-and-language navigation,” arXiv preprint arXiv:2012.05292, 2020.
[46] D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov, “Learning to explore using active neural slam,” in International Conference on Learning Representations, 2020.
[47] C. Chen, S. Majumder, Z. Al-Halah, R. Gao, S. K. Ramakrishnan, and K. Grauman, “Learning to set waypoints for audio-visual navigation,” in International Conference on Learning Representations, 2021.
[48] V. Cartillier, Z. Ren, N. Jain, S. Lee, I. Essa, and D. Batra, “Semantic mapnet: Building allocentric semantic maps and representations from egocentric views,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[49] Z. Shen, L. Kästner, and J. Lambrecht, “Spatial imagination with semantic cognition for mobile robots,” arXiv preprint arXiv:2104.03638, 2021.
[50] S. Wani, S. Patel, U. Jain, A. X. Chang, and M. Savva, “Multion: Benchmarking semantic map memory using multi-object navigation,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020.
[51] Z. Seymour, K. Thopalli, N. Mithun, H. Chiu, S. Samarasekera, and R. Kumar, “Maast: Map attention with semantic transformers for efficient visual navigation,” in 2021 IEEE international conference on robotics and automation, 2021.
[52] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations, 2017.
[53] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
[54] S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim, “Graph transformer networks,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
[55] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla, “Heterogeneous graph neural network,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 793-803, 2019.
[56] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, 5, pp. 135-146, 2017.