Author: Cheng-Yu Tsai (蔡政育)
Thesis Title: The Application of the Visual Semantics to Operate Robot Daily Tasks in a Simulated Environment
Advisor: Mu-Chun Su (蘇木春)
Oral Defense Committee:
Degree: Master
Department: College of Electrical Engineering & Computer Science, Department of Computer Science & Information Engineering
Year of Publication: 2021
Academic Year of Graduation: 109
Language: Chinese
Number of Pages: 68
Keywords: Artificial Intelligence, Smart Robots, Deep Learning, Graph Neural Networks
  • In recent years, deep learning has been widely applied in robotics, and among the open research issues, the interaction between a robot's vision and language particularly demands attention and breakthroughs. Many studies in this area use ALFRED (Action Learning From Realistic Environments and Directives) as a performance benchmark; in this environment, the robot must carry out daily indoor household tasks according to the given language instructions. This thesis argues that giving a robot an understanding of visual semantics and language semantics improves its inference ability. We propose a novel method, VSGM (Visual Semantic Graph Memory), which uses a semantic-graph representation to obtain better visual image features and strengthen the robot's visual understanding. Prior knowledge and a scene graph generation network convert the scene into a graph representation that is given to the robot; the objects in the image are also mapped onto a top-down egocentric map; finally, a graph neural network extracts the object features important to the current task. The proposed method is evaluated in the ALFRED environment, and adding VSGM to the model improves the task success rate by 6–10%.
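The top-down egocentric map mentioned above can be illustrated with a minimal sketch. This is not the thesis's actual implementation: it assumes detected objects already come with metric coordinates in the agent's frame (`x` metres forward, `y` metres to the agent's left) and simply bins them into a grid centred on the agent; the function name and parameters are illustrative.

```python
def project_to_egocentric_map(detections, map_size=11, cell=0.5):
    """Bin detected objects into a top-down grid centred on the agent.

    detections: list of (label, x, y), with x metres forward and y metres
    to the agent's left; the agent sits at the grid centre facing "up".
    """
    grid = [[None] * map_size for _ in range(map_size)]
    centre = map_size // 2
    for label, x, y in detections:
        row = centre - int(round(x / cell))   # forward -> towards row 0
        col = centre - int(round(y / cell))   # left -> towards column 0
        if 0 <= row < map_size and 0 <= col < map_size:
            grid[row][col] = label            # later detections overwrite
    return grid
```

For example, a mug 1 m straight ahead lands two cells above the centre of an 11×11 map with 0.5 m cells. A real system would accumulate such maps over time as the agent moves, which is what makes the representation a memory.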


    In recent years, developing AI for robotics has attracted much attention, and the interaction of a robot's vision and language is particularly difficult. We consider that giving robots an understanding of visual semantics and language semantics will improve their inference ability. We propose a novel method, VSGM (Visual Semantic Graph Memory), which uses a semantic graph to obtain better visual image features and improve the robot's visual understanding. The method provides the robot with prior knowledge, detects the objects in the image, predicts the correlations among the objects and their attributes, and converts them into a graph-based representation; it also maps the objects in the image onto a top-down egocentric map. Finally, the object features important to the current task are extracted by a graph neural network. Our proposed method is verified on the ALFRED (Action Learning From Realistic Environments and Directives) dataset, in which the robot must perform daily indoor household tasks following the given language instructions. After VSGM is added to the model, the task success rate improves by 6–10%.
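The abstract describes extracting task-relevant object features with a graph neural network. As a minimal sketch of the underlying idea, not the thesis's actual architecture, the following implements one graph convolution layer in the style of Kipf and Welling, H' = ReLU(D^{-1/2} Â D^{-1/2} H W), in plain Python; the nodes would be scene objects and the adjacency matrix their predicted relations.

```python
import math

def matmul(X, Y):
    """Naive dense matrix multiplication for small matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def gcn_layer(A, H, W):
    """One GCN layer: add self-loops, symmetrically normalize, propagate, ReLU."""
    n = len(A)
    # A_hat = A + I (self-loops keep each node's own features)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in A_hat]
    # Symmetric normalization: D^{-1/2} A_hat D^{-1/2}
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, v) for v in row] for row in Z]
```

With an identity weight matrix and one-hot node features, each node's output blends its own features with its neighbours', which is the mechanism that lets task-relevant context flow between connected objects.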

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1: Introduction
      1-1 Research Motivation
      1-2 Research Objectives
      1-3 Thesis Organization
    Chapter 2: Related Work
      2-1 Robotics Research
      2-2 Robot Instruction-Following Tasks
      2-3 The MOCA Model
      2-4 Visual Feature Processing Methods
      2-5 Scene Object Storage Methods
      2-6 Semantic Graphs
    Chapter 3: Methodology
      3-1 The Overall VSGM Model
      3-2 Model Input Processing
      3-3 Semantic Graph
      3-4 Spatial Semantic Maps
        3-4-1 The Perspective Camera
        3-4-2 Semantic Mapping Method
        3-4-3 Spatial Semantic Map Visualization
      3-5 Graph Neural Networks
    Chapter 4: Experimental Design and Results
      4-1 The ALFRED Dataset
      4-2 Evaluation Metrics
      4-3 Experimental Results
        4-3-1 Task Success, Sub-Goal Success Rate and PLW
        4-3-2 Spatial Semantic Maps Ablation
        4-3-3 Semantic Graph Ablation
    Chapter 5: Conclusions and Future Work
      5-1 Conclusions
      5-2 Future Work
    References
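Among the evaluation metrics listed above is the path-length-weighted (PLW) score used by ALFRED, which discounts a binary task success by how efficiently the agent moved relative to the expert demonstration. A minimal sketch, with illustrative names:

```python
def plw_score(success, expert_steps, agent_steps):
    """Path-length-weighted score: binary success discounted by the ratio
    of the expert demonstration length to the agent's actual path length.
    An agent that succeeds but wanders scores less than 1; failure is 0."""
    return float(success) * expert_steps / max(agent_steps, expert_steps)

plw_score(True, 10, 20)   # 0.5: succeeded, but took twice the expert's steps
plw_score(True, 10, 5)    # 1.0: succeeding faster than the expert is not penalized
plw_score(False, 10, 5)   # 0.0: failure scores zero regardless of path length
```

The `max` in the denominator caps the score at 1.0, so beating the expert's path length yields no bonus.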

