Graduate Student: Tzu-Yu Chen (陳姿妤)
Thesis Title: Improve User Intent Classification by Incorporating Visual Context Using YMCL Model
Advisor: 蔡宗翰
Oral Defense Committee:
Degree: Master
Department: Department of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 108
Language: English
Number of Pages: 45
Keywords: Dialogue system, Intent classification task, Multimodality
  • In this study, we designed an interactive dialogue system to help users complete a robot assembly task. The dialogue system provides solutions to the problems users encounter during the assembly process. We map each user question to the most similar predefined frequently asked question (FAQ) and treat that FAQ as the user intent. The system then responds according to the detected user intent.
    In general, questions that share the same user intent can mostly be resolved with similar answers. In our assembly task, however, even the same question asked at different assembly steps should receive a different response. With only the user's question utterance, the user intent classifier in our dialogue system reaches an accuracy of just 68.95%. To address this, we add the Yolo-based Masker with CNN-LSTM (YMCL) model to the original user intent classifier. By incorporating visual information, the experiments on different datasets show a substantial improvement in accuracy.
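    As a rough illustration of the FAQ-mapping step described above, the sketch below retrieves the most similar predefined FAQ for an incoming question using off-the-shelf sentence embeddings and cosine similarity. The thesis itself fine-tunes a BERT-based classifier over the FAQ intents; the sentence-transformers model name and the toy FAQ list here are illustrative assumptions rather than the actual system.

```python
# Minimal sketch of "map the user question to the most related pre-defined FAQ".
# Assumptions: sentence-transformers is installed; FAQ entries and model name are toy examples.
from sentence_transformers import SentenceTransformer, util

faq_questions = [
    "Which screw should I use for the robot arm?",      # hypothetical FAQ entries
    "How do I attach the motor to the base?",
    "The parts do not fit together, what should I do?",
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
faq_embeddings = model.encode(faq_questions, convert_to_tensor=True)

def map_to_intent(user_question: str):
    """Return the index of the most similar FAQ (the detected user intent) and its score."""
    question_embedding = model.encode(user_question, convert_to_tensor=True)
    scores = util.cos_sim(question_embedding, faq_embeddings)[0]
    best = int(scores.argmax())
    return best, float(scores[best])

intent_id, score = map_to_intent("Which screw do I need for this step?")
print(intent_id, round(score, 3))
```

    In the actual system each detected intent is associated with a predefined answer, so the returned index would be used to look up the response given to the user.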


    In this research, we design an interactive dialogue system that aims to help the user complete a robot assembly task. The system provides a solution when the user encounters a problem during the assembly process. We map the user question to the most related pre-defined frequently asked question (FAQ) and use it as the user intent. The system then gives the answer according to the detected user intent.
    In general, user questions with the same user intent can mostly be solved with similar answers. However, in our assembly task, even the same user question asked at different assembly steps should lead to different responses. With only the user question utterance, our user intent classifier achieves an accuracy of 68.95%. To solve this problem, we integrate the proposed Yolo-based Masker with CNN-LSTM (YMCL) model into the user intent classifier of our dialogue system. By incorporating visual information, a significant improvement can be observed in experiments conducted on different datasets.
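    As a loose sketch of the multimodal classifier described above, the PyTorch code below encodes a sequence of already-masked video frames with a small CNN followed by an LSTM, then concatenates that feature with a BERT-style sentence embedding before the intent classification layer. The layer sizes, the number of intents, and the random tensors standing in for the BERT output and the YOLO-masked frames are assumptions for illustration; they do not reproduce the exact YMCL architecture from the thesis.

```python
import torch
import torch.nn as nn

class CnnLstmVideoEncoder(nn.Module):
    """Encode a sequence of (masked) frames with a small CNN followed by an LSTM."""
    def __init__(self, hidden_size: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                         # (B*T, 32, 1, 1)
        )
        self.lstm = nn.LSTM(32, hidden_size, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        per_frame = self.cnn(frames.flatten(0, 1)).view(b, t, -1)   # (B, T, 32)
        _, (h_n, _) = self.lstm(per_frame)
        return h_n[-1]                                       # (B, hidden_size)

class MultimodalIntentClassifier(nn.Module):
    """Concatenate a text embedding (e.g. a BERT [CLS] vector) with the video feature, then classify."""
    def __init__(self, num_intents: int, text_dim: int = 768, video_dim: int = 256):
        super().__init__()
        self.video_encoder = CnnLstmVideoEncoder(video_dim)
        self.head = nn.Sequential(
            nn.Linear(text_dim + video_dim, 512), nn.ReLU(),
            nn.Linear(512, num_intents),
        )

    def forward(self, text_emb: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        video_emb = self.video_encoder(frames)
        return self.head(torch.cat([text_emb, video_emb], dim=-1))

# Toy forward pass: random tensors stand in for a BERT sentence embedding and 8 masked frames.
model = MultimodalIntentClassifier(num_intents=30)           # 30 FAQ intents is a made-up number
logits = model(torch.randn(2, 768), torch.randn(2, 8, 3, 64, 64))
print(logits.shape)                                          # torch.Size([2, 30])
```

    Concatenation is the simplest possible fusion choice; the point is only that the video feature gives the classifier information about the current assembly step that the question text alone lacks.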

    Chinese Abstract i
    Abstract ii
    Acknowledgements iii
    Contents iv
    List of Figures vi
    List of Tables vii
    1 Introduction 1
    2 Related Work 4
    2.1 Dialogue System 4
    2.2 Visual and Video Question Answering 5
    2.3 Object Detection 6
    3 Method 7
    3.1 Dataset 7
    3.1.1 Data Collection 7
    3.1.2 Multimodal Dataset 12
    3.2 YMCL Model 13
    3.3 Multimodal Intent Classification Model 14
    4 Experiment and Result 15
    4.1 Without-Video Intent Classification 15
    4.1.1 Training Dataset 15
    4.1.2 Max Sequence Length of Input of BERT Model 16
    4.1.3 Sentence Representation Method 16
    4.2 Multimodal Intent Classification 17
    4.3 Error Analysis 19
    4.4 Analysis on Visual Context Capture Method 25
    4.5 Analysis on Different Test Data 26
    5 Conclusion 28
    Bibliography 30

