
Graduate Student: Yun-Chi Tsai (蔡允齊)
Thesis Title: Enhancing Deep-Learning Sign Language Recognition through Effective Spatial and Temporal Information Extraction (擷取有效畫面域與時間域資訊進行深度學習手語辨識)
Advisor: 蘇柏齊
Oral Examination Committee:
Degree: Master
Department: Department of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduation Academic Year: 111 (2022–2023)
Language: Chinese
Pages: 53
Keywords (Chinese): sign language recognition; keyframes; deep learning


    Automatic sign language recognition based on deep learning requires a large amount of video data for model training. However, creating and collecting sign language videos is time-consuming and tedious, and limited or insufficiently diverse datasets restrict the accuracy of sign language recognition models. In this study, we propose effective spatial and temporal data extraction methods for sign language recognition. The goal is to augment the limited sign language video data into a larger and more diverse training dataset. The augmented data, used as input to deep learning networks, can be paired with simpler architectures such as 3D-ResNet, achieving competitive sign language recognition performance without complex or resource-intensive network structures.
    Our spatial data extraction employs three types of data: skeletons obtained with MediaPipe, hand-region patterns or masks, and optical flow. These three data types can serve as a three-channel input, akin to the RGB input commonly fed to earlier 3D-ResNet models; unlike plain RGB, however, each of our channels carries distinct characteristics that make feature extraction more effective. For temporal data extraction, we compute and select keyframes that capture more meaningful visual information, enabling different frame selection strategies.
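One common way to realize keyframe selection of this kind is to score each frame by its inter-frame motion energy and keep the highest-scoring frames. The sketch below is a minimal, dependency-free illustration of that idea using plain frame differences; the function name and the "always keep the first frame" rule are illustrative assumptions, not necessarily the selection strategy used in the thesis.

```python
import numpy as np

def select_keyframes(frames, k):
    """Pick k frames with the largest inter-frame motion energy.

    frames: array of shape (T, H, W) or (T, H, W, C).
    Returns the chosen frame indices in temporal order. The first frame
    is always kept as an anchor for the start of the gesture.
    """
    frames = np.asarray(frames, dtype=np.float32)
    # Motion energy: mean absolute difference to the previous frame.
    diffs = np.abs(np.diff(frames, axis=0)).reshape(len(frames) - 1, -1).mean(axis=1)
    # Frame i+1 gets score diffs[i]; frame 0 gets +inf so it is always kept.
    scores = np.concatenate([[np.inf], diffs])
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)
```

A real pipeline would likely compute the motion score from the extracted skeletons or optical flow rather than raw pixels, but the ranking-and-sorting structure stays the same.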
    The proposed spatial and temporal data extraction methods also facilitate data augmentation that simulates various hand sizes, gesture speeds, shooting angles, and so on, which greatly helps expand the dataset and increase its diversity. Experimental results demonstrate that our approach substantially improves recognition accuracy on commonly used American Sign Language datasets.
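Two of the simulated variations above can be sketched with nothing but array indexing: temporal resampling approximates different gesture speeds, and center rescaling approximates different hand sizes or camera distances. The following is a minimal sketch under those assumptions; function names are illustrative and real pipelines would typically interpolate rather than use nearest-neighbour indexing.

```python
import numpy as np

def resample_speed(frames, factor):
    """Simulate a faster (factor > 1) or slower (factor < 1) signer by
    nearest-neighbour temporal resampling of a (T, ...) frame array."""
    t = len(frames)
    new_t = max(1, int(round(t / factor)))
    idx = np.clip(np.round(np.linspace(0, t - 1, new_t)).astype(int), 0, t - 1)
    return frames[idx]

def scale_frame(frame, scale):
    """Simulate a different hand size / camera distance by rescaling the
    frame about its centre (scale > 1 zooms in), keeping the original
    resolution via clipped nearest-neighbour sampling."""
    h, w = frame.shape[:2]
    ys = np.clip(((np.arange(h) - h / 2) / scale + h / 2).astype(int), 0, h - 1)
    xs = np.clip(((np.arange(w) - w / 2) / scale + w / 2).astype(int), 0, w - 1)
    return frame[np.ix_(ys, xs)]
```

Because these transforms act on the extracted channel maps rather than requiring new recordings, each source video can yield many distinct training samples.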

    Abstract (Chinese) / Abstract (English) / Contents / List of Figures / List of Tables
    1. Introduction
       1-1 Research Background and Motivation
       1-2 Research Contributions
       1-3 Thesis Organization
    2. Related Work
       2-1 Sign Language Recognition
       2-2 American Sign Language Datasets
       2-3 Related Methods
           2-3-1 Traditional Methods
           2-3-2 Deep Learning
    3. Method
       3-1 Preprocessing
           3-1-1 RGB to LSO
           3-1-2 MediaPipe Detection Errors
           3-1-3 Face Landmarks
       3-2 Model Architecture
       3-3 Data Augmentation
           3-3-1 Spatial Domain
           3-3-2 Temporal Domain
    4. Experimental Results
       4-1 Experimental Environment
       4-2 Experimental Results
           4-2-1 Baseline
           4-2-2 Dataset Curation
           4-2-3 Face Landmarks
           4-2-4 Temporal-Domain Tests
    5. Conclusions and Future Work
       5-1 Conclusions
       5-2 Future Work
    6. References

    [1] Y.-J. Chen, "Suitable Data Input for Deep-Learning-Based Sign Language Recognition with a Small Training Dataset," National Central University, CSIE, 2022. [Online]. Available: https://hdl.handle.net/11296/4ybeup
    [2] D. Li, C. Rodriguez, X. Yu, and H. Li, "Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison," in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 1459-1469.
    [3] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," Advances in neural information processing systems, vol. 27, 2014.
    [4] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
    [5] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: a large video database for human motion recognition," in 2011 International conference on computer vision, 2011: IEEE, pp. 2556-2563.
    [6] H. Luqman, "An Efficient Two-Stream Network for Isolated Sign Language Recognition Using Accumulative Video Motion," IEEE Access, vol. 10, pp. 93785-93798, 2022.
    [7] A. A. I. Sidig, H. Luqman, S. Mahmoud, and M. Mohandes, "KArSL: Arabic Sign Language Database," ACM Trans. Asian Low-Resour. Lang. Inf. Process., vol. 20, no. 1, p. Article 14, 2021, doi: 10.1145/3423420.
    [8] J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625-2634.
    [9] L. Hu, L. Gao, and W. Feng, "Self-Emphasizing Network for Continuous Sign Language Recognition," arXiv preprint arXiv:2211.17081, 2022.
    [10] J. Wang et al., "Deep high-resolution representation learning for visual recognition," IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3349-3364, 2020.
    [11] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, "Neural sign language translation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7784-7793.
    [12] H. Zhou, W. Zhou, W. Qi, J. Pu, and H. Li, "Improving sign language translation with monolingual data by sign back-translation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1316-1325.
    [13] K. Hara, H. Kataoka, and Y. Satoh, "Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?," in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6546-6555.
    [14] L. Smaira, J. Carreira, E. Noland, E. Clancy, A. Wu, and A. Zisserman, "A short note on the kinetics-700-2020 human action dataset," arXiv preprint arXiv:2010.10864, 2020.
    [15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE conference on computer vision and pattern recognition, 2009: IEEE, pp. 248-255.
    [16] W. Du, Y. Wang, and Y. Qiao, "Rpan: An end-to-end recurrent pose-attention network for action recognition in videos," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 3725-3734.
    [17] M. Boháček and M. Hrúz, "Sign pose-based transformer for word-level sign language recognition," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 182-191.
    [18] Z. Zhou, V. W. Tam, and E. Y. Lam, "SIGNBERT: a Bert-based deep learning framework for continuous sign language recognition," IEEE Access, vol. 9, pp. 161669-161682, 2021.
    [19] Z. Liu et al., "Video swin transformer," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3202-3211.
