
Author: 陳昱任 (Yu-Jen Chen)
Thesis Title: 適用於少量訓練資料之深度學習手語辨識輸入組合
(Suitable Data Input for Deep-Learning-Based Sign Language Recognition with a Small Training Dataset)
Advisor: 蘇柏齊 (Po-Chyi Su)
Committee Members:
Degree: Master
Department: College of Electrical Engineering & Computer Science - Department of Computer Science & Information Engineering
Year of Publication: 2022
Graduation Academic Year: 110 (ROC calendar; 2021-2022)
Language: Chinese
Pages: 55
Chinese Keywords: 手語辨識、特徵擷取、深度學習
English Keywords: Sign Language Recognition, Feature Extraction, Deep Learning
  • Deep-learning-based sign language recognition usually requires a large
    number of videos to train the neural network model. Considering situations
    in which sign language videos are scarce, this study generates effective
    training data, through feature extraction and training-data expansion, to
    help construct a deep-learning recognition model. We use Mediapipe to
    extract hand skeletons from sign language videos, analyze several
    hand-skeleton adjustment strategies and color arrangements, and generate
    hand masks from the skeletons to simulate the hand shapes of different
    signers. Because hand detection occasionally fails due to the motion blur
    caused by fast-moving fingers, we incorporate optical-flow maps to ensure
    that every frame retains hand-movement information. The hand skeleton,
    hand shape, and frame optical flow serve as the three channel inputs of a
    3D-ResNet model, and we adopt different spatial transformations and
    temporal sampling strategies to simulate hands of different sizes,
    different filming angles, different signing speeds, and other variations.
    Experimental results show that the proposed method effectively improves
    recognition accuracy on an American Sign Language dataset.

    Keywords - sign language recognition, feature extraction, deep learning


    Deep-learning-based sign language recognition usually requires a large
    number of sign language videos to train neural network models. In this study,
    we consider the case in which only a small number of sign language videos
    are available for training, and generate effective training data, through
    feature extraction and training-data expansion, to help construct
    deep-learning recognition models. We use Mediapipe to obtain the hand
    skeleton from each sign language video, analyze several hand-skeleton
    adjustment policies and color arrangements, and generate hand masks from the
    skeleton to simulate the hands of different signers. Since hand detection
    may fail due to the motion blur caused by rapid hand movements, we
    incorporate optical flow to ensure that hand-movement information is
    retained in each frame. The hand skeleton, hand mask, and optical flow serve
    as the three channel inputs of a 3D-ResNet model, and we apply different
    spatial and temporal processing strategies to simulate different hand sizes,
    filming angles, and hand speeds. Experimental results show that the proposed
    approach effectively improves recognition accuracy on an American Sign
    Language dataset.
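    As a minimal sketch of the three-channel input assembly described above:
    the per-frame feature maps below are placeholders (zeros), and the array
    names and dimensions are illustrative assumptions, not the thesis code.

```python
import numpy as np

# Illustrative clip dimensions: 16 frames of 112x112 feature maps.
T, H, W = 16, 112, 112

# Placeholder per-frame feature maps; in the proposed pipeline these
# would come from the rendered Mediapipe hand skeleton, the luminance
# of the synthesized hand mask, and the optical-flow map.
skeleton = np.zeros((T, H, W), dtype=np.float32)
mask_lum = np.zeros((T, H, W), dtype=np.float32)
flow_map = np.zeros((T, H, W), dtype=np.float32)

# Stack the three feature maps as the channels of one input clip, in
# the (channels, frames, height, width) layout commonly fed to 3D CNNs.
clip = np.stack([skeleton, mask_lum, flow_map], axis=0)
print(clip.shape)  # (3, 16, 112, 112)
```

    Packing the three cues into the channel axis lets a standard 3-channel
    3D ResNet consume them without any architectural change.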

    Index Terms - Sign Language Recognition, Feature Extraction, Deep
    Learning
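    The temporal-domain sampling used to simulate different hand speeds can be
    sketched as follows; `sample_clip` and its parameters are hypothetical
    names for illustration, not code from the thesis.

```python
import numpy as np

def sample_clip(num_frames, clip_len, speed, rng):
    """Pick clip_len frame indices from a video with num_frames frames.

    speed > 1 strides over frames (simulating a faster signer), while
    speed < 1 repeats frames (a slower signer). Indices are clipped to
    the valid frame range.
    """
    span = int(clip_len * speed)
    start = rng.integers(0, max(1, num_frames - span))
    idx = start + np.arange(clip_len) * speed
    return np.clip(np.round(idx).astype(int), 0, num_frames - 1)

rng = np.random.default_rng(0)
fast = sample_clip(64, 16, 2.0, rng)  # wider stride: skips frames
slow = sample_clip(64, 16, 0.5, rng)  # narrower stride: repeats frames
```

    Sampling the same video at several speeds yields additional training
    clips without collecting new footage.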

    Table of Contents

    Abstract (Chinese)
    Abstract (English)
    List of Figures
    List of Tables
    Chapter 1: Introduction
        1.1 Research Motivation
        1.2 Research Contributions
        1.3 Thesis Organization
    Chapter 2: Related Work
        2.1 Sign Language Datasets
        2.2 Sign Language Recognition
            2.2.1 Traditional Methods
            2.2.2 Deep Learning Models
        2.3 Feature Extraction
            2.3.1 Image Processing
            2.3.2 Auxiliary Tools
            2.3.3 Feature Extraction Models (Detectron2, Mediapipe)
    Chapter 3: Proposed Method
        3.1 Datasets
            3.1.1 American Sign Language Dataset
            3.1.2 Taiwanese Sign Language Dataset
        3.2 Training Set Synthesis
            3.2.1 Hand Feature Extraction
            3.2.2 Synthesizing Training Data
            3.2.3 Deep Learning Model Architecture (3D ResNet)
        3.3 Data Augmentation
            3.3.1 Spatial Domain
            3.3.2 Temporal Domain
    Chapter 4: Experimental Results
        4.1 Development Environment
        4.2 Results
            4.2.1 Baseline
            4.2.2 Comparison of Different Input Combinations
            4.2.3 Luminance of Mask + Skeleton + Optical Flow
            4.2.4 Comparison with Different Models
            4.2.5 Taiwanese Sign Language Recognition
    Chapter 5: Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Work
    References

    List of Figures
    Figure 1: Kinect-assisted sign language recognition [6]
    Figure 2: Two-stream network architecture [8]
    Figure 3: LRCN architecture [11]
    Figure 4: 2D CNN + LSTM two-stream network architecture [12]
    Figure 5: 2D Conv + GRU architecture [13]
    Figure 6: 3D ResNet [14]
    Figure 7: I3D architecture [13]
    Figure 8: End-to-end RPAN architecture [18]
    Figure 9: SPOTER architecture [19]
    Figure 10: Traditional image processing pipeline
    Figure 11: Auxiliary tools for sign language recognition
    Figure 12: Detectron2 modules [21]
    Figure 13: Mediapipe modules [22]
    Figure 14: WLASL American Sign Language dataset
    Figure 15: Taiwanese Sign Language training set
    Figure 16: Taiwanese Sign Language test set
    Figure 17: Generating hand bounding boxes
    Figure 18: Generating hand skeletons
    Figure 19: Hand mask generation pipeline
    Figure 20: Generating optical-flow maps
    Figure 21: Training data synthesis
    Figure 22: Blue + Red + Optical Flow
    Figure 23: Saturation + Value + Optical Flow
    Figure 24: Luminance of Mask + Skeleton + Optical Flow
    Figure 25: Luminance of Mask + Skeleton + Optical Flow + Face Landmark
    Figure 26: Skeleton adjustment strategies
    Figure 27: 3D ResNet residual block
    Figure 28: 3D ResNet network architecture
    Figure 29: Spatial-domain data augmentation
    Figure 30: Different hand features
    Figure 31: Optical-flow maps of sign language videos
    Figure 32: Optical-flow line color adjustment, 128-255
    Figure 33: Optical-flow line color adjustment, 16-255
    Figure 34: Removing the dots from the optical-flow map
    Figure 35: A different color for each joint
    Figure 36: The same color for each finger
    Figure 37: The same color for each finger + the same color for fingertips
    Figure 38: Slightly brightened skeleton colors

    List of Tables
    Table 1: Word-level sign language datasets
    Table 2: Continuous sign language datasets
    Table 3: Single-feature comparison
    Table 4: Comparison of different input combinations
    Table 5: Optical-flow line color adjustment experiments
    Table 6: Skeleton color adjustment experiments
    Table 7: Hand mask adjustment experiments
    Table 8: Comparison of different models
    Table 9: Taiwanese Sign Language experiments

    [1] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and
    its application to action recognition. In Proceedings of the 15th ACM
    International Conference on Multimedia, pages 357–360. ACM, 2007.
    [2] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning
    spatiotemporal features with 3d convolutional networks. In Proceedings of
    the IEEE international conference on computer vision, pages 4489–4497,
    2015.
    [3] F. Yasir, P. C. Prasad, A. Alsadoon, and A. Elchouemi. SIFT based
    approach on Bangla sign language recognition. In 2015 IEEE 8th
    International Workshop on Computational Intelligence and Applications
    (IWCIA), pages 35–39. IEEE, 2015.
    [4] M. Al-Rousan, K. Assaleh, and A. Talaa. Video-based signer-independent
    Arabic sign language recognition using hidden Markov models. Applied Soft
    Computing, 9(3):990–999, 2009.
    [5] K. Simonyan and A. Zisserman. Two-stream convolutional networks for
    action recognition in videos. In Advances in Neural Information Processing
    Systems, pages 568–576, 2014.
    [6] M. W. Kadous et al. Machine recognition of Auslan signs using
    PowerGloves: Towards large-lexicon recognition of sign language. In
    Proceedings of the Workshop on the Integration of Gesture in Language and
    Speech, vol. 165, 1996.
    [7] C. Wang, Z. Liu, and S.-C. Chan, “Superpixel-based hand gesture
    recognition with kinect depth camera,” IEEE transactions on multimedia,
    vol. 17, no. 1, pp. 29–39, 2014.
    [8] K. Simonyan and A. Zisserman. Two-stream convolutional networks for
    action recognition in videos. In Advances in Neural Information Processing
    Systems, pages 568–576, 2014.
    [9] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset
    of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.
    [10] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio and T. Serre. HMDB: A
    Large Video Database for Human Motion Recognition. In 2011 IEEE
    International Conference on Computer Vision, 2011.
    [11] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S.
    Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent
    convolutional networks for visual recognition and description. In
    Proceedings of the IEEE conference on computer vision and pattern
    recognition, pages 2625–2634, 2015.
    [12] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R.
    Monga, and G. Toderici. Beyond short snippets: Deep networks for video
    classification. In Proceedings of the IEEE Conference on Computer Vision
    and Pattern Recognition, pages 4694–4702, 2015.
    [13] Dongxu Li, Cristian Rodriguez Opazo, Xin Yu, and Hongdong Li.
    Word-level deep sign language recognition from video: A new large-scale
    dataset and methods comparison. In WACV, 2020.
    [14] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal
    3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, 2018.
    [15] Kinetics: https://deepmind.com/research/open-source/kinetics
    [16] ImageNet: https://www.image-net.org/
    [17] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new
    model and the Kinetics dataset. In CVPR, 2017.
    [18] W. Du, Y. Wang, and Y. Qiao. RPAN: An end-to-end recurrent
    pose-attention network for action recognition in videos. In Proceedings of
    the IEEE International Conference on Computer Vision, pages 3725–3734,
    2017.
    [19] Matyáš Boháček and Marek Hrúz. Sign pose-based transformer for
    word-level sign language recognition. In WACV, 2022.
    [20] mmDetection. https://github.com/open-mmlab/mmdetection
    [21] Detectron2. https://github.com/facebookresearch/detectron2
    [22] Mediapipe. https://mediapipe.dev
    [23] MaskRCNN-benchmark.
    https://github.com/facebookresearch/maskrcnn-benchmark
    [24] Ross Girshick. Fast R-CNN. In ICCV, 2015.
    [25] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN:
    Towards real-time object detection with region proposal networks. IEEE
    Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6,
    2017.
    [26] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask
    R-CNN. In ICCV, 2017.
    [27] Detectron. https://github.com/facebookresearch/Detectron
    [28] Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram,
    Kenneth DeHaan, Florian Metze, Jordi Torres, Xavier Giro-i-Nieto.
    How2Sign: A Large-scale Multimodal Dataset for Continuous American
    Sign Language. In CVPR, 2021.
    [29] Anirudh Tunga, Sai Vidyaranya Nuthalapati, and Juan Wachs. Pose-based
    sign language recognition using GCN and BERT. In WACV, 2021.
