| Graduate Student: | 陳昱任 Yu-Jen Chen |
|---|---|
| Thesis Title: | Suitable Data Input for Deep-Learning-Based Sign Language Recognition with a Small Training Dataset (適用於少量訓練資料之深度學習手語辨識輸入組合) |
| Advisor: | 蘇柏齊 Po-Chyi Su |
| Oral Defense Committee: | |
| Degree: | Master |
| Department: | Department of Computer Science & Information Engineering, College of Electrical Engineering & Computer Science |
| Year of Publication: | 2022 |
| Academic Year: | 110 |
| Language: | Chinese |
| Pages: | 55 |
| Keywords (Chinese): | Sign Language Recognition, Feature Extraction, Deep Learning |
| Keywords (English): | Sign Language Recognition, Feature Extraction, Deep Learning |
Deep-learning-based sign language recognition usually requires a large number of videos to train neural network models. This study considers the case where sign language videos are scarce: through feature extraction and training-data expansion, we generate effective sign language training data to help construct a deep-learning recognition model. We use MediaPipe to extract hand skeletons from sign language videos, analyze several hand-skeleton adjustment strategies and color arrangements, and generate hand masks from the skeletons to simulate the hand shapes of different signers. Because hand detection sometimes fails due to the motion blur caused by fast-moving fingers, we incorporate optical-flow maps to ensure that every frame retains hand-movement information. The hand skeleton, hand shape, and frame optical flow serve as the three channel inputs of a 3D-ResNet model, and different spatial transformations and temporal sampling strategies simulate hands of different sizes, different camera angles, different signing speeds, and similar variations. Experimental results show that the proposed approach effectively improves recognition accuracy on an American Sign Language dataset.
Keywords - Sign Language Recognition, Feature Extraction, Deep Learning
Deep-learning-based sign language recognition usually requires a large
number of sign language videos to train neural network models. In this study,
we consider generating effective training data, through feature extraction and
expansion of the training data, to help construct deep-learning recognition
models when only a small number of sign language videos are available for
training. We use MediaPipe to obtain the hand skeleton from each sign language
video, analyze several hand-skeleton adjustment policies and color arrangements,
and generate hand masks from the skeleton to simulate the hands of different
signers. Since hand detection may fail because of the motion blur caused by
rapid hand movements, we incorporate optical flow to ensure that hand-movement
information is retained in every frame. The hand skeleton, hand mask, and
optical flow form the three channel inputs of a 3D-ResNet model. We apply
different spatial and temporal processing strategies to simulate different hand
sizes, filming angles, and hand speeds. Experimental results show that the
proposed approach effectively improves the accuracy of sign language
recognition on an American Sign Language dataset.
Index Terms - Sign Language Recognition, Feature Extraction, Deep Learning
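The data-preparation idea described in the abstract, stacking three single-channel maps (hand skeleton, hand mask, optical-flow magnitude) into one three-channel frame, and resampling frame indices to simulate faster or slower signing, can be sketched as follows. This is a minimal illustration only; the function names, the centred sampling window, and the list-based frame representation are assumptions for the sketch, not the thesis's actual implementation:

```python
def sample_frames(num_frames, clip_len, speed=1.0):
    """Pick `clip_len` frame indices from a video of `num_frames` frames.

    speed > 1 spreads the indices over more source frames (simulating a
    faster signer); speed < 1 repeats nearby frames (a slower signer).
    The sampled window is centred in the video.
    """
    span = min(num_frames, max(1, round(clip_len * speed)))
    start = (num_frames - span) // 2
    if clip_len == 1:
        return [start]
    step = (span - 1) / (clip_len - 1)
    return [min(num_frames - 1, start + round(i * step))
            for i in range(clip_len)]

def stack_channels(skeleton, mask, flow_mag):
    """Combine three H x W single-channel maps into one H x W x 3 frame,
    mirroring the skeleton / hand-mask / optical-flow channel layout."""
    h, w = len(skeleton), len(skeleton[0])
    assert all(len(m) == h and len(m[0]) == w for m in (mask, flow_mag))
    return [[[skeleton[y][x], mask[y][x], flow_mag[y][x]]
             for x in range(w)]
            for y in range(h)]
```

A clip for the 3D CNN would then be built by applying `stack_channels` to each index returned by `sample_frames`; varying `speed` across training epochs yields the hand-speed augmentation the abstract mentions.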