| Graduate Student: | 費群安 Arda Satata Fitriajie |
|---|---|
| Thesis Title: | 以多特徵神經網路實現連續手語識別 Realizing Sign Language Recognition using Multi-Feature Neural Network |
| Advisor: | 施國琛 Prof. Timothy K. Shih |
| Degree: | 碩士 Master |
| Department: | 資訊電機學院 - 資訊工程學系 Department of Computer Science & Information Engineering |
| Year of Publication: | 2022 |
| Graduating Academic Year: | 110 |
| Language: | English |
| Pages: | 63 |
| Keywords (Chinese): | 圖像處理、視頻處理、連續手語識別、手勢識別、關鍵點 |
| Keywords (English): | Image Processing, Video Processing, Continuous Sign Language Recognition, Gesture Recognition, Keypoint |
Given RGB video streams, we aim to correctly recognize the signs involved in continuous sign language recognition (CSLR). Although an increasing number of deep learning methods have been proposed in this area, most of them rely on RGB features alone, either the full-frame image or details of the hands and face. This scarcity of information during CSLR training heavily constrains a model's capability to learn multiple features within the input video frames. Multi-feature networks have now become quite common, since current computing power no longer prevents us from scaling up network size. Thus, in this thesis, we build a deep learning network and apply a multi-feature technique with the aim of improving the current state of the art in continuous sign language recognition. Specifically, the additional feature we include in this research is the keypoint feature, which is far less heavy than the image feature. The results of this research show that adding a keypoint feature as an extra modality can increase the recognition rate, or equivalently, decrease the word error rate (WER), on the two most popular CSLR datasets: Phoenix2014 and Chinese Sign Language.
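The word error rate (WER) used above is the standard CSLR metric: the word-level edit distance (substitutions + insertions + deletions) between the predicted gloss sequence and the reference, divided by the reference length. A minimal sketch of this metric (the function name and the gloss strings in the example are illustrative, not taken from the thesis):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words (classic Levenshtein DP).
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)
```

For example, `wer("MONTAG REGEN", "MONTAG VIEL REGEN")` counts one inserted gloss against a two-gloss reference, giving 0.5. A lower WER means better recognition.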