

Graduate Student: Arda Satata Fitriajie (費群安)
Thesis Title: Realizing Sign Language Recognition using Multi-Feature Neural Network (以多特徵神經網路實現連續手語識別)
Advisor: Prof. Timothy K. Shih (施國琛)
Oral Defense Committee:
Degree: Master
Department: Department of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Academic Year of Graduation: 110 (ROC calendar, 2021–2022)
Language: English
Number of Pages: 63
Keywords: Image Processing, Video Processing, Continuous Sign Language Recognition, Gesture Recognition, Keypoint


Abstract: Given RGB video streams, we aim to correctly recognize the signs involved in continuous sign language recognition (CSLR). Although a growing number of deep learning methods have been proposed in this area, most of them rely on RGB features alone, whether the full-frame image or details of the hands and face. This scarcity of information during CSLR training heavily constrains their ability to learn multiple features from the input video frames. Multi-feature networks have become fairly common, since current computing power no longer prevents us from scaling up network size. In this thesis, we therefore build a deep learning network that applies a multi-feature technique, with the aim of improving the current state of the art in continuous sign language recognition. Specifically, the additional feature we include in this research is the keypoint feature, which is computationally much lighter than the image feature. The results show that adding the keypoint feature as a second modality increases the recognition rate, that is, lowers the word error rate (WER), on the two most popular CSLR datasets: Phoenix2014 and Chinese Sign Language.
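To make the multi-feature design concrete, the following is a minimal PyTorch sketch of the pipeline the abstract describes, not the thesis's actual implementation: a full-frame RGB stream (a 2D CNN backbone applied per frame) and a lighter keypoint stream (an MLP over joint coordinates) are fused frame by frame, passed through a BiLSTM temporal module, and trained with CTC loss, mirroring the spatial module, temporal module, and sequence-learning stages outlined in Chapter 3. The specific sizes (133 whole-body keypoints, a 1296-gloss vocabulary, a ResNet-18 backbone) are illustrative assumptions, not values taken from the thesis.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class TwoStreamCSLR(nn.Module):
        def __init__(self, num_keypoints=133, vocab_size=1296, hidden=512):
            super().__init__()
            # Full-frame stream: 2D CNN backbone applied independently to each frame.
            backbone = models.resnet18(weights=None)
            backbone.fc = nn.Identity()            # keep the 512-d pooled feature
            self.frame_net = backbone
            # Keypoint stream: a small MLP over flattened (x, y) joint coordinates;
            # far lighter than the image stream, as the abstract argues.
            self.kp_net = nn.Sequential(
                nn.Linear(num_keypoints * 2, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
            )
            # Temporal module: BiLSTM over the fused per-frame features.
            self.temporal = nn.LSTM(512 + 256, hidden, num_layers=2,
                                    bidirectional=True, batch_first=True)
            # Gloss classifier; CTC needs one extra class for the blank symbol.
            self.classifier = nn.Linear(2 * hidden, vocab_size + 1)

        def forward(self, frames, keypoints):
            # frames:    (B, T, 3, H, W) RGB clip
            # keypoints: (B, T, num_keypoints, 2) normalized joint coordinates
            B, T = frames.shape[:2]
            f = self.frame_net(frames.flatten(0, 1)).view(B, T, -1)  # (B, T, 512)
            k = self.kp_net(keypoints.flatten(2))                    # (B, T, 256)
            fused, _ = self.temporal(torch.cat([f, k], dim=-1))      # (B, T, 2*hidden)
            return self.classifier(fused)                            # (B, T, vocab+1)

    # One training step with CTC loss on dummy data
    # (nn.CTCLoss expects time-major log-probabilities).
    model = TwoStreamCSLR()
    ctc = nn.CTCLoss(blank=model.classifier.out_features - 1, zero_infinity=True)
    frames = torch.randn(2, 16, 3, 224, 224)
    keypoints = torch.rand(2, 16, 133, 2)
    targets = torch.randint(0, 1296, (2, 5))            # gloss label sequences
    log_probs = model(frames, keypoints).log_softmax(-1).transpose(0, 1)
    loss = ctc(log_probs, targets,
               input_lengths=torch.full((2,), 16),
               target_lengths=torch.full((2,), 5))

The WER quoted in the results is the standard edit-distance metric over gloss sequences: substitutions, deletions, and insertions divided by the reference length, WER = (S + D + I) / N. A self-contained sketch:

    # Word error rate via Levenshtein alignment of gloss sequences.
    def wer(reference, hypothesis):
        r, h = reference, hypothesis
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i                              # all-deletions column
        for j in range(len(h) + 1):
            d[0][j] = j                              # all-insertions row
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(r)][len(h)] / max(len(r), 1)

    assert wer("A B C D".split(), "A X C".split()) == 0.5  # 1 sub + 1 del over 4 words

A lower WER therefore corresponds directly to the higher recognition rate the abstract refers to.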

    LIST OF CONTENT

    Abstract
    摘要
    List of Content
    List of Figure
    List of Tables
    Chapter 1. Introduction
        1.1 General Introduction
        1.2 Objective of Research
        1.3 Scope of the Study
        1.4 Thesis Outline
    Chapter 2. Literature Review
        2.1 Isolated Sign Language Recognition
        2.2 Continuous Sign Language Recognition
        2.3 Keypoint-based Action Recognition
        2.4 Convolutional Neural Network
        2.5 Bidirectional LSTM Networks
        2.6 Connectionist Temporal Classification
        2.7 Multi-features Approach
        2.8 Self-Attention
    Chapter 3. Research Method
        3.1 Framework Overview
        3.2 Dataset
            3.2.1 Phoenix2014
            3.2.2 Chinese Sign Language (CSL-100)
        3.3 Data Pre-processing
            3.3.1 Data Augmentation
                3.3.1.1 Random Crop
                3.3.1.2 Horizontal Flip
                3.3.1.3 Random Temporal Scaling
            3.3.2 Key-point Extraction
        3.4 Spatial Module
            3.4.1 Full Frame Feature
            3.4.2 Keypoint Feature
        3.5 Temporal Module
        3.6 Sequence Learning
        3.7 Evaluation Metric
        3.8 Loss Function
        3.9 Self-Attention
            3.9.1 Spatial Attention
            3.9.2 Early Temporal Attention
            3.9.3 Proposed Late Temporal Attention
    Chapter 4. Experiment Result & Discussion
        4.1 Experiment Settings
        4.2 Experiment on Input Streams
        4.3 Experiment on Attention Module
        4.4 Experiment on Proposed Model
            4.4.1 Quantitative Result
            4.4.2 Qualitative Result
    Chapter 5. Conclusion and Discussion
        5.1 Conclusion
        5.2 Discussion & Future Works
    References

