
Author: Binni Zhang (張檳妮)
Thesis title: 基於人體骨骼圖的手語動作偵測與辨識
(Skeleton-based continuous sign language action detection and recognition)
Advisor: Din-Chang Tseng (曾定章)
Committee members:
Degree: Master
Department: Department of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2020
Graduation academic year: 108
Language: Chinese
Number of pages: 66
Chinese keywords: 動作偵測、動作分類 (action detection, action classification)
Foreign keywords: action detection, action recognition


    The application of deep learning to image recognition has advanced tremendously over the past decade. Deep learning methods are now routinely used in many areas of daily life to automatically analyze visual data and extract the required information, and most of these attempts have achieved very good results. With improvements in computer hardware and the continued optimization of algorithms, such research has expanded from processing image data to processing video data.
    People with hearing or speech impairments often face many inconveniences in daily life, especially when communicating with people who do not have such impairments. Throughout the development of deep learning and image recognition, researchers have tried to help them through computers, so that they can use their familiar sign language to communicate freely with non-signers. This thesis combines deep learning and video processing technology to develop sign language action detection and recognition: we build a deep learning network that processes sign language video and translates it into text. To achieve this goal, the network can be divided into three stages.
    The first stage is skeleton feature extraction. Human motion video usually contains much information unrelated to the action, such as background, clothing, and hairstyle. To exclude the influence of this irrelevant information, we first use the OpenPose module to extract a human skeleton graph from every frame of the video, keeping only motion-related information. A special convolutional network, the graph convolutional network (GCN), then extracts action features from the skeleton graphs. The second stage segments candidate action clips: after obtaining per-frame action features, a small convolutional network locates the start and end time points of actions and, combined with a preliminary per-frame judgment of whether an action is occurring, segments candidate action clips from the full video. The third stage classifies these candidate clips and removes temporally overlapping clips with non-maximum suppression to obtain the final action detection and recognition results.
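The graph convolution used in the first stage can be sketched as follows. This is a minimal illustration using the symmetric normalized-adjacency formulation of Kipf and Welling, not the thesis's exact implementation; the 5-joint skeleton, feature sizes, and function name are assumptions for illustration.

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution layer: H = ReLU(D^-1/2 (A+I) D^-1/2 X W).

    X: (num_joints, in_feats) per-frame joint features
    A: (num_joints, num_joints) skeleton adjacency (1 where joints are linked)
    W: (in_feats, out_feats) learnable weights
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^-1/2
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(0.0, A_norm @ X @ W)    # aggregate neighbors, then ReLU

# Toy 5-joint chain (e.g. shoulder-elbow-wrist) with 2D keypoint coordinates
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
X = np.random.randn(5, 2)       # (x, y) coordinates per joint
W = np.random.randn(2, 8)       # project 2 -> 8 feature channels
H = gcn_layer(X, A, W)
print(H.shape)                  # (5, 8): 8 features per joint
```

Because the adjacency encodes the skeleton topology, each joint's output feature mixes only its own and its neighbors' inputs, which is what lets the network ignore background pixels entirely.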
    In the experiments we use the CSLR (Chinese Sign Language Recognition) dataset, which consists of 2D RGB sign language videos, each less than 10 seconds long, with the signer facing the camera. We take 15 continuous sign language sentences and classify the 31 words they contain. The three modules are trained by alternately freezing some parameters. After training, we compared our results with other sign language recognition networks evaluated on the same dataset; our sentence accuracy reached 84.5%.
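Sentence accuracy, as commonly defined in continuous sign language recognition, counts a sentence as correct only when the full predicted word sequence matches the reference exactly; the abstract does not spell out its definition, so the following sketch (function name and toy data are illustrative) assumes that convention:

```python
def sentence_accuracy(predictions, references):
    """Fraction of sentences whose predicted word sequence exactly
    matches the reference word sequence."""
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(references)

preds = [["I", "want", "water"], ["thank", "you"]]
refs  = [["I", "want", "water"], ["thank", "you", "much"]]
print(sentence_accuracy(preds, refs))  # 0.5: only the first sentence matches
```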
    This thesis has two main features. First, actions are represented by human skeleton keypoint graphs and features are extracted with graph convolutions, filtering out background features unrelated to the action. Second, a small convolutional network detects the start and end time points of actions, which requires little computation and allows flexible action lengths.
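The non-maximum suppression step that removes temporally overlapping candidate clips in the third stage can be sketched as follows; the IoU threshold and the (start, end, score) clip format are assumptions for illustration, not the thesis's exact parameters:

```python
def temporal_iou(a, b):
    """Temporal IoU of two clips given as (start, end) in frames."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(clips, iou_thresh=0.5):
    """clips: list of (start, end, score). Greedily keep the highest-scoring
    clip, then drop any remaining clip whose overlap with a kept clip
    exceeds iou_thresh."""
    clips = sorted(clips, key=lambda c: c[2], reverse=True)
    kept = []
    for c in clips:
        if all(temporal_iou(c[:2], k[:2]) < iou_thresh for k in kept):
            kept.append(c)
    return kept

proposals = [(10, 40, 0.9), (12, 42, 0.8), (60, 90, 0.7)]
print(temporal_nms(proposals))  # [(10, 40, 0.9), (60, 90, 0.7)]
```

The second proposal overlaps the first almost entirely (IoU 0.875) and is suppressed, while the non-overlapping third proposal survives, so each signed word ends up covered by at most one detected clip.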

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1 Research motivation
      1.2 System architecture
      1.3 Thesis organization
    Chapter 2  Related Work
      2.1 Skeleton-based gesture recognition
      2.2 Graph convolution
      2.3 Sign language recognition
    Chapter 3  Overall Network Architecture
      3.1 Skeleton feature extraction module
      3.2 Candidate clip segmentation module
      3.3 Action classification module
    Chapter 4  Experimental Results and Discussion
      4.1 Experimental equipment
      4.2 Convolutional neural network training
      4.3 Evaluation criteria and experimental results
    Chapter 5  Conclusions and Future Work
      5.1 Conclusions
      5.2 Future work
    References
    Appendix 1  Dataset classification

