A CTRGCN-based model for Isolated Sign Language Recognition

Author:	董致輔 Chih-Fu Tung
Thesis title:	A CTRGCN-based model for Isolated Sign Language Recognition (基於通道拓樸增強圖卷積神經網絡之手語單詞辨識演算法)
Advisor:	蘇木春 Mu-Chun Su
Oral examination committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Year of publication:	2024
Academic year of graduation:	112 (ROC calendar)
Language:	Chinese
Number of pages:	50
Keywords:	Deep learning, Skeleton recognition, Sign language recognition, Graph convolutional neural network

In recent years, the number of people with hearing impairments has gradually
increased, and public demand for learning sign language has risen year by year.
However, sign language is difficult to learn and learning resources are limited,
which makes learning it a challenging task.
To address this problem, this thesis proposes a skeleton-based sign language
word recognition algorithm built on the Channel-wise Topology Refinement Graph
Convolutional Network (CTRGCN). For sign language word recognition, we design
an improved CTRGCN model and propose a multi-branch architecture to raise
recognition accuracy. We train the model on the WLASL100 dataset and compare it
with existing models. The results show that our method outperforms existing
techniques in most settings, demonstrating its potential and practicality for
sign language word recognition. We hope this approach can provide more help
for sign language learning.

I. Introduction
1 Research Motivation
2 Research Objectives
3 Thesis Organization
II. Background and Literature Review
1 Background
1.1 Sign Languages
1.2 Types of Sign Language Recognition
1.3 Introduction to Graph Convolution (GCN)
2 Literature Review
2.1 Related Work on Keypoint Detection
2.2 Related Work on Skeleton-Based Action Recognition
2.3 Related Work on 3D-CNN-Based Video Recognition
2.4 Related Work on Skeleton-Based Sign Language Word Recognition
III. Methodology
1 System Architecture
2 Preprocessing
3 Model Architecture
3.1 CTRGCN Model
3.2 Modified CTRGCN Model
3.3 Multi-Branch Architecture
3.4 Method for Merging Model Results
3.5 Fusion with RGB Results
IV. Experimental Design and Results
1 Dataset
2 Experimental Setup
3 Evaluation of Experimental Results
3.1 Comparison of Additional Branch Results
3.2 Comparison of Branch Merging Methods
3.3 Effect of Reducing the Number of Layers
3.4 Effect of the Module Modifications
3.5 Comparison of Different Branch Combinations
3.6 Comparison of Different Parameters
3.7 Comparison with Existing Sign Language Word Recognition Models
V. Conclusion
1 Conclusions
2 Future Work
References

