
Author: 徐嘉彤 (Chia-Tung Hsu)
Title: A Hand Pose Estimation Method Based on EANet with Colored Glove Images (彩色手套影像下基於 EANet 的手部姿態預測方法)
Advisor: 蘇木春 (Mu-Chun Su)
Committee members:
Degree: Master
Department: College of Electrical Engineering & Computer Science - Department of Computer Science & Information Engineering
Year of publication: 2024
Academic year of graduation: 112
Language: Chinese
Pages: 67
Keywords (Chinese): 深度學習、電腦視覺、電腦圖學、影像處理、3D 手部姿態辨識
Keywords (English): Deep Learning, Computer Vision, Computer Graphics, Image Processing, 3D Hand Pose Estimation
    Abstract (translated from the Chinese): Taiwan has more than 130,000 people with hearing impairments, and sign language is their primary means of communication. Accurate hand pose estimation models are essential for applications such as sign language translation and recognition. However, because of interactions between the two hands and hand occlusions, this task is a major challenge for single-camera RGB images. This study therefore aims to improve hand pose estimation results in two-hand sign language scenarios.
    This thesis proposes a hand pose estimation method that applies the Extract-and-adaptation network (EANet) together with colored gloves, optimized for sign language images of colored gloves. We augment finger information by rendering the dataset into colored gloves, train the Transformer-based EANet on it, and then apply several image processing techniques to refine the predicted hand keypoints. Experimental results show that, on the colored-glove sign language dataset, the proposed method detects both hands completely with 55% higher stability than MediaPipe, and it also achieves better results on the test set than an EANet trained on the original dataset.


    Abstract: With over 130,000 hearing-impaired individuals in Taiwan, sign language serves as their primary mode of communication. Accurate hand pose estimation models are crucial for applications such as sign language translation and recognition. However, interactions between the two hands and mutual occlusions make this task a significant challenge for single-view RGB images. This study aims to enhance hand pose estimation in two-hand sign language scenarios.
    This research proposes a hand pose estimation method that uses the Extract-and-adaptation network (EANet) together with colored gloves, optimized for sign language images with colored gloves. We enrich finger information by rendering the dataset into colored gloves, train the Transformer-based EANet on it, and then apply multiple image processing techniques to refine the predicted hand keypoints. Experimental results demonstrate that, compared with MediaPipe, our method detects both hands with 55% higher stability on the sign language dataset, and it also yields superior results on the test set compared to an EANet trained on the original dataset.

    Table of Contents
    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    1. Introduction
       1.1 Motivation
       1.2 Objectives
       1.3 Thesis Organization
    2. Literature Review
       2.1 Applications of Hand Pose Estimation
       2.2 Related Work on Hand Pose Estimation
           2.2.1 Sensor-Based Hand Pose Estimation Methods
           2.2.2 Depth-Based Hand Pose Estimation Methods
           2.2.3 RGB-Image-Based Hand Pose Estimation Methods
    3. Methodology
       3.1 Dataset Preprocessing
           3.1.1 Datasets
           3.1.2 Dataset Rendering
       3.2 Hand Pose Estimation
       3.3 Sign Language Image Preprocessing
           3.3.1 Multi-Color Mask Extraction in HSV Color Space
           3.3.2 Noise Removal
           3.3.3 Linear Color Transformation in HSV Color Space
           3.3.4 Image Contrast Enhancement
    4. Experimental Design and Results
       4.1 Sign Language Dataset with Hand Occlusion
           4.1.1 Dataset Capture
           4.1.2 Data Annotation
       4.2 EANet Experimental Results
           4.2.1 Implementation Details
           4.2.2 Evaluation Metrics
           4.2.3 EANet Training Results
       4.3 Sign Language Dataset Experiments
           4.3.1 Results of Different Models
           4.3.2 Two-Hand Detection Experiment
           4.3.3 Analysis of Image Processing Experiments
       4.4 Training Results of State-of-the-Art Models on the Colored Glove Dataset
    5. Conclusion
       5.1 Conclusions
       5.2 Future Work
    References
    Appendix A: Sign Language Datasets
       A.1 Glove Sign Language Dataset
       A.2 Bare-Hand Sign Language Dataset

