| Graduate Student: | Avirmed Enkhbat (赫巴特) |
|---|---|
| Thesis Title: | Action Recognition for Music Instruments Playing based on Deep Learning (基於深度學習的樂器演奏動作識別) |
| Advisor: | Timothy K. Shih (施國琛) |
| Committee Members: | |
| Degree: | Doctor |
| Department: | Department of Computer Science & Information Engineering, College of Electrical Engineering & Computer Science |
| Year of Publication: | 2024 |
| Graduating Academic Year: | 113 |
| Language: | English |
| Pages: | 78 |
| Keywords: | Human action recognition, Image segmentation, Graph convolutional networks (GCN), Temporal convolutional networks (TCN), Spatial temporal attention graph convolutional network (STA-GCN), instrument, erhu, morin khuur |
Human action recognition (HAR) in musical instrument performance is an important research area that leverages artificial intelligence (AI) to enhance music education and performance evaluation. This thesis integrates two distinct research approaches: the first focuses on identifying errors while playing the erhu, and the second explores musical note recognition for the morin khuur. Both approaches utilize deep learning techniques to develop models capable of recognizing complex movements and patterns associated with musical instrument performance.
The erhu study applies Graph Convolutional Networks (GCN) and Temporal Convolutional Networks (TCN) to capture both the spatial and temporal relationships of human skeletal movements. The primary objective of this research is to detect performance errors such as incorrect hand positioning, improper bow angles, and posture issues. The system analyzes musicians' body movements, identifies the technical errors that affect performance, and suggests ways to improve playing technique.
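The GCN/TCN pairing described above can be illustrated with a minimal sketch: a spatial graph-convolution step aggregates each joint's feature with its skeletal neighbours, and a temporal convolution then slides over one joint's feature sequence across frames. This is not the thesis implementation; the 3-joint skeleton, adjacency matrix, feature values, and averaging kernel below are illustrative assumptions only.

```python
# Minimal GCN + TCN sketch (illustrative only, not the thesis model).

def graph_conv(features, adjacency):
    """Spatial step: mean-pool each joint's feature with its neighbours'."""
    n = len(features)
    out = []
    for i in range(n):
        neighbours = [j for j in range(n) if adjacency[i][j]]
        out.append(sum(features[j] for j in neighbours) / len(neighbours))
    return out

def temporal_conv(series, kernel):
    """Temporal step: 1-D valid convolution along the time axis."""
    k = len(kernel)
    return [sum(series[t + i] * kernel[i] for i in range(k))
            for t in range(len(series) - k + 1)]

# Toy skeleton: joint 0 (shoulder) - joint 1 (elbow) - joint 2 (wrist),
# with self-loops so each joint also keeps its own feature.
adjacency = [[1, 1, 0],
             [1, 1, 1],
             [0, 1, 1]]

frames = [[0.0, 1.0, 2.0],   # per-joint scalar feature at t=0
          [1.0, 2.0, 3.0],   # t=1
          [2.0, 3.0, 4.0]]   # t=2

spatial = [graph_conv(f, adjacency) for f in frames]    # GCN step per frame
wrist_over_time = [f[2] for f in spatial]               # one joint's series
smoothed = temporal_conv(wrist_over_time, [0.5, 0.5])   # TCN-style step
print(smoothed)  # → [2.0, 3.0]
```

In a real skeleton-based model the per-joint features are learned multi-channel embeddings and both steps use trainable weights, but the aggregation pattern is the same: neighbours in space first, then a window over time.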
On the other hand, the morin khuur performance analysis focuses on musical note recognition rather than error detection. This study uses Spatial Temporal Attention Graph Convolutional Networks (STA-GCN) to capture the relationship between hand keypoints and instrument segmentation information in order to recognize the musical notes being played. The system analyzes the continuous gestures of musicians and maps them to the corresponding musical notes. The note recognition model achieved an accuracy of 81.4%, demonstrating its potential for analyzing musical compositions through gesture recognition.
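The attention component of an STA-GCN can be sketched in miniature: a softmax over per-joint scores re-weights hand-keypoint features so that the joints most informative for the stopped note dominate the pooled embedding. This is a hypothetical sketch, not the thesis model; the fingertip features, attention scores, and the stand-in note threshold are all invented for illustration.

```python
import math

# Spatial-attention sketch over hand keypoints (illustrative values only).

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Weight each joint's feature by its attention score, then pool."""
    weights = softmax(scores)
    return sum(w * f for w, f in zip(weights, features))

# Toy hand: 4 fingertip keypoints with scalar features, and attention
# scores favouring the joint that determines the stopped note.
fingertip_features = [0.2, 0.9, 0.4, 0.1]
attention_scores = [0.1, 2.0, 0.3, 0.1]

pooled = attend(fingertip_features, attention_scores)

# A trained classifier head would map the pooled embedding to a note
# class; a hypothetical threshold stands in for it here.
note = "D4" if pooled > 0.5 else "C4"
print(round(pooled, 3), note)
```

In the full model the scores are produced by learned attention branches over both joints (spatial) and frames (temporal), and the instrument segmentation mask supplies additional context about where the hand sits on the neck.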
To support these two approaches, comprehensive datasets were developed, involving both professional musicians and beginners playing the erhu and morin khuur. These datasets capture the intricacies of musical performance, including hand movements, finger positioning, bowing techniques, and note transitions. Separate datasets were built for each instrument to handle their unique playing techniques and performance dynamics.
The erhu-based error detection system demonstrated high accuracy, achieving 97.6% in recognizing playing actions and identifying common errors in hand and posture alignment. In contrast, the morin khuur note recognition system achieved an accuracy of 81.4% in recognizing musical notes from player gestures. Both systems showed potential in their respective areas, highlighting the effectiveness of combining deep learning models with musical performance analysis. Future work will focus on expanding the systems to additional instruments and optimizing the models to improve recognition and correction performance.
This research contributes to the field of music technology by offering distinct AI-driven solutions for improving both performance accuracy and note recognition in traditional musical instruments. By applying deep learning techniques tailored to the specific needs of erhu and morin khuur performances, this thesis offers a novel approach to action recognition and musical analysis.