| Graduate Student: | 楊恕先 (Shu-Sian Yang) |
|---|---|
| Thesis Title: | 基於卷積神經網路之語音辨識 (Speech Recognition by Using Convolutional Neural Network) |
| Advisor: | 莊堯棠 |
| Committee Members: | |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Information and Electrical Engineering |
| Publication Year: | 2019 |
| Academic Year of Graduation: | 107 |
| Language: | Chinese |
| Number of Pages: | 64 |
| Chinese Keywords: | 語音辨識, 深度學習, 神經網路 |
| English Keywords: | speech recognition, deep learning, neural network |
This thesis investigates how deep learning can be applied to speech recognition. The method first extracts speech feature parameters using Mel-frequency cepstral coefficients (MFCCs) and then feeds them into a convolutional neural network (CNN) for recognition.
The main difference from traditional speech recognition methods is that no acoustic model needs to be built; for Chinese, for example, this saves the time otherwise spent building and matching a large number of consonant (聲母) and vowel (韻母) models. Once the feature parameters have been obtained through MFCCs, speech recognition can be carried out by the CNN, without restriction to any particular language.
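The MFCC front end described above rests on the mel scale, which warps linear frequency to approximate human pitch perception; triangular filters are then spaced uniformly on that scale. As a minimal, hedged sketch (function names are illustrative and not taken from the thesis), the standard mel conversion and the resulting filter-bank center frequencies can be written as:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mel-scale conversion back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(f_min: float, f_max: float, n_filters: int) -> list:
    """Edge/center frequencies (Hz) for a bank of triangular mel filters,
    spaced uniformly on the mel scale as in typical MFCC pipelines.
    Returns n_filters + 2 points: each filter i spans edges[i]..edges[i+2]."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]
```

In a full pipeline, the log energies of these filters would be decorrelated with a discrete cosine transform to yield the cepstral coefficients.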
This thesis develops a method for automatic speech recognition. Speech feature parameters are obtained through Mel-frequency cepstral coefficients (MFCCs) and fed into a convolutional neural network (CNN). The main difference between this CNN-based method and traditional speech recognition methods is that it does not need to establish an acoustic model; in Chinese, for example, this avoids the substantial effort of building a large number of consonant and vowel models. After the feature parameters are obtained through the MFCCs, speech recognition is carried out by the CNN.
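The CNN stage slides learned kernels over the MFCC time–frequency map. A hedged, pure-Python sketch of the basic building block, a one-dimensional "valid" convolution followed by a ReLU activation (shapes and names are illustrative, not the thesis's actual architecture):

```python
def conv1d_valid(signal, kernel):
    """1-D 'valid' convolution (implemented as cross-correlation,
    which is what CNN layers actually compute)."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(n - k + 1)]

def relu(xs):
    """Rectified linear unit applied elementwise."""
    return [max(0.0, x) for x in xs]
```

A real implementation would stack 2-D versions of this operation with pooling and a softmax classifier, with the kernel weights learned by backpropagation rather than fixed.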