| Graduate Student: | 陳秉揚 Ping-Yang Chen |
|---|---|
| Thesis Title: | 基於卷積與孿生神經網路之語者辨識系統 A Speaker Recognition System Based on Convolution and Siamese Neural Network |
| Advisor: | 莊堯棠 |
| Oral Defense Committee: | |
| Degree: | Master |
| Department: | College of Information and Electrical Engineering, Department of Electrical Engineering |
| Year of Publication: | 2019 |
| Academic Year of Graduation: | 107 |
| Language: | Chinese |
| Pages: | 77 |
| Keywords: | Convolutional Neural Network, Siamese Neural Network, Speaker Recognition |
Speaker recognition systems can be divided into two categories according to their application: speaker identification and speaker verification. This thesis designs a convolutional neural network (CNN) architecture for speaker identification that learns to distinguish the speech features of different speakers, and compares the identification performance obtained under different initial learning rates, weight-initialization methods, and feature-extraction methods. The speaker verification model is implemented with a Siamese neural network, which decides whether two speakers are the same by computing the distance between the enrolled speaker and the test speaker in the learned feature space. Finally, both the identification and verification models are wrapped in a graphical user interface so that they are convenient to use.