| Graduate Student: | 陳秉揚 Ping-Yang Chen |
|---|---|
| Thesis Title: | 基於卷積與孿生神經網路之語者辨識系統 A Speaker Recognition System Based on Convolution and Siamese Neural Network |
| Advisor: | 莊堯棠 |
| Oral Defense Committee: | |
| Degree: | Master |
| Department: | College of Information and Electrical Engineering, Department of Electrical Engineering |
| Year of Publication: | 2019 |
| Academic Year of Graduation: | 107 |
| Language: | Chinese |
| Pages: | 77 |
| Keywords: | Convolutional Neural Network, Siamese Neural Network, Speaker Recognition |
Speaker recognition systems can be divided into two categories according to their application: speaker identification and speaker verification. This thesis designs a convolutional neural network (CNN) architecture for speaker identification that learns to distinguish the speech features of different speakers, and compares the identification performance obtained under different initial learning rates, weight-initialization methods, and feature-extraction methods. The speaker verification model is implemented with a Siamese neural network, which decides whether two speakers are the same by computing the distance between the enrolled speaker and the test speaker in the learned feature space. Finally, both the identification and verification models are wrapped in a graphical user interface so that they are convenient to use.