| Graduate student: | 凌欣暉 xing-hung lan |
|---|---|
| Thesis title: | 強健性語音辨識及語者確認之研究 A Study of Robust Speech Recognition and Speaker Verification |
| Advisor: | 莊堯棠 Yau-Tarng Juang |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Graduation academic year: | 98 |
| Language: | Chinese |
| Pages: | 271 |
| Keywords (Chinese): | 語音辨識, 語者確認, 支撐向量機, 強健特徵參數, 關鍵詞萃取 |
| Keywords (English): | keyword spotting, speech recognition, speaker verification, support vector machine, robust features |
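This thesis is divided into three parts: keyword spotting, normalization of feature statistics, and speaker verification. For keyword spotting, right-context-dependent phone models at the sub-syllable level are concatenated to build the keyword and filler modules.
Speech recognition systems often suffer a large drop in recognition rate when the operating environment does not match the training environment. Feature statistics normalization techniques have the advantages of low complexity and fast computation. This thesis evaluates them on the Aurora 2 corpus: combining histogram equalization with an ARMA low-pass filter raises the recognition rate of histogram equalization from 84.93% to 86.37%, and combining histogram equalization with an adaptive ARMA filter raises it to 86.91%.
The speaker verification system uses a parametric kernel to combine Gaussian mixture models (GMM) with support vector machines (SVM) in order to improve system performance. A supervector is built from each speaker's GMM parameters and is corrected with nuisance attribute projection (NAP). In the training stage the supervectors are normalized, and the normalized supervectors are then used to train the SVM model. For impostor selection, the top n impostor utterances whose features are most similar to the target speaker are chosen, which makes the trained SVM model more discriminative. At test time, test score normalization is used to adjust the distance values. Experiments on the NIST 2001 corpus show that the verification system using a 64-mixture parametric kernel (NAP) combined with test score normalization achieves the best equal error rate (EER) and decision cost function (DCF), 4.17% and 0.0491 respectively.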
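The NAP step can be illustrated with a minimal sketch. It assumes the common formulation in which nuisance directions are the leading principal directions of within-speaker (session-to-session) variation and are removed with the projection I - UU^T. The function names `nap_projection` and `apply_nap`, the array layout, and the number of directions are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np


def nap_projection(supervectors, speaker_ids, num_directions):
    """Estimate nuisance directions U with shape (dim, num_directions).

    Assumption: nuisance variation is whatever differs between sessions
    of the same speaker, so it is estimated from mean-removed supervectors.
    """
    X = np.asarray(supervectors, dtype=float)
    ids = np.asarray(speaker_ids)
    deviations = np.empty_like(X)
    for spk in np.unique(ids):
        mask = ids == spk
        # Remove each speaker's mean so only within-speaker variation remains.
        deviations[mask] = X[mask] - X[mask].mean(axis=0)
    # Right singular vectors are the principal directions of that variation.
    _, _, vt = np.linalg.svd(deviations, full_matrices=False)
    return vt[:num_directions].T


def apply_nap(supervectors, U):
    """Apply (I - U U^T) to every row without forming the full matrix."""
    X = np.asarray(supervectors, dtype=float)
    return X - (X @ U) @ U.T
```

Computing `X - (X @ U) @ U.T` avoids building a dim-by-dim projection matrix, which matters because GMM supervectors are typically very long.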
This thesis consists of three main parts: keyword spotting, cepstral feature normalization, and speaker verification. For keyword spotting, sub-syllable models are used to build the keyword and filler modules.
Environment mismatch is a major source of performance degradation in speech recognition. Cepstral feature normalization techniques are widely used to produce robust features, and they share the advantage of low computational complexity. Experimental results on the Aurora 2 database show that a front end combining histogram equalization with an ARMA filter achieves a digit recognition rate of 86.37%, and one combining histogram equalization with an adaptive ARMA filter achieves 86.91%.
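As a rough illustration of such a front end, the sketch below equalizes each feature dimension of an utterance and then smooths the result over time. It makes several assumptions that the abstract does not state: the equalization target is a standard normal distribution, the ARMA filter is the fixed-order form y[t] = (y[t-M] + ... + y[t-1] + x[t] + ... + x[t+M]) / (2M + 1), and the names `histogram_equalize` and `arma_smooth` are hypothetical. The adaptive ARMA variant is not shown.

```python
import numpy as np
from statistics import NormalDist


def histogram_equalize(features):
    """Map each dimension of one utterance onto a standard normal.

    features: (num_frames, num_dims) array, e.g. cepstral coefficients.
    """
    x = np.asarray(features, dtype=float)
    num_frames, num_dims = x.shape
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    out = np.empty_like(x)
    for d in range(num_dims):
        # Rank of each frame within this dimension (0 = smallest value).
        ranks = np.argsort(np.argsort(x[:, d]))
        # Empirical CDF value, kept strictly inside (0, 1).
        cdf = (ranks + 0.5) / num_frames
        out[:, d] = inv_cdf(cdf)
    return out


def arma_smooth(features, order=2):
    """Low-pass each feature trajectory; the first and last `order`
    frames are left unfiltered."""
    x = np.asarray(features, dtype=float)
    y = x.copy()
    for t in range(order, x.shape[0] - order):
        past_outputs = y[t - order:t].sum(axis=0)
        current_and_future_inputs = x[t:t + order + 1].sum(axis=0)
        y[t] = (past_outputs + current_and_future_inputs) / (2 * order + 1)
    return y
```

A call such as `arma_smooth(histogram_equalize(cepstra), order=2)` would chain the two steps, with `cepstra` standing in for one utterance's feature matrix.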
The speaker verification system combines the Gaussian mixture model (GMM) and the support vector machine (SVM) through a kernel function. The GMM parameters are obtained from the universal background model (UBM) by MAP adaptation. The adapted parameters are used to build the target and impostor supervectors, which are then modified by the NAP process. In the training stage, the target and impostor supervectors are used to train the SVM model. For impostor selection, we choose the top n speakers whose characteristics are most similar to the target, which makes the model more discriminative. In the testing stage, test normalization is used to adjust the distance. Experiments on the NIST 2001 SRE show that the 64-mixture parametric kernel with NAP, combined with test normalization, gives the best EER and DCF, 4.17% and 0.0491 respectively.
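The scoring quantities named above can be sketched as follows. This is a minimal illustration rather than the thesis's evaluation code: test normalization is assumed to be the usual T-norm (standardizing a trial score by the scores of the same test utterance against a cohort of impostor models), the EER is found by a simple threshold sweep, and the default cost parameters in `detection_cost` are the values conventionally used in NIST evaluations, which the abstract does not confirm. All function names are hypothetical.

```python
import numpy as np


def t_norm(raw_score, cohort_scores):
    """Standardize a trial score with the cohort's mean and standard deviation."""
    cohort = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort.mean()) / cohort.std()


def detection_cost(miss, false_alarm, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Weighted detection cost; the defaults are assumed, not from the thesis."""
    return c_miss * miss * p_target + c_fa * false_alarm * (1.0 - p_target)


def equal_error_rate(target_scores, impostor_scores):
    """Sweep thresholds and return the error rate where miss and
    false-alarm rates are closest to each other."""
    targets = np.asarray(target_scores, dtype=float)
    impostors = np.asarray(impostor_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for threshold in np.sort(np.concatenate([targets, impostors])):
        miss = np.mean(targets < threshold)            # targets rejected
        false_alarm = np.mean(impostors >= threshold)  # impostors accepted
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer
```

The sweep only tries thresholds at observed scores, so on small trial lists the result is an approximation of the point where the two error curves cross.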