
Graduate student: Zi-He Chen (陳子和)
Thesis title: Latent Prosody Analysis for Robust Speaker Identification
(利用韻律訊息之強健性語者辨識)
Advisors: Yuan-Fu Liao (廖元甫), Yau-Tarng Juang (莊堯棠)
Oral defense committee:
Degree: Doctor
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Academic year of graduation: 95 (ROC calendar)
Language: English
Pages: 77
Chinese keywords: 語者辨識、韻律訊息 (speaker identification, prosodic information)
Foreign keywords: speaker identification, prosodic information
  • In public telephone networks, speaker identification (SID) systems typically suffer from handset mismatch and insufficient identification data. To improve the robustness of SID, we propose a framework that fuses low-level acoustic and high-level prosodic information: latent prosody analysis (LPA) is used to measure the distances between the prosody models of different speakers, and the acoustic model (GMM) and prosody model scores are fused to obtain the final identification result. LPA applies concepts from information retrieval to recast the SID problem as a full-text retrieval problem, realized in three steps: (1) prosodic tokenization, (2) latent prosody analysis (LPA), and (3) speaker retrieval.
    Experiments used the Handset TIMIT (HTIMIT) corpus; in a leave-one-out manner, each of nine different handsets was in turn treated as the unseen handset to validate the proposed method. Taking the conventional maximum likelihood a priori handset knowledge interpolation (ML-AKI) method as the baseline, the results show that the proposed approach achieves better speaker identification rates than the conventional pitch-GMM and prosody bi-gram modeling methods, effectively improving system robustness for both seen and unseen handsets.
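The final fusion step combines the per-speaker scores of the acoustic (GMM) and prosodic (LPA) subsystems. A minimal sketch of such late score fusion, assuming a weighted sum with an interpolation weight `alpha` tuned on held-out data (the function names and weight are illustrative, not taken from the thesis):

```python
# Hypothetical late fusion of spectral (GMM) and prosodic (LPA) SID scores.
# Each list holds one score per enrolled speaker; alpha trades off the two
# information sources and would be tuned on development data.

def fuse_scores(gmm_scores, lpa_scores, alpha=0.7):
    """Weighted sum of the two per-speaker score lists."""
    return [alpha * g + (1 - alpha) * p for g, p in zip(gmm_scores, lpa_scores)]

def identify(gmm_scores, lpa_scores, alpha=0.7):
    """Return the index of the speaker with the highest fused score."""
    fused = fuse_scores(gmm_scores, lpa_scores, alpha)
    return max(range(len(fused)), key=fused.__getitem__)
```

With equal weighting, a speaker who scores poorly on spectral features can still win on prosodic evidence, which is the point of the fusion.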


    Handsets not seen in the training phase (unseen handsets) are a significant source of performance degradation for speaker identification (SID) applications in telecommunication environments. In this thesis, a novel latent prosody analysis (LPA) approach is proposed to automatically extract the most discriminative prosodic cues for assisting conventional spectral feature-based SID. The idea of LPA is to transform the SID problem into a full-text document retrieval-like task via (1) prosodic contour tokenization, (2) latent prosody analysis, and (3) speaker retrieval. Experimental results on the phonetically balanced, read-speech handset-TIMIT (HTIMIT) database demonstrate that fusing the LPA prosodic feature-based SID system with the maximum likelihood a priori handset knowledge interpolation (ML-AKI) spectral feature-based SID system outperforms both the pitch and energy Gaussian mixture model (Pitch-GMM) and the prosodic-state bi-gram counterparts, both when counting all handsets and when counting only unseen handsets.
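As a rough illustration of the retrieval view described above, the sketch below builds a TF-IDF-weighted prosodic-keyword/speaker co-occurrence matrix, compresses it with a truncated SVD (as in latent semantic analysis), folds a test utterance's token counts into the latent space, and retrieves the closest speaker by cosine similarity. The counts, the number of latent dimensions `k`, and all function names are illustrative assumptions; the thesis's actual weighting and retrieval details may differ.

```python
import numpy as np

def tfidf(counts, idf=None):
    """TF-IDF weight a (n_terms, n_docs) count matrix.

    If idf is given (e.g. computed on training data), it is reused,
    so a query can be weighted consistently with the training matrix.
    """
    tf = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)
    if idf is None:
        df = np.count_nonzero(counts, axis=1)           # term document frequency
        idf = np.log(counts.shape[1] / np.maximum(df, 1))
    return tf * idf[:, None], idf

def lpa_retrieve(train_counts, query_counts, k=2):
    """Return the index of the enrolled speaker closest to the query."""
    w, idf = tfidf(train_counts)
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    uk, sk, speakers = u[:, :k], s[:k], vt[:k].T        # rows: speaker vectors
    q, _ = tfidf(query_counts[:, None], idf)
    q_hat = (q[:, 0] @ uk) / sk                         # fold query into space
    sims = speakers @ q_hat / (
        np.linalg.norm(speakers, axis=1) * np.linalg.norm(q_hat) + 1e-12)
    return int(np.argmax(sims))
```

For example, if speaker 0 mainly produces prosodic tokens 0–1 and speaker 1 tokens 2–3, a query dominated by tokens 2–3 is retrieved as speaker 1.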

    Chapter 1. Introduction
      1.1. Background
      1.2. Outline of this Dissertation
    Chapter 2. Latent Prosody Analysis
      2.1. Introduction
      2.2. Tokenization
        2.2.1. Inter-syllable Prosodic Feature Extraction
        2.2.2. Automatic Prosodic State Labeler
        2.2.3. Prosodic Keyword Parser
      2.3. Latent Prosody Analysis
        2.3.1. Construction of Prosodic Keyword-Speaker Co-occurrence Matrix
        2.3.2. Term Frequency and Inverse Document Frequency Method
        2.3.3. Construction of the Latent Prosody Space of Speakers
      2.4. Speaker Retrieval
      2.5. Fusion of Prosodic and Spectral Feature-based SID Scores
    Chapter 3. Cluster-Based LPA
      3.1. Introduction
      3.2. Cluster-Based LPA Method
      3.3. Fusion of CD-LPA and CI-LPA SID Scores
      3.4. Fusion of LPA and Spectral Feature-based SID Scores
    Chapter 4. Experiments
      4.1. The HTIMIT Database
      4.2. Experiment Conditions
        4.2.1. Training, Test and Extra Training Sets
        4.2.2. ML-AKI Spectral Feature-based SID Baseline
        4.2.3. Pitch-GMM and Bi-gram Prosodic Feature-based SID Baselines
      4.3. Experimental Results
        4.3.1. Spectral Feature-based SID Baseline
        4.3.2. Fusion of Spectral and Prosodic Feature-based SID Systems
    Chapter 5. Analysis and Discussions
      5.1. The Properties of LPA
        5.1.1. Automatic Prosodic State Labeling
        5.1.2. Constructed Latent Prosody Space of Speakers
        5.1.3. Speaker Entropy and the Constructed Latent Prosody Space
      5.2. Discussions on Experimental Results
        5.2.1. Sensitivity to Telephone Handset and Speaker Gender
        5.2.2. Contribution of Different Prosodic Features to LPA
      5.3. Potential of Applying LPA to Other Speaking Styles or Languages
    Chapter 6. Conclusions and Future Works
      6.1. Conclusions
      6.2. Future Works
    References

