
Graduate Student: Tran Dang Khoa (陳登國)
Thesis Title: Dual-Sequences Gated Attention Unit Architecture for Speaker Verification (應用於語者驗證之雙序列門控注意力單元架構)
Advisor: Tsung-Han Tsai (蔡宗漢)
Oral Defense Committee:
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Graduation Academic Year: 109 (2020–2021)
Language: English
Pages: 54
Chinese Keywords: 應用於語者驗證之雙序列門控注意力單元架構 (Dual-Sequences Gated Attention Unit Architecture for Speaker Verification)


    In this thesis, we present a variant of the GRU architecture called the Dual-Sequences Gated Attention Unit (DS-GAU), in which the statistics pooling of each TDNN layer of the x-vector baseline is computed and passed through a DS-GAU layer, aggregating more information from the varying temporal contexts of the input features during frame-level training. Our proposed architecture was trained on the VoxCeleb2 dataset, and the resulting feature vector is referred to as a DS-GAU-vector. We evaluated our system on the VoxCeleb1 dataset and the Speakers in the Wild (SITW) dataset and compared the experimental results with the x-vector baseline system. Our proposed method achieved up to 11.6%, 7.9%, and 7.6% relative improvement in EER over the x-vector baseline on the VoxCeleb1 dataset.
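    The statistics-pooling step the abstract refers to is the standard x-vector mechanism for mapping a variable-length sequence of frame-level activations to one fixed-length utterance-level vector: the per-dimension mean and standard deviation over time are concatenated. A minimal sketch of that step alone, with illustrative dimensions (the thesis's actual layer sizes and the DS-GAU internals are not reproduced here):

    ```python
    import numpy as np

    def statistics_pooling(frames: np.ndarray) -> np.ndarray:
        """Map frame-level features of shape (T, D) to a fixed-length
        utterance-level vector of shape (2*D,) by concatenating the
        per-dimension mean and standard deviation over the time axis."""
        mean = frames.mean(axis=0)
        std = frames.std(axis=0)
        return np.concatenate([mean, std])

    # e.g. 200 frames of 512-dim TDNN activations -> one 1024-dim vector,
    # regardless of how many frames the utterance contains
    pooled = statistics_pooling(np.random.randn(200, 512))
    assert pooled.shape == (1024,)
    ```

    Because the output size is independent of T, the pooled vector can feed fixed-size segment-level layers; in the proposed architecture this pooling is applied to every TDNN layer rather than only the last one.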

    CONTENTS
    1 Introduction 1
      1.1 Motivations 1
      1.2 Thesis Organization 2
    2 Background 4
      2.1 Time-delay neural networks (TDNN) 5
      2.2 Baseline x-vector system 5
      2.3 Extension topology of x-vector 7
        2.3.1 Extended-TDNN (E-TDNN) 7
        2.3.2 Factorized TDNN (F-TDNN) 8
      2.4 DSP-vector system 9
        2.4.1 DSP-vector structure 9
        2.4.2 DSP-LSTM architecture 11
    3 Dual-Sequences Gated Attention Unit (DS-GAU) 14
      3.1 Dual-Sequences Gated Attention Unit (DS-GAU) Vector Network Architecture 14
      3.2 Dual-Sequences Gated Attention Unit (DS-GAU) 14
        3.2.1 Recurrent Attention Unit (RAU) 14
        3.2.2 Gated Attention Unit (GAU) 17
        3.2.3 Dual-Sequences Gated Attention Unit (DS-GAU) 19
    4 Experimental Setups and Results 23
      4.1 Data preparation 23
        4.1.1 Dataset preparation and metrics 23
        4.1.2 Pre-processing speaker features 25
        4.1.3 Backend classifier 25
      4.2 Experimental results 25
        4.2.1 Evaluation on VoxCeleb1 dataset 26
        4.2.2 Evaluation on SITW dataset 29
    5 Conclusion and Future Recommendations 36
    6 References 38

