| Author: | 陳登國 Tran Dang Khoa |
|---|---|
| Thesis Title: | 應用於語者驗證之雙序列門控注意力單元架構 Dual-Sequences Gated Attention Unit Architecture for Speaker Verification |
| Advisor: | 蔡宗漢 Tsung-Han Tsai |
| Committee Members: | |
| Degree: | 碩士 Master |
| Department: | 資訊電機學院 College of Electrical Engineering and Computer Science - 電機工程學系 Department of Electrical Engineering |
| Publication Year: | 2021 |
| Academic Year: | 109 (2020-2021) |
| Language: | English |
| Pages: | 54 |
| Chinese Keywords: | 應用於語者驗證之雙序列門控注意力單元架構 |
In this thesis, we present a variant of the GRU architecture called the Dual-Sequences Gated Attention Unit (DS-GAU), in which statistics pooling is computed from each TDNN layer of the x-vector baseline and passed through the DS-GAU layer, aggregating information from the varying temporal contexts of the input features during frame-level training. Our proposed architecture was trained on the VoxCeleb2 dataset, and the resulting feature vector is referred to as a DSGAU-vector. We evaluated it on the VoxCeleb1 dataset and the Speakers in the Wild (SITW) dataset and compared the experimental results with the x-vector baseline system. The results show that our proposed method achieved relative EER improvements of up to 11.6%, 7.9%, and 7.6% over the x-vector baseline on the VoxCeleb1 dataset.
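The statistics-pooling step mentioned above maps variable-length frame-level activations to a single fixed-length utterance-level vector by concatenating the per-dimension mean and standard deviation over time, as in the standard x-vector recipe. A minimal NumPy sketch (the function name and shapes are illustrative, not from the thesis):

```python
import numpy as np

def statistics_pooling(frames: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Pool frame-level features of shape (batch, time, dim) into an
    utterance-level vector of shape (batch, 2 * dim) by concatenating
    the mean and standard deviation over the time axis."""
    mean = frames.mean(axis=1)
    std = np.sqrt(frames.var(axis=1) + eps)  # eps guards against sqrt(0)
    return np.concatenate([mean, std], axis=1)

# Example: 200 frames of 512-dim TDNN activations for 4 utterances
x = np.random.randn(4, 200, 512)
pooled = statistics_pooling(x)
print(pooled.shape)  # (4, 1024)
```

In the proposed architecture, such pooled statistics are taken from each TDNN layer (rather than only the last) and fed to the DS-GAU layers, which is how information from different temporal contexts is aggregated.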