| Author: | 陳登國 Tran Dang Khoa |
|---|---|
| Thesis Title: | 應用於語者驗證之雙序列門控注意力單元架構 Dual-Sequences Gated Attention Unit Architecture for Speaker Verification |
| Advisor: | 蔡宗漢 Tsung-Han Tsai |
| Committee Members: | |
| Degree: | 碩士 Master |
| Department: | 資訊電機學院 College of Electrical Engineering and Computer Science - 電機工程學系 Department of Electrical Engineering |
| Publication Year: | 2021 |
| Academic Year: | 109 (2020-2021) |
| Language: | English |
| Pages: | 54 |
| Chinese Keywords: | 應用於語者驗證之雙序列門控注意力單元架構 |
In this thesis, we present a variant of the GRU architecture called the Dual-Sequences Gated Attention Unit (DS-GAU), in which statistics pooling is computed from each TDNN layer of the x-vector baseline and passed through the DS-GAU layer, aggregating information from the varying temporal contexts of the input features during frame-level training. Our proposed architecture was trained on the VoxCeleb2 dataset, and the resulting feature vector is referred to as a DSGAU-vector. We evaluated it on the VoxCeleb1 dataset and the Speakers in the Wild (SITW) dataset and compared the experimental results with the x-vector baseline system. The results show that our proposed method achieved relative EER improvements of up to 11.6%, 7.9%, and 7.6% over the x-vector baseline on the VoxCeleb1 dataset.
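The statistics-pooling step mentioned above maps variable-length frame-level activations to a single fixed-length utterance-level vector by concatenating the per-dimension mean and standard deviation over time, as in the standard x-vector recipe. A minimal NumPy sketch (the function name and shapes are illustrative, not from the thesis):

```python
import numpy as np

def statistics_pooling(frames: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Pool frame-level features of shape (batch, time, dim) into an
    utterance-level vector of shape (batch, 2 * dim) by concatenating
    the mean and standard deviation over the time axis."""
    mean = frames.mean(axis=1)
    std = np.sqrt(frames.var(axis=1) + eps)  # eps guards against sqrt(0)
    return np.concatenate([mean, std], axis=1)

# Example: 200 frames of 512-dim TDNN activations for 4 utterances
x = np.random.randn(4, 200, 512)
pooled = statistics_pooling(x)
print(pooled.shape)  # (4, 1024)
```

In the proposed architecture, such pooled statistics are taken from each TDNN layer (rather than only the last) and fed to the DS-GAU layers, which is how information from different temporal contexts is aggregated.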