| Author | 蔡曜丞 (Yao-Cheng Tsai) |
|---|---|
| Thesis Title | 基於遞迴神經網路之聲學回聲消除技術 (Acoustic Echo Cancellation Based on Recurrent Neural Network) |
| Advisor | 張寶基 (Pao-Chi Chang) |
| Oral Defense Committee | |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Communication Engineering |
| Year of Publication | 2019 |
| Academic Year of Graduation | 107 |
| Language | Chinese |
| Pages | 68 |
| Keywords | Deep Learning, Acoustic Echo Cancellation, Speech Separation, Recurrent Neural Network |
Acoustic echo cancellation (AEC) remains a common problem in speech and signal processing, with applications such as teleconferencing, hands-free telephony, and mobile communications. Traditionally, AEC has been handled with adaptive filters; today, deep learning can be used to address the more complex cases of the problem.
This thesis treats acoustic echo cancellation as a speech-separation problem, replacing the traditional adaptive filter that estimates the acoustic echo, and trains the model with recurrent neural network (RNN) architectures. Because RNNs model time-varying functions well, they are effective for AEC. We train bidirectional long short-term memory (LSTM) and bidirectional gated recurrent unit (GRU) networks, extract features from single-talk and double-talk speech, and adjust weights to control the relative contribution of the two feature sets in order to estimate the ideal ratio mask (IRM). The estimated mask separates the signals and thereby removes the echo. Experimental results show that the method cancels echo effectively.
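The ideal ratio mask that the network is trained to estimate can be sketched as follows. This is the common magnitude-domain formulation of the IRM, not code from the thesis; the function and variable names are illustrative, and the mask is computed per time-frequency bin from the (here assumed known) near-end and echo magnitude spectrograms:

```python
import numpy as np

def ideal_ratio_mask(near_mag, echo_mag, beta=0.5, eps=1e-8):
    """Per time-frequency-bin ideal ratio mask.

    near_mag / echo_mag: magnitude spectrograms of the near-end speech
    and the acoustic echo (same shape). beta=0.5 gives the common
    square-root formulation; all mask values lie in [0, 1].
    """
    near_pow = near_mag ** 2
    echo_pow = echo_mag ** 2
    return (near_pow / (near_pow + echo_pow + eps)) ** beta

# Toy example with two frequency bins: the first is dominated by
# near-end speech, the second by echo.
near = np.array([[0.9, 0.1]])
echo = np.array([[0.1, 0.8]])
mask = ideal_ratio_mask(near, echo)

mixture = near + echo       # simplified microphone signal
estimate = mask * mixture   # echo-suppressed estimate
```

At inference time the trained RNN predicts this mask from features of the microphone and far-end signals, and the predicted mask is applied to the mixture spectrogram before resynthesis.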