| Author: | Jin-Sian Lin (林金賢) |
|---|---|
| Title: | Study of speech dereverberation based on deep learning approach |
| Advisor: | Chao-Min Wu (吳炤民) |
| Committee: | |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2021 |
| Academic year: | 109 |
| Language: | Chinese |
| Pages: | 86 |
| Keywords (Chinese): | 深度學習, 回響抑制 |
| Keywords (English): | Deep learning, Dereverberation |
Reverberation, generally caused by sound reflections from ceilings, walls, and floors, exists everywhere in our living environment. For normal-hearing listeners its effect is not obvious, but for users of hearing aids or other assistive listening devices, reverberation severely degrades the quality of speech reception; even in a quiet environment, they may be unable to hear clearly. Although traditional dereverberation approaches can achieve reasonably good performance, they still rely on knowledge of the acoustic characteristics of the environment, which is difficult to obtain in real conditions. Deep learning, which has developed rapidly in recent years, offers an alternative: by training a deep neural network (DNN) on a large amount of data, the nonlinear relationship between input and output can be learned, removing the traditional methods' dependence on known environmental characteristics. In this thesis, TMHINT (Taiwan Mandarin hearing in noise test) sentences previously recorded by our laboratory were used as the speech material, and reverberant speech was simulated under many different room conditions for training (2,160 sentences) and testing (480 sentences). The logarithmic power spectrum (LPS) was extracted from the speech as the input feature, and the DNN was trained by supervised learning. The network architectures used in this study are the deep denoising autoencoder (DDAE) and the integrated deep and ensemble learning algorithm (IDEA); we compare their respective strengths and weaknesses, as well as the results of combining them with other network architectures. Because network performance also varies with the training target, the differences between mapping and masking targets are compared as well. To verify the credibility of the comparison, the TIMIT corpus, widely used in speech research abroad, was also used to validate our results. Finally, perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) were used to assess all results and to identify the most suitable network architecture and output target. The evaluation showed that combining DDAE and IDEA with residual networks yielded the best performance (average PESQ above 2.2 and average STOI above 0.8), and that under the masking target, DDAE clearly outperformed IDEA in both architecture and dereverberation capability.
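The processing pipeline described above (LPS input features, and the mapping versus masking training targets) can be sketched as follows. This is a minimal illustration using only numpy, not the thesis's actual code: the frame size, hop size, and the IRM-style ratio-mask definition are assumptions for illustration, and the thesis's exact settings may differ.

```python
# Minimal sketch of the LPS feature pipeline and the two target types.
# Frame/FFT sizes are illustrative, not the thesis's settings.
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))       # shape: (frames, n_fft//2 + 1)

def lps(x, eps=1e-10):
    """Logarithmic power spectrum: the DNN input feature."""
    return np.log(stft_mag(x) ** 2 + eps)

def mapping_target(clean):
    """Mapping target: the network regresses the clean LPS directly."""
    return lps(clean)

def ratio_mask(clean, reverb, eps=1e-10):
    """Masking target: a time-frequency ratio mask (an IRM-style sketch;
    the exact mask definition used in the thesis may differ)."""
    s, y = stft_mag(clean), stft_mag(reverb)
    return np.clip(s / (y + eps), 0.0, 1.0)

# Toy usage with a random signal standing in for clean speech and a crude
# delayed copy standing in for reverberation (not a real room impulse response).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
reverb = clean + 0.3 * np.roll(clean, 800)
X = lps(reverb)                    # DNN input
T = mapping_target(clean)          # mapping target
M = ratio_mask(clean, reverb)      # masking target
assert X.shape == T.shape == M.shape
```

At inference time, a mapping network outputs an enhanced LPS directly, while a masking network's predicted mask is multiplied with the reverberant spectrogram before resynthesis; the abstract's comparison concerns which of these targets trains better for each architecture.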
Allen, J. B., & Berkley, D. A. (1979). Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950.
Bees, D., Blostein, M., & Kabal, P. (1991). Reverberant speech enhancement using cepstral processing. in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 977-980.
Benesty, J., Sondhi, M. M., & Huang, Y. (2007). Springer Handbook of Speech Processing, Ch. 4.6.
Delcroix, M., Yoshioka, T., Ogawa, A., & Kubo, Y. (2014). Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge. in Proc. REVERB Challenge, pp. 1-8.
Delfarah, M., & Wang, D. L. (2017). Features for masking-based monaural speech separation in reverberant conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 1085-1094.
Erdogan, H., Hershey, J. R., Watanabe, S., & Roux, J. L. (2015). Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 708-712.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., & Pallett, D. S. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. Tech. Rep, vol. 93.
Gillespie, B. W., Malvar, S. H., & Florencio, D. A. (2001). Speech dereverberation via maximum-kurtosis subband adaptive filtering. in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3701-3704.
Habets, E. A. (2010). Room impulse response generator. Technische Universiteit Eindhoven.
Han, K., Wang, Y., & Wang, D. (2014). Learning spectral mapping for speech dereverberation. in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4661-4665.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. in Proc. Computer Vision and Pattern Recognition, pp. 770-778.
Hussain, T., Siniscalchi, S. M., Lee, C. -C., Wang, S. -S., Tsao, Y., & Liao, W. -H. (2017). Experimental Study on Extreme Learning Machine Applications for Speech Enhancement. IEEE Access, vol. 5, pp. 25542-25554.
Jin, Z., & Wang, D. L. (2009). Supervised learning approach to monaural segregation of reverberant speech. IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 625-638.
Lee, W. J., Wang, S. S., Chen, F., Lu, X., Chien, S. Y., & Tsao, Y. (2018). Speech Dereverberation Based on Integrated Deep and Ensemble Learning Algorithm. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5454-5458.
Li, J., Akagi, M., & Suzuki, Y. (2006). Noise reduction based on microphone array and post-filtering for robust hands-free speech recognition in adverse environments. Ph.D. dissertation, School of Information Science, Japan Advanced Institute of Science and Technology, Japan.
Loizou, P. C. (2007). Speech Enhancement: Theory and Practice.
Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2014). Ensemble modeling of denoising autoencoder for speech spectrum restoration. in Proc. INTERSPEECH.
Ma, J., Hu, Y., & Loizou, P. C. (2009). Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions. J Acoust Soc Am, 125(5), pp. 3387-3405.
Miyoshi, M., & Kaneda, Y. (1988). Inverse filtering of room acoustics. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 2, pp. 145-152.
Mohammadiha, N., & Doclo, S. (2016). Speech dereverberation using nonnegative convolutive transfer function and spectro-temporal modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 2, pp. 276-289.
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. in Proc. International Conference on Machine Learning, pp. 807-814.
Neely, S. T., & Allen, J. B. (1979). Invertibility of a room impulse response. Journal of the Acoustical Society of America, vol. 66, pp. 165-169.
Nisa, H. K. (2021). Speech dereverberation based on HELM framework for cochlear implant coding strategy. Master's Thesis, Institute of Electrical Engineering, National Central University.
Radlovic, B. D., Williamson, R. C., & Kennedy, R. A. (2000). Equalization in an acoustic reverberant environment: robustness results. IEEE Transactions on Speech and Audio Processing, vol. 8, no. 3, pp. 311-319.
Rix, A. W., Beerends, J. G., Hollier, M. P., & Hekstra, A. P. (2001). Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 749-752.
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. CoRR, vol. abs/1505.00387.
Boll, S. F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120.
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136.
Virtanen, T., Gemmeke, J., & Raj, B. (2013). Active-Set Newton Algorithm for Overcomplete Non-Negative Representations of Audio. IEEE Transactions on Audio, Speech, and Language Processing., vol. 21, no. 11, pp. 2277-2289.
Wang, D., & Lim, J. (1982). The unimportance of phase in speech enhancement. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, no. 4, pp. 679-681.
Wang, Y., Narayanan, A., & Wang, D. (2014). On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849-1858.
Williamson, D. S., Wang, Y., & Wang, D. L. (2016). Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483-492.
Williamson, D. S., & Wang, D. L. (2017). Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1492-1501.
Wu, M., & Wang, D. L. (2006). A two-stage algorithm for one-microphone reverberant speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 774-784.
Xiao et al. (2016). Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation. EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, pp. 1-18.
Yoshioka, T., & Nakatani, T. (2012). Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening. IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2707-2720.
Zhang, X. L., & Wang, D. L. (2016). A deep ensemble learning method for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 967-977.
Department of Statistics, Ministry of Health and Welfare, Taiwan (R.O.C.). (2020). Retrieved from https://dep.mohw.gov.tw/DOS/cp-2976-13827-113.html
高士喆. (2014). Speech enhancement using a Bayesian estimator of perceptually motivated spectral amplitude. Master's thesis, Institute of Electrical Engineering, National Taipei University of Technology.
陳星瑋. (2019). Multi-channel sound source direction-of-arrival estimation and speech enhancement based on deep neural networks. Master's thesis, Institute of Communications Engineering, National Chiao Tung University.
黃國原. (2009). Effects of channel number, stimulation rate, and binaural listening on Mandarin speech recognition in noise with simulated cochlear implants. Master's thesis, Institute of Electrical Engineering, National Central University.
黃銘緯. (2005). Taiwan Mandarin hearing in noise test. Master's thesis, Graduate Institute of Speech and Hearing Disorders, National Taipei College of Nursing.
楊宗翰. (2012). A dereverberation method using adaptive beamforming and a gain-attenuation post-filter. Master's thesis, Institute of Electrical and Control Engineering, National Chiao Tung University.