
Author: 林源煜 (Lim Yuan Yu)
Thesis Title: On the limitations of diffusion-based speech enhancement models and an adaptive selection strategy
Advisor: 陳弘軒 (Hung-Hsuan Chen)
Oral Defense Committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - International Master's Degree Program in Artificial Intelligence
Publication Year: 2025
Graduation Academic Year: 113
Language: English
Number of Pages: 65
Chinese Keywords: 語音增強 (Speech Enhancement), 擴散模型 (Diffusion Models), 音訊頻譜轉換器 (Audio Spectrogram Transformer), 頻譜熵 (Spectral Entropy)
Foreign Keywords: Audio Spectrogram Transformer, DNSMOS, Spectral Entropy
Views: 60 / Downloads: 0


    Diffusion probabilistic models have emerged as a new state-of-the-art in speech enhancement (SE), capable of generating high-fidelity audio. However, their practical application is often hindered by significant performance variability across different models and acoustic conditions. A single, universally optimal model rarely exists, and there is a limited understanding of the input signal characteristics that dictate the success or failure of a given enhancement approach.

    This dissertation addresses these challenges by proposing a novel, two-stage intelligent model recommendation system designed to dynamically select the most suitable SE model for a given noisy input. To enable this, we first introduce a set of spectral features based on Cross-Entropy and KL-Divergence, which are shown to be statistically significant in characterizing enhancement difficulty and identifying model-specific operational strengths.
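    The exact construction of the Cross-Entropy and KL-Divergence features is detailed in the thesis itself (Sections 3.2.2-3.2.4); as a minimal illustrative sketch, one can normalize each spectrogram frame into a probability distribution over frequency bins, compare every frame against the utterance-average spectrum, and summarize the resulting per-frame CE/KL values into scalar features. The reference distribution, eps smoothing, and summary statistics below are assumptions for illustration, not the thesis's definitions:

    ```python
    import numpy as np

    def frame_distributions(spec, eps=1e-12):
        """Normalize each spectrogram frame (column) into a probability
        distribution over frequency bins."""
        spec = np.abs(spec) + eps
        return spec / spec.sum(axis=0, keepdims=True)

    def cross_entropy(p, q, eps=1e-12):
        # CE(p, q) = -sum_k p_k * log(q_k), computed per frame
        return -np.sum(p * np.log(q + eps), axis=0)

    def kl_divergence(p, q, eps=1e-12):
        # KL(p || q) = sum_k p_k * log(p_k / q_k), computed per frame
        return np.sum(p * np.log((p + eps) / (q + eps)), axis=0)

    # Example: CE/KL of each frame against the utterance-average spectral
    # distribution, then summary statistics as scalar features.
    rng = np.random.default_rng(0)
    spec = rng.random((257, 100))          # |STFT|: 257 bins x 100 frames
    P = frame_distributions(spec)
    q_bar = P.mean(axis=1, keepdims=True)  # average spectral distribution
    ce = cross_entropy(P, q_bar)           # one value per frame
    kl = kl_divergence(P, q_bar)
    features = np.array([ce.mean(), ce.std(), kl.mean(), kl.std()])
    ```

    Frames whose spectra stay close to the average (e.g. stationary noise) yield low KL values, while highly non-stationary segments push KL up, which is one intuition for why such features can correlate with enhancement difficulty.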

    Our proposed recommender system employs a "gatekeeper-expert" architecture to effectively manage the severe class imbalance inherent in the model selection task. The system is trained on a comprehensive evaluation of three leading diffusion models: SGMSE+, StoRM, and CDiffuSE. Extensive experiments demonstrate that fine-tuned pre-trained backbones, such as EfficientNet-B0 and AST, achieve high classification accuracy for the recommendation task. Ablation studies validate that a hybrid input, combining Mel-spectrograms with our proposed spectral features, further improves performance.
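    The two-stage routing logic can be sketched as follows. In the thesis the two stages are trained classifiers (fine-tuned EfficientNet-B0 or AST heads); here they are replaced with toy threshold functions, and the class names, thresholds, and feature indices are illustrative assumptions only. The gatekeeper handles the class imbalance by first making a binary call for the dominant default model, so the expert only ever sees the minority cases:

    ```python
    import numpy as np

    class GatekeeperExpertRecommender:
        """Sketch of a two-stage 'gatekeeper-expert' recommender.

        Stage 1 (gatekeeper): binary decision -- does the dominant default
        model suffice for this input?
        Stage 2 (expert): multi-class decision among the remaining models,
        invoked only for the minority cases the gatekeeper rejects.
        """

        def __init__(self, gatekeeper, expert, default_model="SGMSE+"):
            self.gatekeeper = gatekeeper      # features -> bool
            self.expert = expert              # features -> model name
            self.default_model = default_model

        def recommend(self, features):
            if self.gatekeeper(features):
                return self.default_model
            return self.expert(features)

    # Toy stand-ins: thresholds on two scalar features (e.g. mean CE and
    # mean KL). Real stages would be trained neural classifiers.
    gate = lambda f: f[0] < 0.5
    expert = lambda f: "StoRM" if f[1] < 1.0 else "CDiffuSE"
    rec = GatekeeperExpertRecommender(gate, expert)

    print(rec.recommend(np.array([0.2, 0.0])))  # -> SGMSE+
    print(rec.recommend(np.array([0.8, 2.0])))  # -> CDiffuSE
    ```

    Splitting the decision this way means the gatekeeper can be trained on a roughly balanced "default vs. not-default" problem, while the expert trains only on the rarer classes, which is one standard way to cope with severe class imbalance.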

    Crucially, the end-to-end evaluation shows that the recommendation-driven approach achieves a superior or highly competitive average speech enhancement quality (as measured by DNSMOS) compared to universally applying any single baseline model. This work provides not only a practical solution for optimizing SE pipelines but also a deeper analytical framework for understanding the interplay between signal characteristics and the performance of diffusion-based generative models.

    Table of Contents

    1 Introduction
      1.1 Overview of Speech Enhancement
      1.2 Problem Statement
      1.3 Research Questions
      1.4 Contributions
      1.5 Dissertation Outline
    2 Background and Literature Review
      2.1 Evolution of Speech Enhancement
        2.1.1 Traditional Speech Enhancement Techniques
        2.1.2 Deep Learning-based Speech Enhancement
      2.2 Diffusion Probabilistic Models
        2.2.1 Discrete-Time Models (DDPMs)
        2.2.2 Continuous-Time Models (SDEs)
      2.3 Baseline Diffusion-based Speech Enhancement Models
        2.3.1 SGMSE+: The SDE-based Conditional Approach
        2.3.2 CDiffuSE: The DDPM-based Conditional Approach
        2.3.3 StoRM: The Two-Stage Regenerative Approach
      2.4 Related Spectral Representation Techniques
        2.4.1 Spectral Entropy
      2.5 Evaluation Metrics for Speech Enhancement
        2.5.1 Intrusive Metrics
        2.5.2 Non-Intrusive Metrics
    3 Methodology
      3.1 Experimental Design for the Recommender System
        3.1.1 Enhancement Model Recommendation System
        3.1.2 Enhancement Model Recommender Architectures
        3.1.3 Proposed Two-Stage "Gatekeeper-Expert" Recommender System
      3.2 Proposed Spectral Feature Extraction
        3.2.1 A Priori Rationale for Feature Design
        3.2.2 Cross-Entropy (CE)
        3.2.3 Kullback-Leibler (KL) Divergence
        3.2.4 Feature Calculation from CE and KL Matrices
      3.3 Model Architectures for Analytical Experiments
        3.3.1 Details of Pre-trained Models
    4 Results and Discussions
      4.1 Dataset and Experiment Design
        4.1.1 Datasets
      4.2 Preliminary Observation: Comparative Performance of Speech Enhancement Models
        4.2.1 Categorization of Enhancement Outcomes
        4.2.2 Analysis of Top-Performing Models per Sample
        4.2.3 Feature Characteristics for Top-Performing Models and Universally Challenging Cases
      4.3 Performance of the Enhancement Model Recommender System
      4.4 Impact of Recommendation-Driven Enhancement on Speech Quality
      4.5 Ablation Studies and Discussions
        4.5.1 Analysis of Spectral Features in Characterizing Outcomes
        4.5.2 Analysis of Recommender Design via Ablation Studies
        4.5.3 Feature Analysis on Misclassified Samples
    5 Discussion
      5.1 The Role and Significance of the Proposed Spectral Features
        5.1.1 Characterizing Enhancement Difficulty and Failure Modes
        5.1.2 Identifying Model-Specific Strengths
        5.1.3 Value as a Complementary Input for Advanced Classifiers
      5.2 Potential for Overfitting
      5.3 Robustness to Distribution Shift
    6 Conclusion and Future Works
      6.1 Conclusion
      6.2 Future Works
        6.2.1 Improvement of Diffusion Model Architectures for Enhancement
        6.2.2 Improved Noise Dataset Collection and Characterization
    Bibliography
    A Implementation
