
Author: Ding-Yan Chen (陳定言)
Title: Adaf-Spectrogram: An Adaptive Frequency-Axis Spectrogram Designed from Energy Distribution (Adaf-Spectrogram:基於能量分布之自適應頻率軸頻譜圖設計)
Advisor: Hung-Hsuan Chen (陳弘軒)
Oral Defense Committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Year of Publication: 2025
Academic Year of Graduation: 113 (2024-25)
Language: Chinese
Pages: 77
Chinese Keywords: 頻譜圖、自適應頻譜圖、時間序列、時頻分析
Keywords: Spectrogram, Adaptive Spectrogram, Time Series, Time-Frequency Analysis
Views: 71; Downloads: 0
Abstract:

    In the fields of modern signal processing and artificial intelligence, the spectrogram is a fundamental visual representation that transforms time-domain signals into the time-frequency domain. It has found extensive applications in areas such as human activity recognition, biomedical signal analysis, speech recognition, and environmental sound classification. Among these, the Mel-spectrogram is a prominent variant. By emulating the human auditory system's perception of frequency through non-linear compression of the frequency axis, it more effectively preserves semantic and prosodic information in speech signals. Consequently, it has become one of the most expressive and widely adopted acoustic features.
    With its intuitive yet detailed time-frequency representation, the spectrogram effectively reveals latent time-varying frequency characteristics within a signal, providing highly discriminative input features for deep learning models. It has demonstrated exceptional performance in classification and recognition tasks, particularly within Convolutional Neural Network (CNN) architectures.
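    As a concrete illustration of the two representations above, the following sketch computes a linear-frequency spectrogram via the STFT and then compresses its frequency axis with a simplified triangular Mel filter bank. This is an illustrative example using NumPy/SciPy with made-up parameters (sampling rate, window length, filter count), not the thesis's actual configuration:

```python
import numpy as np
from scipy.signal import spectrogram

def hz_to_mel(f):
    # Standard HTK-style Hz-to-Mel conversion
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, freqs):
    # Triangular filters spaced evenly on the Mel scale: dense at low
    # frequencies, sparse at high frequencies (non-linear compression)
    mel_pts = np.linspace(hz_to_mel(freqs[0]), hz_to_mel(freqs[-1]), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (freqs - left) / (center - left)
        down = (right - freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

# Toy two-tone signal (parameters are illustrative only)
fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

# Linear-frequency spectrogram via the STFT
f, frames, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=128)

# Mel-spectrogram: project the frequency axis through the filter bank
mel_S = mel_filterbank(40, f) @ Sxx
```

    The filter bank here collapses 129 linear STFT bins into 40 Mel bands, which is the frequency-axis warping the Mel-spectrogram performs.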

    Building upon these established representations, this thesis proposes a novel Adaptive Frequency Spectrogram (Adaf-Spectrogram). This data-driven method automatically adjusts the frequency axis scaling by computing the overall frequency energy distribution across an entire dataset, thereby more effectively emphasizing critical frequency features. Experimental results demonstrate that the proposed Adaf-Spectrogram exhibits excellent adaptability across multiple datasets. Furthermore, it outperforms conventional linear-scale spectrograms in recognition tasks, showcasing a significant performance improvement.
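    A minimal sketch of one plausible reading of this idea: accumulate per-frequency energy across a whole dataset, place segment boundaries at equal cumulative-energy quantiles (so high-energy regions receive finer frequency resolution), and aggregate STFT rows within each segment. All function names and parameters below are hypothetical; the thesis's actual segmentation and aggregation algorithm (Section 3.3) may differ in its details:

```python
import numpy as np
from scipy.signal import spectrogram

def dataset_energy_profile(signals, fs, nperseg=256):
    """Accumulate per-frequency-bin energy over the whole dataset."""
    total = None
    for x in signals:
        f, _, Sxx = spectrogram(x, fs=fs, nperseg=nperseg)
        e = Sxx.sum(axis=1)
        total = e if total is None else total + e
    return f, total

def adaptive_bin_edges(energy, n_segments):
    """Cut the frequency axis where cumulative energy crosses equal
    quantiles, so high-energy regions get more (narrower) segments."""
    cdf = np.cumsum(energy) / energy.sum()
    targets = np.linspace(0, 1, n_segments + 1)[1:-1]
    cuts = np.searchsorted(cdf, targets) + 1
    # Deduplicate: collapsed cuts mean fewer, wider low-energy segments
    return np.unique(np.concatenate(([0], cuts, [len(energy)])))

def adaf_spectrogram(x, fs, edges, nperseg=256):
    """Aggregate STFT rows within each adaptive segment (here: mean)."""
    _, _, Sxx = spectrogram(x, fs=fs, nperseg=nperseg)
    return np.stack([Sxx[a:b].mean(axis=0)
                     for a, b in zip(edges[:-1], edges[1:])])

# Toy dataset: signals whose energy clusters near 500 Hz
fs = 4000
t = np.arange(0, 1.0, 1.0 / fs)
rng = np.random.default_rng(0)
signals = [np.sin(2 * np.pi * (500 + rng.normal(0, 30)) * t)
           for _ in range(8)]

f, energy = dataset_energy_profile(signals, fs)
edges = adaptive_bin_edges(energy, n_segments=16)
adaf = adaf_spectrogram(signals[0], fs, edges)
```

    On this toy dataset the boundaries crowd around 500 Hz, giving that region narrow segments while the low-energy remainder of the axis is merged into wide ones, which is the adaptive rescaling effect the abstract describes.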

    Table of Contents:
    Abstract (Chinese) / Abstract (English) / Acknowledgments / Contents
    1. Introduction
    2. Related Work
       2.1 Spectrogram
       2.2 Mel-Spectrogram
       2.3 Other Time-Frequency Analysis Methods
       2.4 Spectrograms in Deep Learning Applications
    3. Methodology
       3.1 Data Preprocessing and Input Format
       3.2 Short-Time Fourier Transform (STFT)
       3.3 Frequency Energy Accumulation and Segmentation
           3.3.1 Frequency Energy Statistics and Distribution Estimation
           3.3.2 Adaptive Frequency-Axis Segmentation Algorithm
       3.4 Constructing the Adaptive Frequency-Axis Spectrogram
    4. Experimental Results and Analysis
       4.1 Dataset Overview
       4.2 Experimental Setup and Implementation Details
           4.2.1 Spectrogram Generation and Preprocessing
           4.2.2 Model Architectures and Training Settings
       4.3 Comparison of Spectrogram Variants
           4.3.1 Analysis of CNN Experimental Results
           4.3.2 Analysis of ViT Experimental Results
           4.3.3 Summary
       4.4 Comparison of Segment Counts for the Adaptive Frequency-Axis Spectrogram
       4.5 Comparison of Energy Aggregation Methods for the Adaptive Frequency-Axis Spectrogram
       4.6 Comparison of Convolution Kernel Sizes for the Adaptive Frequency-Axis Spectrogram
    5. Discussion
    6. Conclusion
       6.1 Conclusions
       6.2 Future Work
    References
    Appendix A: Experiment Code
    Appendix B: Vision Transformer (ViT) Supplementary Experiments
       B.1 Seismic Dataset (MicSigV1)
       B.2 Audio Datasets (ESC50 and BSC5)
    Appendix C: Supplementary Spectrogram Visualizations
       C.1 Audio Dataset (ESC50)

