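Asymmetric Kernel Convolutional Neural Network for Acoustic Scenes Classification | National Central University Electronic Theses and Dissertations System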


Graduate student:	Yu-Chi Wu (伍聿旂)
Thesis title:	Asymmetric Kernel Convolutional Neural Network for Acoustic Scenes Classification (非對稱摺積神經網路之聲音場景分類)
Advisor:	張寶基
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Communication Engineering
Publication year:	2017
Graduation academic year:	105 (ROC calendar)
Language:	Chinese
Pages:	80
Chinese keywords:	計算聽覺場景分析、聲音場景分類、深度學習、摺積神經網路
English keywords:	Computational Auditory Scene Analysis, Acoustic scene classification, Deep learning, Convolutional neural network

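As people pursue convenience, we use computers to learn and understand things that humans know well, and we hope that computers can recognize their surroundings by analyzing sound. The IEEE Audio and Acoustic Signal Processing (AASP) Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, first held in 2013, sparked a wave of interest in acoustic scene classification (ASC) and was a first step toward unifying the datasets and evaluation methods for ASC. A second challenge, DCASE2016, followed in 2016.
This thesis uses a convolutional neural network (CNN), a deep learning method, for ASC. The input to the CNN is a spectrogram, which carries both time-domain and frequency-domain information. We hypothesize that the data varies by different amounts along the time axis and the frequency axis, and therefore use an elongated convolution kernel, the asymmetric kernel proposed in this thesis (in contrast to the conventional square, symmetric kernel). We also apply normalization during training to speed it up. Although the current trend is toward wider and deeper networks for better classification, we find that the proposed architecture, a two-layer CNN without pre-training, outperforms the fifth-place entry in DCASE2016.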
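As a rough illustration of the asymmetric-kernel idea, the following minimal sketch applies a square kernel and a rectangular kernel to a toy spectrogram and compares the output shapes. It is not the thesis implementation: the function name `conv2d_valid`, the toy input, the 5 x 2 kernel size, and the choice to elongate the kernel along the frequency axis are all assumptions made for illustration, since the abstract does not state the kernel dimensions or orientation.

```python
def conv2d_valid(spec, kernel):
    """Valid-mode 2-D cross-correlation of a (freq x time) matrix with a kernel."""
    n_freq, n_time = len(spec), len(spec[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for i in range(n_freq - k_h + 1):
        row = []
        for j in range(n_time - k_w + 1):
            acc = 0.0
            # Sum of element-wise products over the kernel window.
            for a in range(k_h):
                for b in range(k_w):
                    acc += spec[i + a][j + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


# Toy "spectrogram": 6 frequency bins x 8 time frames (made-up values).
spec = [[float(f + t) for t in range(8)] for f in range(6)]

square_kernel = [[1.0] * 3 for _ in range(3)]  # 3 x 3, symmetric
asym_kernel = [[1.0] * 2 for _ in range(5)]    # 5 x 2, hypothetical asymmetric size

sq_out = conv2d_valid(spec, square_kernel)
as_out = conv2d_valid(spec, asym_kernel)

print(len(sq_out), len(sq_out[0]))  # 4 6
print(len(as_out), len(as_out[0]))  # 2 7
```

The rectangular kernel spans more frequency bins than time frames per step, so it summarizes the two axes at different scales, which is the intuition behind treating them asymmetrically.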

The Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge has been held three times. The first DCASE Challenge was held in 2013, and DCASE2016 was the second. IEEE Audio and Acoustic Signal Processing (AASP) held the second challenge three years later in order to establish a brand-new dataset and unify the rules for acoustic scene classification (ASC).
In this work, we use the ASC dataset from DCASE2016 and propose an Asymmetric Kernel Convolutional Neural Network (AKCNN), whose kernel shape differs from the traditional square kernel: its width and height are unequal, so the kernel is rectangular. The proposed method also uses weight normalization (WN) to shorten training, because WN makes the training loss and test accuracy converge earlier. Best of all, WN also helps increase ASC accuracy. The results show that AKCNN achieves an accuracy of 86.7%. Ranked against the DCASE2016 ASC Challenge results, this score is better than that of the fifth-place entry.
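For readers unfamiliar with weight normalization, the sketch below shows the reparameterization it is generally understood to use: a weight vector w is expressed as w = g * v / ||v||, so that its length (g) and direction (v) are learned separately. This is a minimal, assumption-laden illustration rather than the thesis code; the function name `weight_norm` and the example values are hypothetical.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v|| for a flat list of weights v and a scalar gain g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


v = [3.0, 4.0]            # unnormalized direction parameters (made-up values)
w = weight_norm(v, g=2.0)  # resulting weights have Euclidean length g

print(w)  # [1.2, 1.6]
```

Whatever the scale of v, the Euclidean norm of w equals g, which decouples the magnitude of the weights from their direction during optimization.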

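Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1    Introduction
1-1    Research Motivation and Background
1-2    Thesis Organization
Chapter 2    Acoustic Scene Classification
2-1    History of Acoustic Scene Classification
2-1-1    The 2013 Detection and Classification of Acoustic Scenes and Events Challenge
2-1-2    The 2016 and 2017 Detection and Classification of Acoustic Scenes and Events Challenges
2-2    Features for Acoustic Scene Classification
2-2-1    Log-Mel Spectrogram
2-2-2    Mel-Frequency Cepstral Coefficients
Chapter 3    Neural Networks and Deep Learning
3-1    Artificial Neural Networks
3-1-1    History of Artificial Neural Networks
3-1-2    Backpropagation Algorithm
3-2    Deep Learning
3-2-1    Deep Neural Networks
3-2-2    Convolutional Neural Networks
3-3    Normalization for Accelerated Training
3-3-1    Batch Normalization
3-3-2    Weight Normalization
Chapter 4    Proposed Architecture
4-1    Data Preprocessing
4-1-1    Feature Extraction
4-1-2    Data Normalization
4-1-3    Data Segmentation and Stacking
4-2    Convolutional Neural Network Architecture
4-2-1    Training Phase
4-2-2    Testing Phase
Chapter 5    Experiments and Analysis
5-1    Experimental Environment and Dataset
5-2    Parameter Selection Experiments
5-3    Comparison and Analysis of Experimental Results
Chapter 6    Conclusion and Future Work
References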

                                

