| 研究生: |
伍聿旂 Yu-Chi Wu |
|---|---|
| 論文名稱: |
非對稱摺積神經網路之聲音場景分類 Asymmetric Kernel Convolutional Neural Network for Acoustic Scenes Classification |
| 指導教授: | 張寶基 |
| 口試委員: | |
| 學位類別: |
碩士 Master |
| 系所名稱: |
資訊電機學院 - 通訊工程學系 Department of Communication Engineering |
| 論文出版年: | 2017 |
| 畢業學年度: | 105 |
| 語文別: | 中文 |
| 論文頁數: | 80 |
| 中文關鍵詞: | 計算聽覺場景分析 、聲音場景辨分類 、深度學習 、摺積神經網路 |
| 外文關鍵詞: | Computational Auditory Scene Analysis, Acoustic scenes classification, Deep learning, Convolutional neural network |
| 相關次數: | 點閱:16 下載:0 |
| 分享至: |
| 查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
隨著人類追求便利性,我們使用電腦使其學習並了解人類所熟知的事物,我們希望通過分析聲音使電腦認識自己的環境,自2013年首次舉辦IEEE Audio and Acoustic Signal Processing (AASP) 聲音場景與事件辨識(Detection and Classification of Acoustic Scenes and Events, DCASE) 競賽,掀起了聲音場景分類 (Acoustic scene classification, ASC)的風波,邁向統一ASC的資料庫與評估方法的第一步,更於2016年舉辦第二屆 DCASE2016競賽。
本論文利用深度學習中的摺積神經網路 (Convolutional Neural Net-work, CNN) 作為ASC的方法。由於CNN之輸入資料為頻譜,而頻譜包含時域資訊與頻域資訊,因此我們假設時域資訊與頻域資訊的資料變化量不一,因此使用長形的摺積核 (kernel) ,也就是本論文提出之非對稱摺積核 (Asymmetric Kernel) (相對於以往的方形的對稱摺積核),並在訓練期間做資料正規化 (Normalization)加速訓練。我們發現即使現在多以寬又深的網路作為趨勢,發展更佳的資料分類方法,但其實本論文所提出的架構,兩層不用預訓練 (Pre-train)的CNN即可達到相較DCASE2016排名第五名更佳的效果。
Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge have held in three times. The first DCASE Challenge was held in 2013. Then, DCASE2016 Challenge was the 2nd times of DCASE Challenge. The result why IEEE Audio and Acoustic Signal Processing (AASP) held the 2nd challenge after 3 years is to reset a brand new dataset and united the rule of ASC.
In this work, we use the dataset of ASC from DCASE2016 to propose an Asymmetric Kernel Convolutional Neural Network (AKCNN), whose kernel shape is very different from the traditionally squared kernel. The width and height of the kernel are asymmetric which means that the shape of the kernel is a rectangular kernel. Also, the proposed uses weight normalization (WN) to accelerate the training time because it can early converge the training loss and testing accuracy during training. The best of all, WN can help increase the accuracy of ASC. The result shows that AKCNN achieves accuracy 86.7%. If we rank the score in DCASE2016 ASC Challenge, it would show that we have a better score than the 5th place.
[1] D. Wang and G. J. Brown, “Computational Auditory Scene Analysis: Prin-ciples, Algorithms, and Applications”. Wiley-IEEE Press, 2006.
[2] A. S. Bregman, “Auditory Scene Analysis,” MIT Press, Cambridge, MA, 1990.
[3] M. Slaney, “The History and Future of CASA,” Speech separation by hu-mans and machines, pp.199-211, Springer US, 2005.
[4] N. Sawhney, “Situational Awareness from Environmental Sounds,” Tech-nical Report, Massachusetts Institute of Technology, 1997.
[5] D. Barchiesi, D. Giannoulis, D. Stowell, M. D. Plumbley, “Acoustic Scene Classification,” in IEEE Signal Processing Magazine, vol. 32, no. 3, pp.16-34, May 2015.
[6] S. McAdams, “Recognition of sound sources and events,” Thinking in Sound: The Cognitive Psychology of Human Audition, pp. 146-198, 1993.
[7] H. E. Zadeh, B. Lehner, M. Dorfer and G. Widmer, “CP-JKU Submissions for DCASE-2016: A Hybrid Approach Using Binaural I-Vectors and Deep Convolutional Neural Networks,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2016), Budapest, Hungary, Sep. 2016.
[8] M. Valenti, A. Diment, G. Parascandolo, S. Squartini, and T. Virtanen, “DCASE 2016 Acoustic Scene Classification Using Convolutional Neural Networks,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2016), Budapest, Hungary, Sep. 2016.
[9] D. Giannoulis, E. Benetos, D. Stowell, and M. D. Plumbley, IEEE AASP CASA Challenge - Public Dataset for Scene Classification Task, https://archive.org/details/dcase2013_scene_classification, retrieved Jun. 29, 2017.
[10] D. Giannoulis, E. Benetos, D. Stowell, and M. D. Plumbley, IEEE AASP CASA Challenge - Private Dataset for Scene Classification Task, https://archive.org/details/dcase2013_scene_classification_testset, retrieved Jun. 29, 2017.
[11] M. Annamaria, H. Toni, and V. Tuomas, TUT Acoustic scenes 2016, De-velopment dataset, http://doi.org/10.5281/zenodo.45739, retrieved Dec. 1, 2016.
[12] M. Annamaria, H. Toni, and V. Tuomas, TUT Acoustic scenes 2016, Eval-uation dataset, https://zenodo.org/record/165995#.WXblsYiGNhE, re-trieved Dec. 1, 2016.
[13] ETSI Standard Doc., “Speech Processing, Transmission and Quality As-pects (STQ); Distributed Speech Recognition; Front-End Feature Extraction Algorithm; Compression Algorithms,” ES 201 108, v1.1.3, Sep. 2003.
[14] ETSI Standard Doc., “Speech Processing, Transmission and Quality As-pects (STQ); Distributed Speech Recognition; Front-End Feature Extraction Algorithm; Compression Algorithms,” ES 202 050, v1.1.5, Jan. 2007.
[15] Librosa: an open source Python package for music and audio analysis, https://github.com/librosa, retrieved Dec. 1, 2016.
[16] B. McFee, C. Raffe, D. Liang, D. P. W. Ellis, M. McVicar, E.Battenberg, and O. Nieto, “librosa: Audio and Music Signal Analysis in Python,” in Pro-ceedings of the 14th Python in Conference, Jul. 2015.
[17] K. Simonyan, and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[18] C. Szegedy, et al. “Going Deeper with Convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1-9, Jun. 2015.
[19] K. Alex, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems, pp.1097-1105, 2012.
[20] W. S. Mcculloch and W. Pitts, “A Logical Calculus of the Ideas Immanent in Nervous Activity,” Bulletin of Mathematical Biophysics, vol.5, no.4, pp.115-133, Dec. 1943.
[21] D. O. Hebb, “Organization of Behavior,” New York: Wiley & Sons.
[22] N. Rochester, J. Holland, L. Haibt, W. Duda, “Tests on A Cell Assembly Theory of the Action of the Brain, Using A Large Digital Computer”
[23] F. Rosenblatt, “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain,” Cornell Aeronautical Laboratory, Psychological Review, v. 65, no. 6, pp. 386–408.
[24] F. Rosenblatt, “Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms,” Spartan Books, Washington DC, 1961.
[25] M. Minsky and S. Paper, “Perceptrons,” Cambridge, MA: MIT Press.
[26] P. J. Werbos, “Beyond regression: new tools for prediction and analysis in the behavioral sciences,” Ph.D. thesis, Harvard University, 1974.
[27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representa-tions by back-propagating errors,” Nature, vol. 323, pp. 533–536, Oct. 1986.
[28] V. Nair, and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), Jun. 2010.
[29] S. Sigtia, and S. Dixon, "Improved Music Feature Learning with Deep Neural Networks," in 2014 IEEE International Conference on Acoustics, speech and signal processing (ICASSP), pp. 6959-6963, May 2014.
[30] N. Srivastava, G. E. Hinton, A. Krizhevsky, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," in Journal of Machine Learn-ing Research, vol. 15, pp. 1929-1958. Jun. 2014.
[31] Q. Kong, I. Sobieraj, W. Wang and M. Plumbley, “Deep Neural Network Baseline for DCASE Challenge 2016,” in 2016 Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2016), pp. 50-54, Sep. 2016.
[32] Z. Liao, G. Carneiro. "Competitive Multi-Scale Convolution," arXiv pre-print arXiv:1511.05635, 2015.
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[34] I. Mrazova, and M. Kukacka, “Hybrid convolutional neural networks,” in 6th IEEE International Conference on Industrial Informatics (INDIN), 2008.
[35] M. Lin, Q. Chen, and S. Yan, “Network in Network,” in Computing Re-search Repository (CoRR), 2013.
[36] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in International Conference on Machine Learning, pp. 448-456, 2015.
[37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[38] T. Salimans and D. P. Kingma, “Weight Normalization: A Simple Repa-rameterization to Accelerate Training of Deep Neural Networks,” in Ad-vances in Neural Information Processing Systems, pp. 901-909, 2016.
[39] TensorFlow: an open source Python package for machine intelligence, https://www.tensorflow.org, retrieved Dec. 1, 2016.
[40] J. Dean, et al. “Large-Scale Deep Learning for Building Intelligent Com-puter Systems,” in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 1-1, Feb. 2016.
[41] M., Annamaria, T. Heittola, and T. Virtanen, “TUT Database for Acoustic Scene Classification and Sound Event Detection,” IEEE 2016 24th Euro-pean Signal Processing Conference, pp. 1128-1132, Aug. 2016.
[42] DCASE2017 Challenge Baseline website, http://doi.org/10.5281/zenodo.400515, retrieved Mar. 17, 2017.
[43] DCASE2016 Challenge website, http://www.cs.tut.fi/sgn/arg/dcase2016/task-results-acoustic-scene-classification, retrieved Jun. 26, 2017.
[44] A. V. D. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A Generative Model for Raw Audio,” arXiv preprint arXiv:1609.03499, 2016.