| Graduate Student: | 林郁凱 Yu-Kai Lin |
|---|---|
| Thesis Title: | 深度類神經網路於環境音偵測之應用與改良 (The Applications and Improvements of Deep Neural Networks in Environmental Sound Recognition) |
| Advisor: | 蘇木春 Mu-Chun Su |
| Oral Defense Committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Graduate Institute of Software Engineering |
| Publication Year: | 2018 |
| Graduation Academic Year: | 106 |
| Language: | English |
| Pages: | 52 |
| Chinese Keywords: | 深度神經網路 (deep neural network), 卷積神經網路 (convolutional neural network), 環境音偵測 (environmental sound recognition), 特徵融合 (feature fusion) |
| English Keywords: | Deep Neural Network, Convolutional Neural Network, Environmental Sound Recognition, Feature Combination |
Neural networks have achieved excellent results in sound recognition, and many kinds of acoustic features have been tried as network inputs for training and classification. However, whether a neural network can extract sound features on its own from raw audio signals remains a challenge. This thesis improves on existing raw-signal network architectures: by using a deeper network, we successfully boost the performance of raw-signal analysis, and with a spectrogram-like conversion we investigate the proper parameter settings. The proposed 1d-2d network reaches an accuracy of 73.55% on ESC50.
In addition, this thesis proposes a feature-fusion network architecture that exploits the properties of global pooling layers to integrate features in a flexible way. With this kind of network, we successfully combine a network fed with raw signals and a network fed with log-mel spectrogram coefficients; the proposed ParallelNet achieves 81.55% accuracy on ESC50 with these inputs, reaching human-level recognition.
Neural networks have achieved great results in sound recognition, and many different kinds of acoustic features have been tried as training inputs to such networks. However, it is still in doubt whether a neural network can efficiently extract features from raw audio signal input. This study improves the raw-signal-input networks of earlier research: with deeper architectures, the raw signals are analyzed more effectively by our network, and we also discuss several kinds of network settings. With a spectrogram-like conversion, our network reaches an accuracy of 73.55% on the open audio dataset ESC50.
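The "spectrogram-like conversion" can be pictured as a bank of 1-D filters sliding over the raw waveform, with the filter responses stacked into a 2-D (filters x frames) map that later 2-D convolutions can treat like a spectrogram. The sketch below illustrates that idea only; the filter count, filter length, and stride are illustrative assumptions, not the thesis's actual 1d-2d network settings.

```python
import numpy as np

def conv1d_bank(signal, filters, stride):
    """Apply each 1-D filter to the raw signal with the given stride."""
    n_filters, flen = filters.shape
    n_frames = (len(signal) - flen) // stride + 1
    out = np.empty((n_filters, n_frames))
    for t in range(n_frames):
        window = signal[t * stride : t * stride + flen]
        out[:, t] = filters @ window      # one response per filter
    return np.maximum(out, 0.0)           # ReLU nonlinearity

rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)          # 1 s of 16 kHz audio
filters = rng.standard_normal((64, 400))       # 64 filters, 25 ms long
feature_map = conv1d_bank(waveform, filters, stride=160)  # 10 ms hop
print(feature_map.shape)                       # (64, 98): filters x frames
```

In a trained network the filters are learned rather than random, so the stacked responses play the role of a data-driven spectrogram.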
In addition, this study proposes a network architecture that can combine different networks fed with different features. With the help of global pooling, a flexible fusion scheme is integrated into the network. Our experiments successfully combined two networks that use different audio feature inputs, raw audio signal and log-mel spectrogram. With the above settings, the proposed ParallelNet reaches an accuracy of 81.55% on ESC50, which matches human-level recognition.
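The role of global pooling in the fusion can be sketched as follows: each branch produces a feature map of its own shape, global average pooling collapses every map to a fixed-length channel vector, and the concatenated vectors feed one shared classifier. The branch shapes, random weights, and 50-class head below are illustrative assumptions (the class count mirrors ESC50), not the actual ParallelNet.

```python
import numpy as np

def global_avg_pool(fmap):
    """Collapse all spatial/temporal axes, keeping only the channel axis."""
    return fmap.reshape(fmap.shape[0], -1).mean(axis=1)

rng = np.random.default_rng(1)
raw_branch = rng.standard_normal((256, 98))         # channels x time
logmel_branch = rng.standard_normal((128, 60, 41))  # channels x mel x time

# Pooling makes the two branches shape-compatible for concatenation.
fused = np.concatenate([global_avg_pool(raw_branch),
                        global_avg_pool(logmel_branch)])  # length 256 + 128

W = rng.standard_normal((50, fused.size)) * 0.01    # linear classifier head
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax over 50 classes
print(fused.shape, probs.shape)
```

Because pooling discards the branches' spatial geometry, either branch can be swapped for a network of a different depth or input type without changing the fusion layer.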