| Graduate student: | 張育瑞 Yu-ruey Chang |
|---|---|
| Thesis title: | 基於深度學習之AAC壓縮域翻唱歌快速檢索 Fast Cover Song Retrieval in AAC Domain based on Deep Learning |
| Advisor: | 張寶基 |
| Oral defense committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Communication Engineering |
| Year of publication: | 2015 |
| Academic year of graduation: | 104 |
| Language: | Chinese |
| Number of pages: | 64 |
| Keywords (Chinese): | 音樂檢索, 翻唱歌曲, AAC, 深度學習 |
| Keywords (English): | music information retrieval, cover song, AAC, deep learning |
As the volume of multimedia data grows, quickly finding the items a user is interested in within a large database becomes increasingly important. Traditional retrieval relies mostly on keyword search, which requires substantial manual effort to annotate the data in advance; as the amount of data increases, keyword annotation becomes increasingly impractical. Content-based retrieval is a more natural approach: it extracts features from the music content itself, which also avoids the problem of different people assigning different labels to the same song.
This thesis targets AAC, an audio format widely used on the Internet today, and proposes fast cover song retrieval performed directly in the AAC compressed domain. The proposed system takes the modified discrete cosine transform (MDCT) coefficients obtained by partial decoding and maps them to 12-dimensional chroma features. Multiple frames are then combined into segments that serve as the input to deep learning, which automatically learns key features that better represent the music. A sparse autoencoder further reduces the dimensionality of each song's representation, addressing the long matching times of conventional methods. Experimental results show that the proposed method achieves a mean reciprocal rank (MRR) of 0.505 and saves more than 70% of the matching time compared with retrieval methods from the related literature.
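The front end of the pipeline described in the abstract and its evaluation metric can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the nearest-semitone mapping rule, the 55 Hz to 5 kHz band, the 44.1 kHz sample rate, the 1024-coefficient long-window frame, and the 8-frame segment length are all assumptions chosen for illustration, and the deep-learning and sparse autoencoder stages are omitted.

```python
import numpy as np

def mdct_bin_to_pitch_class(n_bins=1024, sample_rate=44100.0, f_ref=440.0,
                            f_min=55.0, f_max=5000.0):
    """Return a (12, n_bins) 0/1 matrix assigning MDCT bins to pitch classes.

    Assumed rule (hypothetical): each bin inside [f_min, f_max] goes to the
    pitch class of its nearest equal-tempered semitone, with A as class 0.
    """
    k = np.arange(n_bins)
    # Centre frequency of MDCT bin k for a frame of n_bins coefficients.
    freqs = (k + 0.5) * sample_rate / (2.0 * n_bins)
    valid = (freqs >= f_min) & (freqs <= f_max)
    # Semitone offset from the reference pitch, folded into one octave.
    semitones = np.round(12.0 * np.log2(freqs[valid] / f_ref)).astype(int)
    mapping = np.zeros((12, n_bins))
    mapping[semitones % 12, k[valid]] = 1.0
    return mapping

def chroma_from_mdct(mdct_frames, mapping):
    """Map (n_frames, n_bins) MDCT coefficients to (n_frames, 12) chroma."""
    energy = np.asarray(mdct_frames, dtype=float) ** 2
    chroma = energy @ mapping.T
    # Normalise each frame by its largest chroma bin (silent frames stay zero).
    peak = chroma.max(axis=1, keepdims=True)
    return chroma / np.maximum(peak, 1e-12)

def stack_segments(chroma, frames_per_segment=8):
    """Concatenate consecutive chroma frames into fixed-length segment vectors."""
    n_segments = chroma.shape[0] // frames_per_segment
    trimmed = chroma[: n_segments * frames_per_segment]
    return trimmed.reshape(n_segments, frames_per_segment * chroma.shape[1])

def mean_reciprocal_rank(ranks):
    """MRR over queries, given the 1-based rank of the first correct match."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks))

if __name__ == "__main__":
    mapping = mdct_bin_to_pitch_class()
    frames = np.zeros((16, 1024))
    frames[:, 20] = 1.0  # energy only in the bin centred near 441 Hz
    chroma = chroma_from_mdct(frames, mapping)
    segments = stack_segments(chroma, frames_per_segment=8)
    print(chroma.shape, segments.shape)
    print(mean_reciprocal_rank([1, 2, 4]))
```

Under these assumptions each segment is a 96-dimensional vector (8 frames of 12 chroma values), which is the kind of input a network or autoencoder would then compress. Note that at low frequencies the MDCT bin spacing (about 21.5 Hz here) is wider than a semitone, so a practical mapping would likely need a more careful treatment of the lowest octaves.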