| Graduate Student: | 沈正勝 (Seksan Mathulaprangsan) |
|---|---|
| Thesis Title: | A Study of Locality Preserved Joint Dictionary Learning (聯合局部保留字典學習法研究) |
| Advisor: | 王家慶 (Jia-Ching Wang) |
| Committee Members: | |
| Degree: | Doctor |
| Department: | Department of Computer Science & Information Engineering, College of Electrical Engineering & Computer Science |
| Year of Publication: | 2019 |
| Graduation Academic Year: | 107 |
| Language: | English |
| Number of Pages: | 66 |
| Keywords: | dictionary learning, joint dictionary learning, locality preserving, nonnegative matrix factorization |
This thesis combines locality-preserving techniques with dictionary learning methods to improve their performance on speech emotion recognition and object recognition.
First, for image-based object recognition, we propose two novel locality-preserving dictionary learning methods. The first is a discriminative locality-preserving K-SVD (DLP-KSVD), which incorporates label information into the locality-preserving term. The second is a label-consistent LP-KSVD (LCLP-KSVD), which imposes label consistency as a constraint to further strengthen the discrimination between classes.
Next, for speech emotion recognition, this thesis proposes a locality-preserved joint nonnegative matrix factorization (LP-JNMF) method, which learns highly discriminative shared representations by simultaneously reconstructing the speech features and training a simple linear classifier. In addition, a locality-preserving constraint is introduced so that the learned features retain the manifold of the high-dimensional features.
Experimental results show that the proposed methods outperform several state-of-the-art dictionary learning methods on object recognition and speech emotion recognition.
This study uses the locality-preserving technique, which exploits the geometric information of the data, to boost the performance of dictionary learning approaches in several pattern recognition tasks, including speech emotion recognition and object recognition.
First, to fully exploit the potential of the locality-preserving technique for object recognition, two novel locality-preserving dictionary learning methods are developed. The first is the discriminative LP-KSVD (DLP-KSVD), which incorporates label information into the locality-preserving term. The second is the label-consistent LP-KSVD (LCLP-KSVD), which adds a label-consistency constraint to the original LP-KSVD model, penalizing sparse codes from different classes to improve discriminative power.
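The label-consistent idea can be written down as a sketch. Assuming the standard LC-KSVD notation (Y: training data, D: dictionary, X: sparse codes, Q: ideal discriminative sparse-code targets, A: a linear transform, H: label matrix, W: linear classifier) plus a graph Laplacian L built from neighbourhood affinities of the training samples, a plausible form of the LCLP-KSVD objective is:

```latex
\min_{D,A,W,X}\;
\underbrace{\|Y - DX\|_F^2}_{\text{reconstruction}}
+ \alpha \underbrace{\|Q - AX\|_F^2}_{\text{label consistency}}
+ \beta \underbrace{\|H - WX\|_F^2}_{\text{classification}}
+ \gamma \underbrace{\operatorname{tr}\!\left(X L X^{\top}\right)}_{\text{locality preservation}}
\quad \text{s.t. } \|x_i\|_0 \le T \;\; \forall i
```

The exact weighting and constraint form used in the thesis may differ; the point is that the locality term keeps codes of neighbouring samples close while the label-consistency and classification terms separate classes.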
Second, a novel approach for speech emotion recognition, named locality-preserved joint NMF (LP-JNMF), is introduced. It achieves two goals jointly: learning a dictionary for reconstructing the input acoustic features, and learning a simple linear classifier for annotation. Because the learned representations are shared between the dictionary and the annotation matrix, their discriminative power is promoted. Moreover, to preserve the manifold of the input acoustic features, a locality penalty term is incorporated into the objective function of joint dictionary learning, further improving the discriminability of the learned dictionary.
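The LP-JNMF idea can be sketched with generic multiplicative updates. The following is a minimal illustration, not the thesis's exact algorithm: it jointly factorizes features X ≈ DH and labels Y ≈ WH, and adds a graph-Laplacian penalty tr(HLH^T) with L = Dg - A so that codes of neighbouring samples stay close. The function name `lp_jnmf`, the parameter names, and the weights `alpha` and `lam` are illustrative assumptions.

```python
import numpy as np

def lp_jnmf(X, Y, A, k, alpha=1.0, lam=0.1, iters=200, seed=0):
    """Simplified locality-preserved joint NMF sketch (not the thesis's exact model).

    X : (d, n) nonnegative feature matrix, columns are samples
    Y : (c, n) binary label-indicator matrix
    A : (n, n) symmetric nonnegative neighbourhood affinity matrix
    k : number of dictionary atoms

    Jointly factorizes X ~ D @ H and Y ~ W @ H with a graph-Laplacian
    penalty tr(H L H^T), L = Dg - A, via multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    c = Y.shape[0]
    D = rng.random((d, k))
    W = rng.random((c, k))
    H = rng.random((k, n))
    Dg = np.diag(A.sum(axis=1))  # degree matrix of the affinity graph
    eps = 1e-9                   # avoids division by zero
    for _ in range(iters):
        # Update dictionary and classifier with standard NMF rules.
        D *= (X @ H.T) / (D @ H @ H.T + eps)
        W *= (Y @ H.T) / (W @ H @ H.T + eps)
        # Shared codes: reconstruction + annotation + locality terms.
        H *= (D.T @ X + alpha * (W.T @ Y) + lam * (H @ A)) / \
             (D.T @ D @ H + alpha * (W.T @ W @ H) + lam * (H @ Dg) + eps)
    return D, W, H
```

A test utterance would then be encoded against the learned D (e.g. by nonnegative least squares) and classified by the argmax of W applied to its code; these update rules follow the generic graph-regularized joint NMF recipe, so the thesis's formulation may weight or constrain the terms differently.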
Experimental results show that the proposed methods outperform the baseline algorithms, which are state-of-the-art dictionary learning methods, on object recognition and speech emotion recognition tasks.