| Graduate Student: | 洪晨瑄 Chen-Hsuan Hung |
|---|---|
| Thesis Title: | 學習模態間及模態內之共用表示式 Learning Representations for Inter- and Intra-Modality Data |
| Advisor: | 柯士文 Shih-Wen Ke |
| Committee Members: | |
| Degree: | 碩士 Master |
| Department: | 管理學院 - 資訊管理學系 Department of Information Management |
| Year of Publication: | 2022 |
| Academic Year: | 110 |
| Language: | English |
| Number of Pages: | 86 |
| Chinese Keywords: | 跨模態學習 (cross-modal learning) |
Many studies have investigated representation learning for specific domains such as Natural Language Processing and Computer Vision. Text can also be viewed as a kind of representation that stands for a certain object; in other words, natural language may share the same meaning as an image. Plenty of prior work combines texts and images for tasks such as image captioning, visual question answering, and image-to-text retrieval. However, the shared representation between multiple languages and an image is seldom discussed. Hence, in this study, we propose an encoder-decoder architecture to learn shared representations for inter- and intra-modality data. Within this framework, we regard the latent space vector as the shared representation, since it is learned from both modalities in a supervised way to capture their shared semantics.
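The encoder-decoder idea above can be sketched as follows: each modality gets its own encoder into a single shared latent space, and decoders map that latent vector back to either modality. All dimensions, the linear/tanh layers, and the random weights here are illustrative assumptions, not the thesis's actual trained architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the thesis): a 784-dim image vector
# (e.g. a flattened 28x28 image) and a 300-dim word vector, mapped into
# a 64-dim shared latent space.
IMG_DIM, TXT_DIM, LATENT_DIM = 784, 300, 64

# Randomly initialized linear maps standing in for trained networks.
W_enc_img = rng.normal(0, 0.01, (IMG_DIM, LATENT_DIM))
W_enc_txt = rng.normal(0, 0.01, (TXT_DIM, LATENT_DIM))
W_dec_img = rng.normal(0, 0.01, (LATENT_DIM, IMG_DIM))
W_dec_txt = rng.normal(0, 0.01, (LATENT_DIM, TXT_DIM))

def encode(x, W):
    """Project a modality-specific input into the shared latent space."""
    return np.tanh(x @ W)

def decode(z, W):
    """Map a shared latent vector back to a modality-specific output."""
    return z @ W

# A paired (image, text) sample that shares one semantic concept.
img = rng.normal(size=(1, IMG_DIM))
txt = rng.normal(size=(1, TXT_DIM))

z_img = encode(img, W_enc_img)  # shared representation from the image
z_txt = encode(txt, W_enc_txt)  # shared representation from the text

# Because both latents live in the same space, either one can be decoded
# into either modality (cross-modal reconstruction/translation).
img_from_txt = decode(z_txt, W_dec_img)
txt_from_img = decode(z_img, W_dec_txt)

print(z_img.shape, z_txt.shape)  # both (1, 64)
```

In a supervised setting like the one described, the encoders and decoders would be trained so that paired image and text inputs of the same class land close together in the latent space while each decoder reconstructs its own modality.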
We further analyze the shared representations learned by our architecture. Through visualizations compared against single-modality representations, we demonstrate that our shared representations indeed learn from both image-modality and text-modality data. We also discuss other factors that may contribute to shared representation learning. We find that including synonyms during training leads to a more distinct and condensed per-class distribution of shared representations, while preserving the ability to reconstruct images and improving generality in generating text vectors. When training with an additional language, the shared representations can still be correctly converted into the original images and the corresponding text vectors, and their distribution exhibits the same distinct, condensed characteristic observed when adding synonyms. Lastly, we investigate the scalability of our shared representation learning process and discuss the limits of this approach.