
Author: Alex Li-Heng Yang (楊歷恆)
Title: Multimodal Composed Image Retrieval Using Querying-Transformer
Advisor: Min-Te Sun (孫敏德)
Committee:
Degree: Master
Department: Department of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Graduation Academic Year: 112
Language: English
Pages: 42
Chinese Keywords: image search (圖片搜索)
Foreign Keywords: Composed Image Retrieval


    Composed Image Retrieval (CIR) systems are crucial because they enable users to find specific images using both visual references and descriptive text, addressing the limitations of traditional text-only search methods. In this thesis, we propose a system that utilizes the Querying-Transformer (Qformer) to address the limitations of traditional image retrieval methods. The Qformer integrates image and text data through a transformer-based architecture, adeptly capturing complex relationships between the two modalities. By incorporating the Image-Text Matching (ITM) loss function, our system significantly enhances the accuracy of image-text matching, ensuring superior alignment between visual and textual representations. We also employ residual learning techniques within the Qformer model to preserve essential visual information, thereby maintaining the quality and features of the original images throughout the learning process. To confirm the efficacy of our approach, we performed experiments on the FashionIQ and CIRR datasets. The results show that our proposed system significantly outperforms existing models, achieving superior recall metrics across various categories. The experimental results demonstrate the potential of our system in practical applications, offering robust improvements in the precision and relevance of image retrieval tasks.
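The abstract reports results as recall metrics on FashionIQ and CIRR. For reference, Recall@K scores a query as a hit when the target image appears among the top K retrieved candidates, and the dataset score averages the hits over all queries. A minimal pure-Python sketch (function names and the toy image ids are illustrative, not taken from the thesis):

```python
def recall_at_k(ranked_ids, target_id, k):
    """Score one query: 1.0 if the target image is among the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(all_rankings, targets, k):
    """Average Recall@K over a set of queries (hypothetical ranking lists)."""
    hits = [recall_at_k(ranked, target, k) for ranked, target in zip(all_rankings, targets)]
    return sum(hits) / len(hits)

# Toy example: the first query's target is ranked 2nd (a top-2 hit),
# the second query's target is ranked outside the top 2 (a miss).
rankings = [["img3", "img7", "img1"], ["img2", "img9", "img4"]]
targets = ["img7", "img4"]
print(mean_recall_at_k(rankings, targets, 2))  # → 0.5
```

CIRR additionally reports a subset-level recall over a small group of visually similar candidates; the same function applies with the ranking list restricted to that subset.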

1 Introduction
2 Related Work
    2.1 Visual and Language Pre-training
        2.1.1 Non-Contrast Learning-based Models
        2.1.2 Contrast Learning-based Models
    2.2 Composed Image Retrieval
        2.2.1 LSTM-based Composed Image Retrieval
        2.2.2 Attention Mechanism-based Composed Image Retrieval
        2.2.3 BERT-based Composed Image Retrieval
        2.2.4 Vision-Language Foundation Composed Image Retrieval
3 Preliminary
    3.1 CLIP
    3.2 CLIP4Cir
        3.2.1 Combiner Network
    3.3 BLIP
    3.4 Qformer
    3.5 Residual Learning
    3.6 Position-guided Text Prompt
        3.6.1 Block Tag Generation
4 Design
    4.1 Motivation
    4.2 Assumptions
    4.3 Problem Statement
    4.4 Research Challenges
    4.5 Proposed System Architecture
        4.5.1 Padding
        4.5.2 Qformer
5 Performance
    5.1 Datasets
    5.2 Evaluation Metrics
    5.3 Environmental Settings
    5.4 Experimental Results and Analysis
        5.4.1 Experiment Results of CIRR Dataset
        5.4.2 Experiment Results of FashionIQ Dataset
    5.5 Ablation Studies
        5.5.1 Performance
6 Conclusion

