| Graduate Student: | 劉慎軒 (Shen-Hsuan Liu) |
|---|---|
| Thesis Title: | Heuristic Attention Pixel-Level Contrastive Loss for Self-supervised Visual Representation Learning (基於遮罩注意力之像素級對比式自監督式學習) |
| Advisor: | 王家慶 (Jia-Ching Wang) |
| Committee Members: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science & Information Engineering |
| Year of Publication: | 2022 |
| Graduating Academic Year: | 110 |
| Language: | Chinese |
| Number of Pages: | 44 |
| Chinese Keywords: | 深度學習, 自監督式學習, 表徵學習 |
| Keywords: | Deep learning, Self-supervised learning, Representation learning |
In deep learning, achieving high accuracy depends not only on the design of the model architecture and training method but also on large amounts of training data. In conventional supervised learning, however, large-scale training data implies a need for many high-quality labels, making model training very expensive. In recent years, researchers have therefore proposed self-supervised learning, in which a model is pre-trained on large amounts of easily obtained unlabeled data and then fine-tuned on a very small amount of labeled data, achieving high accuracy while reducing the cost of manual annotation.
Most recent self-supervised learning methods in computer vision compute a contrastive loss over features of the entire image, minimizing the distance between features of the same image in the embedding space. This instance-level training works well for tasks that use whole-image features (such as classification), but it is less ideal for tasks that rely on differences between pixels (object detection or instance segmentation). This thesis therefore proposes a mask-attention-based pixel-level contrastive learning method, Heuristic Attention Pixel-Level Contrastive Learning (HAPiCL). A foreground mask of the image is generated through an unsupervised method; the feature map extracted by the encoder is split into foreground and background features according to the mask; and a pixel-level contrastive loss is then computed from the foreground and background feature vectors, improving the model's accuracy on object detection and segmentation tasks.
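The unsupervised foreground-mask step described above can be illustrated with a toy heuristic. The sketch below is an assumption for illustration only, not the thesis's actual mask generator: it scores each pixel by its colour distance from the average border colour, on the assumption that the background dominates the image border.

```python
import numpy as np

def heuristic_foreground_mask(image):
    """Toy sketch of an unsupervised foreground mask (NOT the thesis's
    exact method): score each pixel by its colour distance from the
    average border colour, assuming the background dominates the border."""
    border = np.concatenate([
        image[0], image[-1],          # top and bottom rows
        image[:, 0], image[:, -1],    # left and right columns
    ])
    bg_colour = border.mean(axis=0)
    # Pixels whose colour is far from the border average are likely foreground.
    score = np.linalg.norm(image - bg_colour, axis=-1)
    return (score > score.mean()).astype(np.uint8)
```

In the actual method, the mask partitions the encoder's feature map rather than the raw image, but the idea of separating foreground from background without labels is the same.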
Training a high-accuracy deep learning model depends on various factors, such as the model architecture and the training method. In addition, a large amount of high-quality labeled data is indispensable. However, collecting such large-scale, high-quality datasets is prohibitively expensive, which becomes a barrier to training high-accuracy models in the supervised learning framework. Recently, the concept of self-supervised learning has been proposed: a deep learning model is pre-trained on an unlabeled dataset and then fine-tuned on a small amount of labeled data to reach high accuracy. The aforementioned cost issue is thereby alleviated.
In self-supervised learning, most previous work measures a contrastive loss on features extracted from the entire image. Such instance-level objectives are well suited to classification, but they are not ideal for tasks that require pixel-level information, such as object detection and instance segmentation. We therefore propose a pixel-level contrastive learning method based on mask attention, called Heuristic Attention Pixel-Level Contrastive Learning (HAPiCL). In HAPiCL, a binary mask is generated through an unsupervised method and used to split the input image's features into foreground and background features. During training, the model measures a pixel-level contrastive loss between the foreground and background features. This method yields better performance on object detection as well as instance segmentation.
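The pixel-level objective described above can be sketched as a masked InfoNCE-style loss. The NumPy snippet below is a simplified illustration under stated assumptions, not HAPiCL's exact formulation: it assumes same-location foreground pixels across two augmented views act as positive pairs, while background pixels of the other view act as negatives; the function name and pairing scheme are hypothetical.

```python
import numpy as np

def pixel_contrastive_loss(feat_a, feat_b, mask, temperature=0.1):
    """Minimal InfoNCE-style sketch of a pixel-level contrastive loss with
    a foreground mask (an illustration, not HAPiCL's exact formulation).
    feat_a, feat_b: (H*W, C) pixel embeddings from two augmented views;
    mask: binary (H*W,) foreground mask shared by both views."""
    # L2-normalise so dot products become cosine similarities.
    feat_a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    feat_b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    fg = mask.astype(bool)

    # Positives: the same foreground pixel location across the two views.
    pos = (feat_a[fg] * feat_b[fg]).sum(axis=1, keepdims=True) / temperature
    # Negatives: foreground pixels of view A vs. background pixels of view B.
    neg = feat_a[fg] @ feat_b[~fg].T / temperature

    logits = np.concatenate([pos, neg], axis=1)
    log_denom = np.log(np.exp(logits).sum(axis=1))
    # Cross-entropy that pulls matching foreground pixels together and
    # pushes them away from background pixels.
    return float(np.mean(log_denom - pos[:, 0]))
```

When the two views agree on the foreground, the loss is near zero; when foreground pixels of one view resemble the background of the other, the loss grows, which is the separation the pixel-level objective encourages.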