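Author: 陳文研 (Tran Van Nhiem)
Thesis title: Deep Learning Foundation Model with Self-Supervised Learning (深度學習基礎模型與自監督學習)
Advisors: 王家慶 (Jia-Ching Wang), 栗永徽 (Yung-Hui Li)
Oral examination committee:
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science - Department of Computer Science & Information Engineering
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 131
Keywords (Chinese, translated): Self-Supervised Learning, Computer Vision, Visual Representation Learning, Deep Neural Network, Image Analysis, Feature Learning
Keywords (English): Self-Supervised Learning, Deep Learning Foundation Model, Computer Vision Foundation Model, Visual Representation Learning, Deep Neural Network, Image Processing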
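  • Recent developments in self-supervised learning have led us to see its potential to replace traditional supervised learning, in particular because it addresses two problems of supervised learning: the need for large amounts of labeled data and poor generalization across tasks. Self-supervised learning pre-trains deep neural networks on easily obtained unlabeled data and then fine-tunes them on downstream tasks, requiring less labeled data than supervised learning. Notably, self-supervised learning has been successful in several domains, including text, vision, and speech.
    In this thesis, we propose several novel self-supervised learning methods for visual representation learning that improve results on multiple downstream computer vision tasks. These methods aim to generate learning targets from the input data itself. Our first method, HAPiCLR, uses pixel-level information in the contextual representation of an image, combined with a contrastive learning objective, to learn more effective image representations for downstream tasks. The second method, HARL, introduces a heuristic attention-based approach that maximizes abstract object-level embeddings in vector space, producing higher-quality semantic representations. Finally, the MVMA framework combines multiple augmented inputs and uses the global and local information of each training sample to explore a wide range of image appearances. The resulting representations are highly robust to images at different scales, which improves both generalization to downstream tasks and training efficiency.
    These methods significantly improve performance on tasks such as image classification, object detection, and semantic segmentation. They demonstrate the ability of self-supervised learning to extract image features, improving the effectiveness and efficiency of deep neural networks across computer vision tasks. This thesis not only introduces new learning algorithms but also provides a comprehensive analysis of self-supervised representations, revealing the factors that distinguish different models. Overall, it presents a set of innovative, efficient, and highly generalizable self-supervised learning methods that help self-supervised models generalize better to downstream tasks.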


    Recent advances in self-supervised learning have shown promise as an alternative to supervised learning, particularly for addressing its critical shortcomings: the need for abundant labeled data and the inability to leverage prior knowledge and skills. Self-supervised learning involves pre-training deep neural networks on pretext tasks using easily acquirable, unlabeled data and then fine-tuning them on downstream tasks of interest, requiring less labeled data than supervised learning. Notably, self-supervised learning has demonstrated success in diverse domains, including text, vision, and speech.
    In this thesis, we present several novel self-supervised learning methods for visual representation learning that improve performance on multiple downstream computer vision tasks. These methods are designed to generate learning targets from the input data itself. Our first method, HAPiCLR, leverages pixel-level information from an object's contextual representation with a contrastive learning objective, allowing it to learn more robust and efficient image representations for downstream tasks. The second method, HARL, introduces a heuristic attention-based approach that maximizes the abstract object-level embedding in vector space, resulting in higher-quality semantic representations. Finally, the MVMA framework combines multiple augmentation pipelines and leverages both global and local information from each training sample, allowing it to explore a vast range of image appearances. This approach yields representations that are invariant not only to scale but also to nuisance factors, making them more robust and efficient for downstream tasks.
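    To make the phrase "contrastive learning objective" concrete, the following is a minimal sketch of a generic NT-Xent (normalized temperature-scaled cross-entropy) loss of the kind popularized by SimCLR. It is not HAPiCLR's pixel-level objective, whose exact form is not given in this record; the function names, the plain-list embeddings, and the temperature of 0.5 are all illustrative assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """Generic NT-Xent loss over a batch of paired embeddings.

    view_a[i] and view_b[i] are assumed to be embeddings of two
    augmentations of the same image i (hypothetical inputs).
    """
    if len(view_a) != len(view_b) or not view_a:
        raise ValueError("need two non-empty batches of equal size")
    n = len(view_a)
    z = [l2_normalize(v) for v in list(view_a) + list(view_b)]  # 2n embeddings
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # the other view of the same image
        # Temperature-scaled cosine similarity to every other embedding.
        sims = {j: dot(z[i], z[j]) / temperature for j in range(2 * n) if j != i}
        # Numerically stable log-sum-exp over positives and negatives.
        peak = max(sims.values())
        log_denominator = peak + math.log(
            sum(math.exp(s - peak) for s in sims.values())
        )
        total += log_denominator - sims[positive]
    return total / (2 * n)
```

    The loss is low when the two views of each image are close to each other and far from the views of other images, which is the behavior a contrastive objective is meant to reward.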
    These methods notably improve performance on tasks such as image classification, object detection, and semantic segmentation. They demonstrate the ability of self-supervised algorithms to transform high-level image properties, thereby enhancing deep neural network efficiency in various computer vision tasks. This thesis not only introduces new learning algorithms but also provides a comprehensive analysis of self-supervised representations and the factors that differentiate various models. Overall, it presents a suite of innovative, adaptable, and efficient approaches to self-supervised learning of image representations, significantly boosting the robustness and effectiveness of the learned features.
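    The "global and local information from each training sample" described for MVMA can be illustrated with a multi-crop sampler that draws a few large (global) and several small (local) crop boxes from one image. This is only a sketch under stated assumptions: the view counts, the area ranges, and the 3/4 to 4/3 aspect-ratio range are hypothetical defaults, not the settings used in the thesis, and the functions return crop coordinates rather than applying MVMA's augmentation pipelines.

```python
import math
import random


def sample_crop_box(width, height, scale_range, rng):
    """Return (left, top, crop_w, crop_h) for one random crop.

    The target area is a random fraction of the image area; the box is
    clamped to the image, so the realized area can be smaller.
    """
    low, high = scale_range
    area = rng.uniform(low, high) * width * height
    # Log-uniform aspect ratio in [3/4, 4/3] (an assumed range).
    ratio = math.exp(rng.uniform(math.log(3 / 4), math.log(4 / 3)))
    crop_w = min(width, max(1, round(math.sqrt(area * ratio))))
    crop_h = min(height, max(1, round(math.sqrt(area / ratio))))
    left = rng.randint(0, width - crop_w)
    top = rng.randint(0, height - crop_h)
    return left, top, crop_w, crop_h


def multi_crop_boxes(width, height, n_global=2, n_local=6,
                     global_scale=(0.4, 1.0), local_scale=(0.05, 0.4),
                     seed=None):
    """Sample crop boxes for global and local views of one image.

    All default values are hypothetical and chosen for illustration only.
    """
    rng = random.Random(seed)
    return {
        "global": [sample_crop_box(width, height, global_scale, rng)
                   for _ in range(n_global)],
        "local": [sample_crop_box(width, height, local_scale, rng)
                  for _ in range(n_local)],
    }


if __name__ == "__main__":
    views = multi_crop_boxes(224, 224, seed=0)
    for kind, boxes in views.items():
        print(kind, boxes)
```

    In a full pipeline each box would be cropped, resized to a fixed resolution (typically smaller for local views), and passed through one of several augmentation pipelines before encoding.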

    List of Contents
    List of Figures
    List of Tables
    List of Abbreviations
    Chapter I. Introduction
      1-1. Introduction
      1-2. Thesis Contributions
      1-3. Chapter Guide
    Chapter II. Self-Supervised Learning History Development and Current State
      2-1. Representation Learning
        2-1-1. Foundation Model Representation Learning via Supervised Learning
        2-1-2. Foundation Model Representation Learning via Self-supervised
      2-2. History and evolution of self-supervised learning
      2-3. Main Categories of Self-supervised Learning
        2-3-1. Contrastive learning methods
        2-3-2. Predictive learning Distillation-based methods
        2-3-3. Redundancy reduction methods
        2-3-4. Reconstruction Self-supervised methods
        2-3-5. Generative SSL methods
      2-4. Research Gaps and Limitations
    Chapter III. Self-supervised Contrastive Learning on Pixel-Level
      3-1. Introduction
      3-2. Related Work
      3-3. Methodology
      3-4. Implementation Detail
        3-4-1. Dataset and image augmentation
        3-4-2. Neural Network Architecture
        3-4-3. Optimization Objective
      3-5. Evaluation Protocol
        3-5-1. Performance with Linear Evaluation and Semi-supervised Learning on ImageNet Dataset
        3-5-2. Transfer Learning to Other Downstream Tasks
      3-6. Ablation and Analysis
        3-6-1. Mask Cropping Strategies
        3-6-2. Objective Loss Functions
        3-6-3. Batch Size
        3-6-4. Projection Head
      3-7. Chapter Summary
      3-8. Supplement Section
        3-8-A. Implementation Details
          3-8-A-1. Heuristic Mask Proposal Generator
          3-8-A-2. Implementation: Data Augmentation
        3-8-B. Evaluation on ImageNet and Transfer Learning
          3-8-B-1. Linear evaluation semi-supervised protocol on ImageNet
          3-8-B-2. Transfer Learning
    Chapter IV. Heuristic Attention Representation Learning for Predictive Learning Self-Supervised Pretraining
      4-1. Introduction
      4-2. Related Work
      4-3. Methods
        4-3-1. HARL Framework
        4-3-2. Heuristic Binary Mask
      4-4. Experiments
      4-5. Evaluation Protocol
        4-5-1. Linear Evaluation and Semi-Supervised Learning on ImageNet Dataset
        4-5-2. Transfer Learning to Other Downstream Tasks
      4-6. Ablation and Analysis
        4-6-1. The Output of Spatial Feature Map (Size and Dimension)
        4-6-2. Objective Loss Functions
          4-6-2-1. Mask loss
          4-6-2-2. Hybrid loss
          4-6-2-3. Mask loss versus hybrid loss
        4-6-3. The Impact of Heuristic Mask Quality
      4-7. Conclusion
      4-8. Supplement Implementation Detail
        4-8-1. Implementation Data Augmentation
        4-8-2. Implementation Masking Feature
        4-8-3. Evaluation on the ImageNet and Transfer Learning
          4-8-3-1. Linear evaluation semi-supervised protocol on ImageNet
          4-8-3-2. Transfer via linear classification and fine-tuning
          4-8-3-3. Transfer learning to other vision tasks
        4-8-4. Heuristic Mask Proposal Methods
          4-8-4-1. Heuristic binary mask generates using DRFI
          4-8-4-2. Heuristic binary mask generates using unsupervised deep learning
    Chapter V. Multi-View and Multi-Augmentation for Self-Supervised Visual Representation Learning
      5-1. Introduction
      5-2. Related Work
        5-2-1. Self-Supervised Learning
        5-2-2. Cropping Strategy
        5-2-3. Multi-Cropping
        5-2-4. Data Augmentation Searching
      5-3. Methodology
        5-3-1. Multi-Cropping
        5-3-2. Multi-Data Augmentation
        5-3-3. Loss Function
      5-4. Experiments
        5-4-1. SSL Pre-training Setup
        5-4-2. Evaluation Protocol and Main Results
          5-4-2-1. Evaluation on ImageNet
          5-4-2-2. Evaluation on multiple natural image classification tasks
          5-4-2-3. Evaluation on downstream task transfer
          5-4-2-4. Discovering semantic scene layouts by observing the self-attention map
      5-5. Ablation Study
        5-5-1. Global and Local View Crop Ratio and Resolution
        5-5-1. Number of Cropped Views
        5-5-2. Number of Augmentation Strategies
        5-5-3. Global- and Local-View Loss
      5-6. Supplement Implementation Detail
        5-6-1. Implement of MVMA multi-data augmentation
      5-7. Conclusion
    Chapter VI. Conclusion
      6-1. Summary
      6-2. Discussion
        6-2-1. Implications and Applications of Self-supervised Learning
        6-2-2. Limitations
      6-3. Future Direction
        6-3-1. Improving the Quality of Representation
        6-3-2. Building Self-Supervised Multi-Modal Models
        6-3-3. Exploring New Self-Supervised Application Domain
    Bibliography

    1 Tan, M., and Le, Q.V.: ‘EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks’, ArXiv, 2019, abs/1905.11946
    2 He, K., Zhang, X., Ren, S., and Sun, J.: ‘Deep Residual Learning for Image Recognition’, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778
    3 Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I.: ‘Attention is All you Need’, ArXiv, 2017, abs/1706.03762
    4 Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I.: ‘Robust Speech Recognition via Large-Scale Weak Supervision’, ArXiv, 2022, abs/2212.04356
    5 Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G., and Yu, D.: ‘Convolutional Neural Networks for Speech Recognition’, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22, pp. 1533-1545
    6 Sun, C., Shrivastava, A., Singh, S., and Gupta, A.K.: ‘Revisiting Unreasonable Effectiveness of Data in Deep Learning Era’, 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 843-852
    7 Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N.: ‘Big Transfer (BiT): General Visual Representation Learning’, in Editor (Ed.)^(Eds.): ‘Book Big Transfer (BiT): General Visual Representation Learning’ (2019, edn.), pp.
    8 LeCun, Y., Bengio, Y., and Hinton, G.: ‘Deep Learning’, Nature, 2015, 521, pp. 436-444
    9 Eslami, S.M.A., Jimenez Rezende, D., Besse, F., Viola, F., Morcos, A.S., Garnelo, M., Ruderman, A., Rusu, A.A., Danihelka, I., Gregor, K., Reichert, D.P., Buesing, L., Weber, T., Vinyals, O., Rosenbaum, D., Rabinowitz, N.C., King, H., Hillier, C., Botvinick, M.M., Wierstra, D., Kavukcuoglu, K., and Hassabis, D.: ‘Neural scene representation and rendering’, Science, 2018, 360, pp. 1204 - 1210
    10 Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.S., Berg, A.C., and Fei-Fei, L.: ‘ImageNet Large Scale Visual Recognition Challenge’, International Journal of Computer Vision, 2015, 115, pp. 211-252
    11 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I.: ‘Learning Transferable Visual Models From Natural Language Supervision’, in Editor (Ed.)^(Eds.): ‘Book Learning Transferable Visual Models From Natural Language Supervision’ (2021, edn.), pp.
    12 Misra, Y.L.a.I.: ‘ Self-supervised learning: The dark matter of intelligence.’, in Editor (Ed.)^(Eds.): ‘Book Self-supervised learning: The dark matter of intelligence.’ (2022, edn.), pp.
    13 Chen, T., Kornblith, S., Norouzi, M., and Hinton, G.: ‘A simple framework for contrastive learning of visual representations’, in Editor (Ed.)^(Eds.): ‘Book A simple framework for contrastive learning of visual representations’ (PMLR, 2020, edn.), pp. 1597-1607
    14 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., and Gheshlaghi Azar, M.: ‘Bootstrap your own latent-a new approach to self-supervised learning’, Advances in neural information processing systems, 2020, 33, pp. 21271-21284
    15 Goyal, P., Caron, M., Lefaudeux, B., Xu, M., Wang, P., Pai, V., Singh, M., Liptchinsky, V., Misra, I., Joulin, A., and Bojanowski, P.: ‘Self-supervised Pretraining of Visual Features in the Wild’, ArXiv, 2021, abs/2103.01988
    16 Caron, M., Touvron, H., Misra, I., J'egou, H.e., Mairal, J., Bojanowski, P., and Joulin, A.: ‘Emerging Properties in Self-Supervised Vision Transformers’, 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9630-9640
    17 Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., and Zhuang, Y.: ‘Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction’, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10326-10335
    18 Alwassel, H., Mahajan, D.K., Torresani, L., Ghanem, B., and Tran, D.: ‘Self-Supervised Learning by Cross-Modal Audio-Video Clustering’, ArXiv, 2019, abs/1911.12667
    19 Baevski, A., Zhou, H., Mohamed, A.-r., and Auli, M.: ‘wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations’, ArXiv, 2020, abs/2006.11477
    20 Gong, Y., Lai, C.-I., Chung, Y.-A., and Glass, J.R.: ‘SSAST: Self-Supervised Audio Spectrogram Transformer’, in Editor (Ed.)^(Eds.): ‘Book SSAST: Self-Supervised Audio Spectrogram Transformer’ (2021, edn.), pp.
    21 Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.: ‘BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding’, ArXiv, 2019, abs/1810.04805
    22 Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V.: ‘RoBERTa: A Robustly Optimized BERT Pretraining Approach’, ArXiv, 2019, abs/1907.11692
    23 Xie, Y., Xu, Z., Wang, Z., and Ji, S.: ‘Self-Supervised Learning of Graph Neural Networks: A Unified Review’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 45, pp. 2412-2429
    24 Goyal, P., Mahajan, D.K., Gupta, A.K., and Misra, I.: ‘Scaling and Benchmarking Self-Supervised Visual Representation Learning’, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6390-6399
    25 Goyal, P., Duval, Q., Seessel, I., Caron, M., Misra, I., Sagun, L., Joulin, A., and Bojanowski, P.: ‘Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision’, ArXiv, 2022, abs/2202.08360
    26 Bengio, Y., Courville, A.C., and Vincent, P.: ‘Representation Learning: A Review and New Perspectives’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 35, pp. 1798-1828
    27 Bottou, L.: ‘Large-Scale Machine Learning with Stochastic Gradient Descent’, in Editor (Ed.)^(Eds.): ‘Book Large-Scale Machine Learning with Stochastic Gradient Descent’ (2010, edn.), pp.
    28 Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y.: ‘Contractive Auto-Encoders: Explicit Invariance During Feature Extraction’, in Editor (Ed.)^(Eds.): ‘Book Contractive Auto-Encoders: Explicit Invariance During Feature Extraction’ (2011, edn.), pp.
    29 Goldberg, Y., and Levy, O.: ‘word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method’, ArXiv, 2014, abs/1402.3722
    30 Xie, J., Girshick, R.B., and Farhadi, A.: ‘Unsupervised Deep Embedding for Clustering Analysis’, ArXiv, 2015, abs/1511.06335
    31 Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., and Bengio, Y.: ‘Generative Adversarial Nets’, in Editor (Ed.)^(Eds.): ‘Book Generative Adversarial Nets’ (2014, edn.), pp.
    32 Larsson, G., Maire, M., and Shakhnarovich, G.: ‘Colorization as a Proxy Task for Visual Understanding’, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 840-849
    33 Noroozi, M., and Favaro, P.: ‘Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles’, in Editor (Ed.)^(Eds.): ‘Book Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles’ (2016, edn.), pp.
    34 Gidaris, S., Singh, P., and Komodakis, N.: ‘Unsupervised Representation Learning by Predicting Image Rotations’, ArXiv, 2018, abs/1803.07728
    35 Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., and Efros, A.A.: ‘Context Encoders: Feature Learning by Inpainting’, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2536-2544
    36 Oord, A.v.d., Li, Y., and Vinyals, O.: ‘Representation Learning with Contrastive Predictive Coding’, ArXiv, 2018, abs/1807.03748
    37 He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R.B.: ‘Momentum Contrast for Unsupervised Visual Representation Learning’, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9726-9735
    38 Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A.: ‘Unsupervised Learning of Visual Features by Contrasting Cluster Assignments’, ArXiv, 2020, abs/2006.09882
    39 He, K., Chen, X., Xie, S., Li, Y., Doll'ar, P., and Girshick, R.B.: ‘Masked Autoencoders Are Scalable Vision Learners’, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 15979-15988
    40 Bardes, A., Ponce, J., and LeCun, Y.: ‘VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning’, ArXiv, 2021, abs/2105.04906
    41 Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., and Auli, M.: ‘data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language’, in Editor (Ed.)^(Eds.): ‘Book data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language’ (2022, edn.), pp.
    42 Baevski, A., Babu, A., Hsu, W.-N., and Auli, M.: ‘Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language’, ArXiv, 2022, abs/2212.07525
    43 Misra, I., and Maaten, L.v.d.: ‘Self-Supervised Learning of Pretext-Invariant Representations’, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6706-6716
    44 Caron, M., Bojanowski, P., Joulin, A., and Douze, M.: ‘Deep Clustering for Unsupervised Learning of Visual Features’, in Editor (Ed.)^(Eds.): ‘Book Deep Clustering for Unsupervised Learning of Visual Features’ (2018, edn.), pp.
    45 Cuturi, M.: ‘Sinkhorn Distances: Lightspeed Computation of Optimal Transport’, in Editor (Ed.)^(Eds.): ‘Book Sinkhorn Distances: Lightspeed Computation of Optimal Transport’ (2013, edn.), pp.
    46 Hinton, G.E., Vinyals, O., and Dean, J.: ‘Distilling the Knowledge in a Neural Network’, ArXiv, 2015, abs/1503.02531
    47 Chen, X., and He, K.: ‘Exploring Simple Siamese Representation Learning’, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 15745-15753
    48 Gidaris, S., Bursuc, A., Puy, G., Komodakis, N., Cord, M., and Pérez, P.: ‘OBoW: Online Bag-of-Visual-Words Generation for Self-Supervised Learning’, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6826-6836
    49 Ermolov, A., Siarohin, A., Sangineto, E., and Sebe, N.: ‘Whitening for Self-Supervised Representation Learning’, in Editor (Ed.)^(Eds.): ‘Book Whitening for Self-Supervised Representation Learning’ (2020, edn.), pp.
    50 Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S.: ‘Barlow Twins: Self-Supervised Learning via Redundancy Reduction’, in Editor (Ed.)^(Eds.): ‘Book Barlow Twins: Self-Supervised Learning via Redundancy Reduction’ (2021, edn.), pp.
    51 Radford, A., and Narasimhan, K.: ‘Improving Language Understanding by Generative Pre-Training’, in Editor (Ed.)^(Eds.): ‘Book Improving Language Understanding by Generative Pre-Training’ (2018, edn.), pp.
    52 Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I.: ‘Language Models are Unsupervised Multitask Learners’, in Editor (Ed.)^(Eds.): ‘Book Language Models are Unsupervised Multitask Learners’ (2019, edn.), pp.
    53 Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T.J., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D.: ‘Language Models are Few-Shot Learners’, ArXiv, 2020, abs/2005.14165
    54 Chen, M., Radford, A., Wu, J., Jun, H., Dhariwal, P., Luan, D., and Sutskever, I.: ‘Generative Pretraining From Pixels’, in Editor (Ed.)^(Eds.): ‘Book Generative Pretraining From Pixels’ (2020, edn.), pp.
    55 Bao, H., Dong, L., and Wei, F.: ‘BEiT: BERT Pre-Training of Image Transformers’, ArXiv, 2021, abs/2106.08254
    56 Kingma, D.P., and Welling, M.: ‘Auto-Encoding Variational Bayes’, CoRR, 2013, abs/1312.6114
    57 Sohl-Dickstein, J.N., Weiss, E.A., Maheswaranathan, N., and Ganguli, S.: ‘Deep Unsupervised Learning using Nonequilibrium Thermodynamics’, ArXiv, 2015, abs/1503.03585
    58 Ho, J., Jain, A., and Abbeel, P.: ‘Denoising Diffusion Probabilistic Models’, ArXiv, 2020, abs/2006.11239
    59 Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G.E.: ‘Big self-supervised models are strong semi-supervised learners’, Advances in neural information processing systems, 2020, 33, pp. 22243-22255
    60 Bardes, A., Ponce, J., and LeCun, Y.: ‘Vicreg: Variance-invariance-covariance regularization for self-supervised learning’, arXiv preprint arXiv:2105.04906, 2021
    61 Bachman, P., Hjelm, R.D., and Buchwalter, W.: ‘Learning representations by maximizing mutual information across views’, Advances in neural information processing systems, 2019, 32
    62 Misra, I., and Maaten, L.v.d.: ‘Self-supervised learning of pretext-invariant representations’, in Editor (Ed.)^(Eds.): ‘Book Self-supervised learning of pretext-invariant representations’ (2020, edn.), pp. 6707-6717
    63 He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R.: ‘Momentum contrast for unsupervised visual representation learning’, in Editor (Ed.)^(Eds.): ‘Book Momentum contrast for unsupervised visual representation learning’ (2020, edn.), pp. 9729-9738
    64 Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., and Isola, P.: ‘What makes for good views for contrastive learning?’, Advances in Neural Information Processing Systems, 2020, 33, pp. 6827-6839
    65 Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A.: ‘Unsupervised learning of visual features by contrasting cluster assignments’, Advances in Neural Information Processing Systems, 2020, 33, pp. 9912-9924
    66 Chen, X., and He, K.: ‘Exploring simple siamese representation learning’, in Editor (Ed.)^(Eds.): ‘Book Exploring simple siamese representation learning’ (2021, edn.), pp. 15750-15758
    67 Gidaris, S., Bursuc, A., Puy, G., Komodakis, N., Cord, M., and Pérez, P.: ‘Online bag-of-visual-words generation for unsupervised representation learning’, arXiv preprint arXiv:2012.11552, 2020
    68 Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S.: ‘Barlow twins: Self-supervised learning via redundancy reduction’, in Editor (Ed.)^(Eds.): ‘Book Barlow twins: Self-supervised learning via redundancy reduction’ (PMLR, 2021, edn.), pp. 12310-12320
    69 Putri, W.R., Liu, S.-H., Aslam, M.S., Li, Y.-H., Chang, C.-C., and Wang, J.-C.: ‘Self-Supervised Learning Framework toward State-of-the-Art Iris Image Segmentation’, Sensors, 2022, 22, (6), pp. 2133
    70 Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., and Shah, R.: ‘Signature verification using a" siamese" time delay neural network’, Advances in neural information processing systems, 1993, 6
    71 Chopra, S., Hadsell, R., and LeCun, Y.: ‘Learning a similarity metric discriminatively, with application to face verification’, in Editor (Ed.)^(Eds.): ‘Book Learning a similarity metric discriminatively, with application to face verification’ (IEEE, 2005, edn.), pp. 539-546
    72 Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y.: ‘Learning deep representations by mutual information estimation and maximization’, arXiv preprint arXiv:1808.06670, 2018
    73 Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., and Hu, H.: ‘Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning’, in Editor (Ed.)^(Eds.): ‘Book Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning’ (2021, edn.), pp. 16684-16693
    74 Van Gansbeke, W., Vandenhende, S., Georgoulis, S., and Van Gool, L.: ‘Unsupervised semantic segmentation by contrasting object mask proposals’, in Editor (Ed.)^(Eds.): ‘Book Unsupervised semantic segmentation by contrasting object mask proposals’ (2021, edn.), pp. 10052-10062
    75 Wang, X., Zhang, R., Shen, C., Kong, T., and Li, L.: ‘Dense contrastive learning for self-supervised visual pre-training’, in Editor (Ed.)^(Eds.): ‘Book Dense contrastive learning for self-supervised visual pre-training’ (2021, edn.), pp. 3024-3033
    76 Iizuka, S., Simo-Serra, E., and Ishikawa, H.: ‘Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification’, ACM Transactions on Graphics (ToG), 2016, 35, (4), pp. 1-11
    77 Larsson, G., Maire, M., and Shakhnarovich, G.: ‘Colorization as a proxy task for visual understanding’, in Editor (Ed.)^(Eds.): ‘Book Colorization as a proxy task for visual understanding’ (2017, edn.), pp. 6874-6883
    78 Zhang, R., Isola, P., and Efros, A.A.: ‘Colorful image colorization’, in Editor (Ed.)^(Eds.): ‘Book Colorful image colorization’ (Springer, 2016, edn.), pp. 649-666
    79 Doersch, C., Gupta, A., and Efros, A.A.: ‘Unsupervised visual representation learning by context prediction’, in Editor (Ed.)^(Eds.): ‘Book Unsupervised visual representation learning by context prediction’ (2015, edn.), pp. 1422-1430
    80 Mundhenk, T.N., Ho, D., and Chen, B.Y.: ‘Improvements to context based self-supervised learning’, in Editor (Ed.)^(Eds.): ‘Book Improvements to context based self-supervised learning’ (2018, edn.), pp. 9339-9348
    81 Noroozi, M., and Favaro, P.: ‘Unsupervised learning of visual representations by solving jigsaw puzzles’, in Editor (Ed.)^(Eds.): ‘Book Unsupervised learning of visual representations by solving jigsaw puzzles’ (Springer, 2016, edn.), pp. 69-84
    82 Noroozi, M., Vinjimoor, A., Favaro, P., and Pirsiavash, H.: ‘Boosting self-supervised learning via knowledge transfer’, in Editor (Ed.)^(Eds.): ‘Book Boosting self-supervised learning via knowledge transfer’ (2018, edn.), pp. 9359-9367
    83 Ren, Z., and Lee, Y.J.: ‘Cross-domain self-supervised multi-task feature learning using synthetic imagery’, in Editor (Ed.)^(Eds.): ‘Book Cross-domain self-supervised multi-task feature learning using synthetic imagery’ (2018, edn.), pp. 762-771
    84 Asano, Y., Patrick, M., Rupprecht, C., and Vedaldi, A.: ‘Labelling unlabelled videos from scratch with multi-modal self-supervision’, Advances in Neural Information Processing Systems, 2020, 33, pp. 4660-4671
    85 Caron, M., Bojanowski, P., Joulin, A., and Douze, M.: ‘Deep clustering for unsupervised learning of visual features’, in Editor (Ed.)^(Eds.): ‘Book Deep clustering for unsupervised learning of visual features’ (2018, edn.), pp. 132-149
    86 Yan, X., Misra, I., Gupta, A., Ghadiyaram, D., and Mahajan, D.: ‘Clusterfit: Improving generalization of visual representations’, in Editor (Ed.)^(Eds.): ‘Book Clusterfit: Improving generalization of visual representations’ (2020, edn.), pp. 6509-6518
    87 Bojanowski, P., and Joulin, A.: ‘Unsupervised learning by predicting noise’, in Editor (Ed.)^(Eds.): ‘Book Unsupervised learning by predicting noise’ (PMLR, 2017, edn.), pp. 517-526
    88 Jenni, S., and Favaro, P.: ‘Self-supervised feature learning by learning to spot artifacts’, in Editor (Ed.)^(Eds.): ‘Book Self-supervised feature learning by learning to spot artifacts’ (2018, edn.), pp. 2733-2742
    89 Donahue, J., Krähenbühl, P., and Darrell, T.: ‘Adversarial feature learning’, arXiv preprint arXiv:1605.09782, 2016
    90 Donahue, J., and Simonyan, K.: ‘Large scale adversarial representation learning’, Advances in neural information processing systems, 2019, 32
    91 Mahendran, A., Thewlis, J., and Vedaldi, A.: ‘Cross pixel optical-flow similarity for self-supervised learning’, in Editor (Ed.)^(Eds.): ‘Book Cross pixel optical-flow similarity for self-supervised learning’ (Springer, 2018, edn.), pp. 99-116
    92 Zhan, X., Pan, X., Liu, Z., Lin, D., and Loy, C.C.: ‘Self-supervised learning via conditional motion propagation’, in Editor (Ed.)^(Eds.): ‘Book Self-supervised learning via conditional motion propagation’ (2019, edn.), pp. 1881-1889
    93 Noroozi, M., Pirsiavash, H., and Favaro, P.: ‘Representation learning by learning to count’, in Editor (Ed.)^(Eds.): ‘Book Representation learning by learning to count’ (2017, edn.), pp. 5898-5906
    94 Gidaris, S., Singh, P., and Komodakis, N.: ‘Unsupervised representation learning by predicting image rotations’, arXiv preprint arXiv:1803.07728, 2018
    95 Zhang, L., Qi, G.-J., Wang, L., and Luo, J.: ‘Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data’, in Editor (Ed.)^(Eds.): ‘Book Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data’ (2019, edn.), pp. 2547-2555
    96 Chaitanya, K., Erdil, E., Karani, N., and Konukoglu, E.: ‘Contrastive learning of global and local features for medical image segmentation with limited annotations’, Advances in Neural Information Processing Systems, 2020, 33, pp. 12546-12558
    97 Hadsell, R., Chopra, S., and LeCun, Y.: ‘Dimensionality reduction by learning an invariant mapping’, in Editor (Ed.)^(Eds.): ‘Book Dimensionality reduction by learning an invariant mapping’ (IEEE, 2006, edn.), pp. 1735-1742
    98 Li, J., Zhou, P., Xiong, C., and Hoi, S.C.: ‘Prototypical contrastive learning of unsupervised representations’, arXiv preprint arXiv:2005.04966, 2020
    99 Tian, Y., Krishnan, D., and Isola, P.: ‘Contrastive multiview coding’, in Editor (Ed.)^(Eds.): ‘Book Contrastive multiview coding’ (Springer, 2020, edn.), pp. 776-794
    100 Wu, Z., Xiong, Y., Yu, S.X., and Lin, D.: ‘Unsupervised feature learning via non-parametric instance discrimination’, in Editor (Ed.)^(Eds.): ‘Book Unsupervised feature learning via non-parametric instance discrimination’ (2018, edn.), pp. 3733-3742
    101 Ye, M., Zhang, X., Yuen, P.C., and Chang, S.-F.: ‘Unsupervised embedding learning via invariant and spreading instance feature’, in Editor (Ed.)^(Eds.): ‘Book Unsupervised embedding learning via invariant and spreading instance feature’ (2019, edn.), pp. 6210-6219
    102 Zhan, X., Liu, Z., Luo, P., Tang, X., and Loy, C.: ‘Mix-and-match tuning for self-supervised semantic segmentation’, in Editor (Ed.)^(Eds.): ‘Book Mix-and-match tuning for self-supervised semantic segmentation’ (2018, edn.), pp.
    103 Oord, A.v.d., Li, Y., and Vinyals, O.: ‘Representation learning with contrastive predictive coding’, arXiv preprint arXiv:1807.03748, 2018
    104 Chen, X., Fan, H., Girshick, R., and He, K.: ‘Improved baselines with momentum contrastive learning’, arXiv preprint arXiv:2003.04297, 2020
    105 Henaff, O.: ‘Data-efficient image recognition with contrastive predictive coding’, in Editor (Ed.)^(Eds.): ‘Book Data-efficient image recognition with contrastive predictive coding’ (PMLR, 2020, edn.), pp. 4182-4192
    106 Zhuang, C., Zhai, A.L., and Yamins, D.: ‘Local aggregation for unsupervised learning of visual embeddings’, in Editor (Ed.)^(Eds.): ‘Book Local aggregation for unsupervised learning of visual embeddings’ (2019, edn.), pp. 6002-6012
    107 Cao, Y., Xie, Z., Liu, B., Lin, Y., Zhang, Z., and Hu, H.: ‘Parametric instance classification for unsupervised visual feature learning’, Advances in neural information processing systems, 2020, 33, pp. 15614-15624
    108 Ioffe, S., and Szegedy, C.: ‘Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift’, ArXiv, 2015, abs/1502.03167
    109 Nair, V., and Hinton, G.E.: ‘Rectified Linear Units Improve Restricted Boltzmann Machines’, in Editor (Ed.)^(Eds.): ‘Book Rectified Linear Units Improve Restricted Boltzmann Machines’ (2010, edn.), pp.
    110 Nguyen, D.T., Dax, M., Mummadi, C.K., Ngo, T.-P.-N., Nguyen, T.H.P., Lou, Z., and Brox, T.: ‘DeepUSPS: Deep Robust Unsupervised Saliency Prediction With Self-Supervision’, in Editor (Ed.)^(Eds.): ‘Book DeepUSPS: Deep Robust Unsupervised Saliency Prediction With Self-Supervision’ (2019, edn.), pp.
    111 Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A.: ‘Going deeper with convolutions’, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9
    112 Zhang, S., Liew, J.H., Wei, Y., Wei, S., and Zhao, Y.: ‘Interactive Object Segmentation With Inside-Outside Guidance’, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 12231-12241
    113 You, Y., Gitman, I., and Ginsburg, B.: ‘Scaling SGD Batch Size to 32K for ImageNet Training’, ArXiv, 2017, abs/1708.03888
    114 Loshchilov, I., and Hutter, F.: ‘SGDR: Stochastic Gradient Descent with Warm Restarts’, arXiv: Learning, 2017
    115 Goyal, P., Dollár, P., Girshick, R.B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K.: ‘Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour’, ArXiv, 2017, abs/1706.02677
    116 Everingham, M., Gool, L.V., Williams, C.K.I., Winn, J.M., and Zisserman, A.: ‘The Pascal Visual Object Classes (VOC) Challenge’, International Journal of Computer Vision, 2009, 88, pp. 303-338
    117 Ren, S., He, K., Girshick, R.B., and Sun, J.: ‘Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39, pp. 1137-1149
    118 Lin, T.-Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C.L.: ‘Microsoft COCO: Common Objects in Context’, 2014
    119 He, K., Gkioxari, G., Dollár, P., and Girshick, R.B.: ‘Mask R-CNN’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42, pp. 386-397
    120 Bossard, L., Guillaumin, M., and Gool, L.V.: ‘Food-101 - Mining Discriminative Components with Random Forests’, 2014
    121 Krizhevsky, A.: ‘Learning Multiple Layers of Features from Tiny Images’, 2009
    122 Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., and Torralba, A.: ‘SUN database: Large-scale scene recognition from abbey to zoo’, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3485-3492
    123 Krause, J., Stark, M., Deng, J., and Fei-Fei, L.: ‘3D Object Representations for Fine-Grained Categorization’, 2013 IEEE International Conference on Computer Vision Workshops, 2013, pp. 554-561
    124 Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A.: ‘Describing Textures in the Wild’, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3606-3613
    125 Shu, Y., Kou, Z., Cao, Z., Wang, J., and Long, M.: ‘Zoo-Tuning: Adaptive Transfer from a Zoo of Models’, ArXiv, 2021, abs/2106.15434
    126 Yang, Q., Zhang, Y., Dai, W., and Pan, S.J.: ‘Transfer learning’ (Cambridge University Press, 2020)
    127 You, K., Kou, Z., Long, M., and Wang, J.: ‘Co-Tuning for Transfer Learning’, 2020
    128 Misra, I., Shrivastava, A., Gupta, A., and Hebert, M.: ‘Cross-Stitch Networks for Multi-task Learning’, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3994-4003
    129 Li, X., Xiong, H., Xu, C., and Dou, D.: ‘SMILE: Self-Distilled MIxup for Efficient Transfer LEarning’, ArXiv, 2021, abs/2103.13941
    130 Tishby, N., and Zaslavsky, N.: ‘Deep learning and the information bottleneck principle’, 2015 IEEE Information Theory Workshop (ITW), 2015, pp. 1-5
    131 Shwartz-Ziv, R., and Tishby, N.: ‘Opening the Black Box of Deep Neural Networks via Information’, ArXiv, 2017, abs/1703.00810
    132 Amjad, R.A., and Geiger, B.C.: ‘Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42, pp. 2225-2239
    133 Chen, T., Kornblith, S., Norouzi, M., and Hinton, G.E.: ‘A Simple Framework for Contrastive Learning of Visual Representations’, ArXiv, 2020, abs/2002.05709
    134 Misra, I., and Maaten, L.v.d.: ‘Self-Supervised Learning of Pretext-Invariant Representations’, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6706-6716
    135 Ermolov, A., Siarohin, A., Sangineto, E., and Sebe, N.: ‘Whitening for Self-Supervised Representation Learning’, 2021
    136 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A.: ‘Emerging Properties in Self-Supervised Vision Transformers’, ArXiv, 2021, abs/2104.14294
    137 Hayhoe, M.M., and Ballard, D.H.: ‘Eye movements in natural behavior’, Trends in Cognitive Sciences, 2005, 9, pp. 188-194
    138 Borji, A., Sihite, D.N., and Itti, L.: ‘Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling’, IEEE Transactions on Image Processing, 2013
    139 Benois-Pineau, J., and Callet, P.L.: ‘Visual Content Indexing and Retrieval with Psycho-Visual Models’, 2017
    140 Awh, E., Armstrong, K.M., and Moore, T.: ‘Visual and oculomotor selection: links, causes and implications for spatial attention’, Trends in Cognitive Sciences, 2006, 10, pp. 124-130
    141 Tian, Y., Chen, X., and Ganguli, S.: ‘Understanding self-supervised Learning Dynamics without Contrastive Pairs’, ArXiv, 2021, abs/2102.06810
    142 Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A.: ‘Extracting and composing robust features with denoising autoencoders’, 2008
    143 Bojanowski, P., and Joulin, A.: ‘Unsupervised Learning by Predicting Noise’, ArXiv, 2017, abs/1704.05310
    144 Noroozi, M., and Favaro, P.: ‘Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles’, 2016
    145 Zhang, R., Isola, P., and Efros, A.A.: ‘Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction’, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 645-654
    146 Mundhenk, T.N., Ho, D., and Chen, B.Y.: ‘Improvements to Context Based Self-Supervised Learning’, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 9339-9348
    147 Donahue, J., and Simonyan, K.: ‘Large Scale Adversarial Representation Learning’, 2019
    148 Bansal, V., Buckchash, H., and Raman, B.: ‘Discriminative Auto-Encoding for Classification and Representation Learning Problems’, IEEE Signal Processing Letters, 2021, 28, pp. 987-991
    149 Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G.E.: ‘Big Self-Supervised Models are Strong Semi-Supervised Learners’, ArXiv, 2020, abs/2006.10029
    150 Jaiswal, A., Babu, A.R., Zadeh, M.Z., Banerjee, D., and Makedon, F.: ‘A Survey on Contrastive Self-supervised Learning’, ArXiv, 2020, abs/2011.00362
    151 He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R.B.: ‘Momentum Contrast for Unsupervised Visual Representation Learning’, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9726-9735
    152 Zhang, X., and Maire, M.: ‘Self-Supervised Visual Representation Learning from Hierarchical Grouping’, ArXiv, 2020, abs/2012.03044
    153 Jiang, H., Yuan, Z., Cheng, M.-M., Gong, Y., Zheng, N., and Wang, J.: ‘Salient Object Detection: A Discriminative Regional Feature Integration Approach’, International Journal of Computer Vision, 2013, 123, pp. 251-268
    154 Kolesnikov, A., Zhai, X., and Beyer, L.: ‘Revisiting Self-Supervised Visual Representation Learning’, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1920-1929
    155 Ye, M., Zhang, X., Yuen, P., and Chang, S.-F.: ‘Unsupervised Embedding Learning via Invariant and Spreading Instance Feature’, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6203-6212
    156 Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Trischler, A., and Bengio, Y.: ‘Learning deep representations by mutual information estimation and maximization’, ArXiv, 2019, abs/1808.06670
    157 Kornblith, S., Shlens, J., and Le, Q.V.: ‘Do Better ImageNet Models Transfer Better?’, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2656-2666
    158 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., Piot, B., Kavukcuoglu, K., Munos, R., and Valko, M.: ‘Bootstrap your own latent a new approach to self-supervised learning’. Proc. 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 2020
    159 Chen, X., and He, K.: ‘Exploring Simple Siamese Representation Learning’, 2021
    160 Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., and Hu, H.: ‘Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning’, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16679-16688
    161 Chen, X., Fan, H., Girshick, R.B., and He, K.: ‘Improved Baselines with Momentum Contrastive Learning’, ArXiv, 2020, abs/2003.04297
    162 Hénaff, O.J., Srinivas, A., Fauw, J.D., Razavi, A., Doersch, C., Eslami, S.M.A., and Oord, A.v.d.: ‘Data-Efficient Image Recognition with Contrastive Predictive Coding’, ArXiv, 2020, abs/1905.09272
    163 Borji, A., Cheng, M.-M., Jiang, H., and Li, J.: ‘Salient Object Detection: A Benchmark’, IEEE Transactions on Image Processing, 2015, 24, pp. 5706-5722
    164 Wang, W., Lai, Q., Fu, H., Shen, J., and Ling, H.: ‘Salient Object Detection in the Deep Learning Era: An In-Depth Survey’, IEEE transactions on pattern analysis and machine intelligence, 2021, PP
    165 Zou, W., and Komodakis, N.: ‘HARF: Hierarchy-Associated Rich Features for Salient Object Detection’, 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 406-414
    166 Zhang, J., Zhang, T., Dai, Y., Harandi, M., and Hartley, R.I.: ‘Deep Unsupervised Saliency Detection: A Multiple Noisy Labeling Perspective’, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 9029-9038
    167 Van Gansbeke, W., Vandenhende, S., Georgoulis, S., and Gool, L.V.: ‘Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals’, ArXiv, 2021, abs/2102.06191
    168 Chen, T., Kornblith, S., Norouzi, M., and Hinton, G.: ‘A Simple Framework for Contrastive Learning of Visual Representations’. Proc. 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2020
    169 Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A.: ‘Unsupervised learning of visual features by contrasting cluster assignments’. Proc. 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 2020
    170 Zhao, Z., Zhang, Z., Chen, T., Singh, S., and Zhang, H.: ‘Image Augmentations for GAN Training’, ArXiv, 2020, abs/2006.02595
    171 Howard, A.G.: ‘Some Improvements on Deep Convolutional Neural Network Based Image Classification’, CoRR, 2014, abs/1312.5402
    172 Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., and Le, Q.V.: ‘AutoAugment: Learning Augmentation Strategies From Data’, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 113-123
    173 Cubuk, E.D., Zoph, B., Shlens, J., and Le, Q.V.: ‘Randaugment: Practical automated data augmentation with a reduced search space’, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 3008-3017
    174 Lim, S., Kim, I., Kim, T., Kim, C., and Kim, S.: ‘Fast AutoAugment’, 2019
    175 Caron, M., Bojanowski, P., Joulin, A., and Douze, M.: ‘Deep Clustering for Unsupervised Learning of Visual Features’, 2018
    176 Richemond, P.H., Grill, J.-B., Altché, F., Tallec, C., Strub, F., Brock, A., Smith, S., De, S., Pascanu, R., and Piot, B.: ‘BYOL works even without batch statistics’, arXiv preprint arXiv:2010.10241, 2020
    177 Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., and Hu, H.: ‘SimMIM: a Simple Framework for Masked Image Modeling’, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 9643-9653
    178 Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A.L., and Kong, T.: ‘iBOT: Image BERT Pre-Training with Online Tokenizer’, ArXiv, 2021, abs/2111.07832
    179 Oquab, M., Darcet, T., Moutakanni, T., Vo, H.Q., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M.G., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P.: ‘DINOv2: Learning Robust Visual Features without Supervision’, ArXiv, 2023, abs/2304.07193
    180 Tran, V.-N., Huang, C.-E., Liu, S., Yang, K.-L., Ko, T., and Li, Y.-h.: ‘Multi-Augmentation for Efficient Self-Supervised Visual Representation Learning’, 2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2022, pp. 1-4
    181 Krizhevsky, A., Sutskever, I., and Hinton, G.E.: ‘ImageNet classification with deep convolutional neural networks’, Communications of the ACM, 2012, 60, pp. 84-90
    182 Touvron, H., Vedaldi, A., Douze, M., and Jégou, H.: ‘Fixing the train-test resolution discrepancy’, Advances in neural information processing systems, 2019, 32
    183 Jones, D.R.: ‘A Taxonomy of Global Optimization Methods Based on Response Surfaces’, Journal of Global Optimization, 2001, 21, pp. 345-383
    184 Reed, C., Metzger, S., Srinivas, A., Darrell, T., and Keutzer, K.: ‘SelfAugment: Automatic Augmentation Policies for Self-Supervised Learning’, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2673-2682
    185 Radosavovic, I., Kosaraju, R.P., Girshick, R.B., He, K., and Dollár, P.: ‘Designing Network Design Spaces’, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10425-10433
    186 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N.: ‘An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale’, ArXiv, 2021, abs/2010.11929
    187 Salimans, T., and Kingma, D.P.: ‘Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks’, 2016
    188 Loshchilov, I., and Hutter, F.: ‘Fixing Weight Decay Regularization in Adam’, ArXiv, 2017, abs/1711.05101
    189 Chen, X., Xie, S., and He, K.: ‘An Empirical Study of Training Self-Supervised Vision Transformers’, 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9620-9629
    190 Lin, T.-Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., and Belongie, S.J.: ‘Feature Pyramid Networks for Object Detection’, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 936-944
    191 https://github.com/facebookresearch/detectron2, accessed 2023/11/24
    192 Lin, T.-Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., and Belongie, S.J.: ‘Feature Pyramid Networks for Object Detection’, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 936-944
    193 https://github.com/facebookresearch/detectron, accessed 2023/11/25
    194 Li, Y., Mao, H., Girshick, R.B., and He, K.: ‘Exploring Plain Vision Transformer Backbones for Object Detection’, ArXiv, 2022, abs/2203.16527
    195 Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., and Gool, L.V.: ‘The 2017 DAVIS Challenge on Video Object Segmentation’, ArXiv, 2017, abs/1704.00675
    196 Jabri, A., Owens, A., and Efros, A.A.: ‘Space-Time Correspondence as a Contrastive Random Walk’, ArXiv, 2020, abs/2006.14613
    197 Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., and Batra, D.: ‘Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization’, International Journal of Computer Vision, 2017, 128, pp. 336-359
