Attention-based plugin for CNN improvement: Front end and Back end

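Graduate student:	Chia-Lin Wu (吳佳霖)
Thesis title:	Attention-based plugin for CNN improvement: Front end and Back end (利用注意力插件改善卷積網路：使用前置與後置方法)
Advisor:	Guo-Chen Shih (施國琛)
Oral defense committee:
Degree:	Master
Department:	資訊電機學院 - Department of Computer Science & Information Engineering
Year of publication:	2019
Academic year of graduation:	107 (ROC calendar)
Language:	English
Number of pages:	71
Keywords (translated from Chinese):	convolutional network, plugin, attention model
Keywords (English):	CNN, plugin, attention model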

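A common task for convolutional neural networks (CNNs) is image classification, and the model structure can be extended to other kinds of work. For example, semantic image segmentation and object detection both build on convolutional architectures similar to those used for classification. Thanks to the feature recognition ability that CNNs provide, they offer a measurable performance improvement over traditional methods on these tasks. Most CNN designs take the raw image as input during both the training and testing phases of these tasks. Because feature extraction and selection in computer vision are not always predictable, the network's own learning ability lets it extract more suitable features. When the target described by the task does not cover the entire image, the CNN may take partly incorrect features into account when making predictions during training. To improve the correctness and stability of CNN models without discarding any implicit image information, we attempt to provide attention mask information to the deep learning model in several different forms. To allow comparison in the later experiments, we designed the methods around two main ideas. The first is the front-end approach, which supplies attention information in different forms at the model's input stage, mainly as additional features for better prediction. The other, the back-end approach, applies an additional sub-training task during the training phase to improve the model's ability to judge the correct position. In comparison, the second kind of method provided more reasonable improvement and compatibility for the target tasks of our experiments.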

A common task for convolutional neural networks (CNNs) is image classification, and the model structure has been extended to other kinds of work. For example, both semantic segmentation and object detection are based on techniques similar to those that solve the classification problem. Thanks to the pattern recognition ability of CNNs, they improve performance compared with traditional methods. Most CNN designs take the raw image as input in both the training and testing phases of these tasks, because suitable features in computer vision are not always predictable. When the target described by a task does not cover the entire image, the CNN model is free to learn patterns that may not belong to the target objects. To increase the correctness and robustness of a CNN model without losing any possible information in an image, we attempt to supply attention information to the model. For comparison, we use two groups of methods. The front end supplies the attention information in different forms, providing an additional feature for prediction. The back end aims to improve the model's ability to judge correct positions by applying an additional loss function in the training phase. In our experimental results, the back end provides more reasonable improvements and better compatibility.
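The two ideas in the abstract can be illustrated with a minimal sketch. This record does not give the thesis's actual definitions, so everything below is an assumption made for illustration: the mask is taken to be binary and built from bounding boxes, overlapping boxes are merged by union, the front end is shown as one extra input channel, and the back end is shown as a generic penalty on activation outside the mask (a stand-in, not the thesis's variance loss function). All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max), max-exclusive.
Box = Tuple[int, int, int, int]


def attention_mask(height: int, width: int, boxes: Sequence[Box]) -> List[List[float]]:
    """Build a binary mask: 1.0 inside any box, 0.0 elsewhere.

    Assumption: overlapping boxes are merged by taking their union.
    """
    mask = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image bounds before filling it.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1.0
    return mask


def add_mask_channel(image, mask):
    """Front-end sketch: append the mask as one extra channel per pixel.

    `image` is an H x W x C nested sequence; the result is H x W x (C + 1).
    """
    return [
        [list(pixel) + [m] for pixel, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def outside_mask_penalty(feature_map, mask) -> float:
    """Back-end stand-in: share of absolute activation lying outside the mask.

    Returns 0.0 when all activation is inside the mask (or there is none),
    and approaches 1.0 as activation moves outside it.
    """
    total = sum(abs(v) for row in feature_map for v in row)
    if total == 0.0:
        return 0.0
    outside = sum(
        abs(v) * (1.0 - m)
        for f_row, m_row in zip(feature_map, mask)
        for v, m in zip(f_row, m_row)
    )
    return outside / total
```

In a training loop, a term like `outside_mask_penalty` would be added to the main task loss with some weight. The feature map is assumed here to have been resized to the mask's resolution first.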

Introduction
Related work
1 Features extraction methods
1.1 Residual neural network
1.2 ResNeXt
1.3 MobileNet
1.4 SSD: Single Shot Multi-Box Detector
1.5 FPN: feature pyramid networks
2 Visual Attention methods
2.1 Show, Attend and Tell: Neural Image Caption
2.2 Residual Attention Network for Image Classification
2.3 Interpretable Convolutional Neural Networks
2.4 Dual Attention Network for Scene Segmentation
2.5 Pedestrians detection via Simultaneous Detection & Segmentation
Architecture
1 Attention mask
1.1 Definition of the attention mask
1.2 The overlapping between the bounding boxes
2 Front: Additional attention information
2.1 An additional information for a pre-trained model
2.2 Additional input path
2.3 Weighted fusion
3 Back: Variance Loss function
3.1 A new method for feature map visualization
3.2 Variance loss function
3.3 Modification for object detection
4 Classification task experiments
4.1 CUB200
4.2 Stanford Dog
5 Object detection task
5.1 PASCAL-VOC
Conclusion
Reference


                                

