
Author: Hua-Peng Chuang (莊華澎)
Title: Self-attention residual U-Net for surface defect segmentation (自我注意力殘差U網路的物體表面瑕疵分割)
Advisor: Din-Chang Tseng (曾定章)
Degree: Master
Department: Department of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2020
Graduation academic year: 108
Language: Chinese
Pages: 74
Chinese keywords: 瑕疵分割、語義分割、分割、U網路、自我注意力、注意力、殘差
English keywords: defect segmentation, semantic segmentation, segmentation, U-Net, self-attention, attention, residual
    Using machine vision to inspect products for defects is a problem that has been widely discussed for a long time; its appeal is that it raises production efficiency and the degree of automation. It can also replace human inspection in environments unsuited to manual work. Besides improving efficiency, it does not suffer the visual fatigue that human eyes develop over long sessions, so it maintains better efficiency and quality in highly repetitive tasks.
    For the problem of detecting defects on computer-keyboard keycaps, we adopt a deep-learning approach. A deep network is built layer by layer in imitation of biological neural networks. Before training, its parameters are randomly initialized values; a large amount of training data and a suitably defined loss function serve as the basis for learning. The parameters are adjusted over repeated training iterations until the model produces the expected inference results for a given input.
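The training cycle just described (random initialization, a loss function, iterative parameter updates) can be sketched in a few lines. This is a minimal illustration only: the one-variable linear model, the data, and the learning rate are assumptions made for the example, not the thesis network.

```python
import numpy as np

# Minimal sketch of the training cycle: randomly initialized parameters,
# a loss function, and repeated gradient updates until the model fits.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0                      # noise-free target: y = 3x + 1

w, b = rng.normal(), rng.normal()      # randomly initialized parameters
lr = 0.1
for _ in range(200):                   # iterative training
    err = w * x + b - y
    loss = np.mean(err ** 2)           # mean-squared-error loss
    w -= lr * 2.0 * np.mean(err * x)   # gradient step on w
    b -= lr * 2.0 * np.mean(err)       # gradient step on b
# after training, (w, b) has converged close to (3.0, 1.0)
```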
    The goal of this research is to locate defective regions with a semantic segmentation network; we modify the symmetric U-Net model. The core of the modification is to adjust how the encoder and decoder are connected and how many convolutional layers they use, and to add a self-attention mechanism that further improves learning. The encoder and decoder keep a fixed symmetric structure but use the residual blocks common in deep convolutional networks, with the number of convolutional layers tuned for efficiency. Self-attention is applied in two places: enhancing high-level features deep in the network, and fusing low-level features during upsampling. For high-level feature enhancement, two self-attention modules are applied to the encoder output to strengthen it along the position and channel dimensions, respectively. For low-level feature fusion, the decoder applies a self-attention enhancement to the features before the high-level and low-level features are concatenated.
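As an illustration of the position self-attention used for high-level feature enhancement, here is a minimal numpy sketch in the style of non-local / dual-attention modules. The learned 1x1 query, key, and value projections and the learned residual scale of the real module are omitted as simplifying assumptions.

```python
import numpy as np

def position_attention(feat):
    """Position self-attention over a C x H x W feature map (simplified).

    Every spatial position is re-expressed as a softmax-weighted sum of
    all positions, then added back to the input (residual connection).
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)                   # C x N, N = H*W positions
    energy = x.T @ x                             # N x N position affinities
    energy -= energy.max(axis=1, keepdims=True)  # stabilize the softmax
    attn = np.exp(energy)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over positions
    out = x @ attn.T                             # aggregate over all positions
    return feat + out.reshape(c, h, w)           # residual add

feat = np.random.default_rng(1).normal(size=(4, 8, 8))
enhanced = position_attention(feat)              # same shape, globally mixed
```

Channel attention is the transpose of the same idea: affinities are computed between the C channel maps instead of the N spatial positions.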
    Compared with U-Net, the modified network adds little hardware cost while markedly improving U-Net's ability to segment small surface-defect regions. The dataset used in the experiments consists of 659 keycap images from the same keyboard model; 594 images form the training set and the remaining 65 the test set, and data augmentation raises the number of training samples to 4,752. On the model side, we compared several network architectures and self-attention mechanisms. The symmetric encoder and decoder built from residual blocks raise recall by 1%; the position and channel attention modules attached after the encoder raise recall by 4% and 2%, respectively; the global attention upsample added at the stage where high- and low-level features are fused raises recall by another 2%. The final defect segmentation network combines all of the above and achieves 85% MIoU and 85% recall on the test set.


    Using computer-vision methods to detect product defects is a problem that has been widely discussed for a long time, attractive because it raises production efficiency and the degree of automation. It can also replace human inspection in environments unsuited to manual work. Besides improving efficiency, it does not suffer the visual fatigue that the human eye develops over long use, so it maintains better efficiency and quality in long, repetitive work.
    To solve the problem of detecting tiny defects on object surfaces, we adopt deep learning. A deep network imitates the structure of biological neural networks; before training, its parameters are all randomly initialized values. Learning requires a large amount of training data and a well-defined loss function: the parameters are adjusted at every training iteration until the model gives the expected outputs for a given input.
    The goal of this research is to find defective regions on object surfaces using semantic segmentation. We propose a symmetric model based on U-Net. The main ideas are to strengthen the encoder and decoder and to add a self-attention mechanism that further improves learning. The model keeps a symmetric structure, but residual blocks replace the plain stacked convolutions that lack skip connections, and the number of convolutional layers is adjusted for the task of detecting tiny surface defects. Self-attention is applied in two parts of the model: high-level feature enhancement, and the merging of high- and low-level features during upsampling. High-level feature enhancement is placed at the encoder output, where two self-attention modules perform position and channel self-attention enhancement, respectively. The other part sits in the decoder's upsampling step, where a module applies self-attention to enhance the low-level features before they are concatenated with the high-level features.
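The residual blocks mentioned above can be sketched as follows. The two weight matrices stand in for the block's convolutions and batch normalization is omitted, so this is a simplified assumption-based sketch, not the thesis implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Identity-skip residual block: y = relu(F(x) + x).

    w1, w2 stand in for the block's two convolutions; the skip
    connection lets the block learn a residual on top of the input.
    """
    h = relu(w1 @ x)       # first transform + activation
    h = w2 @ h             # second transform
    return relu(h + x)     # add the input back, then activate

rng = np.random.default_rng(2)
x = rng.normal(size=8)
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w1, w2)  # same dimensionality as the input
```

With both transforms zeroed, the block reduces to the identity path (up to the final activation), which is what makes deep stacks of such blocks easy to optimize.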
    Compared with U-Net, our model adds little hardware cost while improving the ability to detect small defect regions on object surfaces. In the experiments we use keycap images as the target of surface defect detection: 594 images are picked for training and 65 for testing. We compared many different network architectures and self-attention modules. The residual symmetric encoder and decoder raise recall by 1%; the position and channel attention modules raise recall by 4% and 2%, respectively; the global attention upsample raises recall by a further 2%. The final version of our defect segmentation network achieves 85% MIoU and 85% recall.
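The two reported metrics can be computed per image as below. This sketch assumes a single binary defect class (the thesis MIoU averages IoU over classes), and the tiny masks are illustrative.

```python
import numpy as np

def recall_and_iou(pred, gt):
    """Recall and IoU for binary defect masks (True = defect pixel)."""
    tp = np.logical_and(pred, gt).sum()        # defect pixels found
    fn = np.logical_and(~pred, gt).sum()       # defect pixels missed
    union = np.logical_or(pred, gt).sum()
    recall = tp / max(tp + fn, 1)              # found / all ground truth
    iou = tp / max(union, 1)                   # overlap / union
    return recall, iou

gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True                            # 4 true defect pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True                          # 6 predicted pixels, all 4 hit
r, i = recall_and_iou(pred, gt)                # recall = 1.0, IoU = 4/6
```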

    Contents:
    Abstract (Chinese); Abstract (English); Acknowledgements; Table of Contents; List of Figures; List of Tables
    Chapter 1 Introduction: 1.1 Motivation; 1.2 System architecture; 1.3 Features of the thesis; 1.4 Thesis organization
    Chapter 2 Related work: 2.1 Semantic segmentation; 2.2 Fully convolutional networks; 2.3 Encoder-decoder architecture; 2.4 Multi-resolution object analysis modules; 2.5 Attention mechanisms
    Chapter 3 Self-attention residual U-Net: 3.1 U-Net architecture; 3.2 Encoder; 3.3 Position and channel attention modules; 3.4 Decoder; 3.5 Loss function
    Chapter 4 Experiments and results: 4.1 Equipment and development environment; 4.2 Training the segmentation network; 4.3 Evaluation criteria; 4.4 Experiments and results
    Chapter 5 Conclusions and future work
    References

