| Student: | 張博崴 Po-Wei Chang |
|---|---|
| Thesis Title: | 使用階層式全卷積神經網路偵測街景文字 (Text Detection in Street View Images with Hierarchical Fully Convolutional Neural Networks) |
| Advisor: | 蘇柏齊 Po-Chyi Su |
| Committee: | |
| Degree: | 碩士 (Master) |
| Department: | 資訊電機學院 - 資訊工程學系 (Department of Computer Science & Information Engineering) |
| Year of Publication: | 2018 |
| Academic Year of Graduation: | 106 |
| Language: | Chinese |
| Pages: | 79 |
| Chinese Keywords: | 文字偵測、招牌偵測、街景、全卷積神經網路、區域候選網絡 |
| English Keywords: | text detection, sign detection, street view, fully convolutional network, region proposal network |
Considering that the traffic signs and shop signs appearing in street view images convey important image-related information, this research proposes a sign detection mechanism for street view images that locates the text and graphic regions within them. The challenges lie in the cluttered backgrounds of street scenes, whose textures often resemble text, and in the fact that signs may be occluded by other objects; weather, lighting, and shooting angle further complicate detection. In addition, Chinese can be written both vertically and horizontally, so text lines in these different orientations must be detected and distinguished. The proposed detection mechanism consists of two parts. The first part locates the regions occupied by road signs and shop signs: a Fully Convolutional Network (FCN) is trained as a sign detection model, and each detected sign is treated as a Region of Interest (ROI). The second part extracts text and logos within the ROIs: a Region Proposal Network (RPN) is trained as a text detection model that detects horizontal and vertical text lines separately, and the ROIs from the first part are used to suppress the RPN's false text detections. Finally, post-processing merges horizontal and vertical text lines, eliminates false detections, and resolves complex intersections among text lines, judging valid regions by text-line aspect ratio, area, intersection relationships, and sign background color. Experimental results show that the proposed method can effectively locate signs in complex street scenes and detect their text and graphic regions; we also discuss how the two deep network architectures are employed in this application.
Considering that traffic/shop signs appearing in street view images contain important visual information, such as the locations of scenes, advertising on billboards, and store information, a text/graph detection mechanism for street view images is proposed in this research. Many of these objects are not easy to extract with a fixed template. In addition, street view images often contain cluttered backgrounds, and objects such as buildings or trees may block parts of the signs, complicating detection. Weather, lighting conditions, and shooting angles add further challenges. Another issue arises from the Chinese writing style, as characters can be written vertically or horizontally; detecting text lines in both orientations is one of the contributions of this research. The proposed detection mechanism is divided into two parts. A Fully Convolutional Network (FCN) is used to train a detection model that locates the positions of signs in street view images, which are then viewed as regions of interest. The text lines and graphs in the sign regions are subsequently extracted by a Region Proposal Network (RPN). Finally, post-processing is applied to distinguish horizontal from vertical text lines and to eliminate false detections. Experimental results show the feasibility of the proposed scheme, especially when complex street views are investigated.
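As a rough illustration of the post-processing stage described above, the sketch below keeps only the text boxes that fall largely inside an FCN-detected sign ROI and labels each surviving box as horizontal or vertical by its aspect ratio. This is a minimal sketch of the idea, not the thesis implementation: the function names, box values, and the 0.7 containment threshold are all hypothetical.

```python
# Hypothetical sketch of ROI filtering and orientation labeling for the
# post-processing stage described in the abstract. Boxes are (x1, y1, x2, y2).

def inside_ratio(box, roi):
    """Fraction of `box`'s area that lies inside `roi`."""
    x1, y1 = max(box[0], roi[0]), max(box[1], roi[1])
    x2, y2 = min(box[2], roi[2]), min(box[3], roi[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0

def postprocess(text_boxes, sign_rois, keep_thresh=0.7):
    """Keep text boxes mostly inside some sign ROI; tag their orientation."""
    results = []
    for box in text_boxes:
        if not any(inside_ratio(box, roi) >= keep_thresh for roi in sign_rois):
            continue  # likely a false detection outside every sign region
        w, h = box[2] - box[0], box[3] - box[1]
        results.append((box, "horizontal" if w >= h else "vertical"))
    return results

rois = [(100, 100, 400, 300)]                # one detected sign region
boxes = [(120, 140, 380, 180),               # wide text line inside the sign
         (150, 110, 190, 290),               # tall (vertical) line inside the sign
         (500, 500, 700, 540)]               # outside every ROI -> dropped
print(postprocess(boxes, rois))
```

In a full system, this orientation label would decide whether horizontally or vertically detected text proposals are merged into the same text line; the ROI check is what lets the first-stage FCN suppress the RPN's false detections on the cluttered background.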