
Student: Kung-Yu Su (蘇冠宇)
Thesis Title: Traditional Chinese Scene Text Recognition Based on Attention-Residual Network (基於注意力殘差網路之繁體中文街景文字辨識)
Advisor: Po-Chyi Su (蘇柏齊)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Graduate Institute of Software Engineering
Year of Publication: 2020
Graduation Academic Year: 108
Language: Chinese
Pages: 76
Keywords: Computer Vision, Deep Learning, Scene Text Detection, Traditional Chinese Character Recognition
    Text on street signboards often conveys rich information, and recognizing the text in such images through computer vision techniques would benefit the development of many related applications. Although optical character recognition of documents is a mature technology, scene text recognition remains a very challenging task. Beyond factors such as diverse fonts, character sizes, and shooting angles, training data for Traditional Chinese characters is still scarce; it is difficult to collect images evenly across the large set of Chinese characters, and even a sufficiently large collection would face a data-imbalance problem. This research therefore uses several Traditional Chinese fonts to generate high-quality training images with automatic labels, simulating the complex text variations of street scenes while avoiding errors that manual labeling may introduce. The thesis also investigates how to make the synthesized Traditional Chinese text images resemble real street-view text more closely: adjusting brightness, applying geometric transformations, and adding outlines produce diverse training data that improve the model's robustness. For text detection and recognition, a two-stage algorithm is adopted. First, the DeepLab model detects the regions of single characters and text lines in street views by semantic segmentation; a Spatial Transformer Network (STN) then rectifies the skewed text cropped in the detection stage to facilitate feature extraction in the subsequent recognition stage. We improve the ResNet50 model with an attention mechanism to raise its accuracy on this large-scale classification task. Finally, the output text is verified and corrected by cross-checking the user's GPS information against place names from the Google Place API, further strengthening street-view text recognition. Experimental results show that the proposed scheme can effectively detect and recognize Traditional Chinese street-view text and outperforms Line OCR and Google Vision on complex street scenes.
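The augmentation ideas mentioned in the abstract (brightness adjustment, geometric transformation, and outline generation) can be illustrated with a minimal NumPy sketch. The glyph below is a stand-in bitmap and all parameter values are hypothetical; the thesis' actual pipeline rasterizes Traditional Chinese fonts, which this sketch does not attempt.

```python
import numpy as np

def adjust_brightness(img, factor):
    """Scale pixel intensities and clip to [0, 255] (brightness jitter)."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def shear_horizontal(img, shear):
    """Apply a simple horizontal shear, simulating a skewed shooting angle."""
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        offset = int(shear * (y - h / 2))
        for x in range(w):
            sx = x - offset
            if 0 <= sx < w:
                out[y, x] = img[y, sx]
    return out

def add_outline(img, thickness=1):
    """Approximate an outline by dilating the glyph mask and painting the rim."""
    mask = img > 127
    dilated = mask.copy()
    for _ in range(thickness):
        d = np.zeros_like(dilated)
        d[1:, :] |= dilated[:-1, :]
        d[:-1, :] |= dilated[1:, :]
        d[:, 1:] |= dilated[:, :-1]
        d[:, :-1] |= dilated[:, 1:]
        dilated |= d
    rim = dilated & ~mask
    out = img.copy()
    out[rim] = 255
    return out

# A stand-in 32x32 "glyph": a filled square instead of a rendered character.
glyph = np.zeros((32, 32), dtype=np.uint8)
glyph[8:24, 8:24] = 200

# Chain the three augmentations, as the synthetic-data stage would.
augmented = add_outline(shear_horizontal(adjust_brightness(glyph, 0.8), 0.3))
```

Sampling the brightness factor, shear angle, and outline thickness at random per image is what yields the diverse training set described above.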


    Text in natural scenes, especially street views, usually conveys rich information related to the images. Although recognition of scanned documents has been well studied, scene text recognition is still a challenging task due to variable fonts, inconsistent lighting conditions, different text orientations, background noise, camera shooting angles, and possible image distortions. This research aims at developing an effective Traditional Chinese recognition scheme for street views based on deep learning techniques. It should be noted that constructing a suitable training dataset is an essential step that significantly affects recognition performance. However, the large alphabet of Chinese characters is certainly an issue, which may cause the so-called data-imbalance problem when collecting corresponding images. In the proposed scheme, a synthetic dataset with automatic labeling is constructed using several fonts and data augmentation. In an input image, the potential regions of characters and text lines are located first. The possibly skewed images of single characters are then rectified by a Spatial Transformer Network to enhance performance. Next, the proposed attention-residual network improves the recognition accuracy of this large-scale classification. Finally, the recognized characters are combined along the detected text lines and corrected using place information from the Google Place API together with the user's location. The experimental results show that the proposed scheme correctly extracts the text from the selected areas of investigated images, and its recognition performance is superior to Line OCR and Google Vision in complex street scenes.
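The channel-attention mechanism that the abstract describes grafting onto the residual backbone can be sketched in the squeeze-and-excitation style of [30]. The NumPy code below uses random weights and hypothetical dimensions purely to show the computation (squeeze, excite, rescale); it is not the thesis' exact architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feature, w1, w2):
    """Squeeze-and-excitation style channel attention:
    squeeze (global average pool) -> excite (two FC layers) -> rescale."""
    squeezed = feature.mean(axis=(1, 2))   # (C,): global average pool
    hidden = np.maximum(0, w1 @ squeezed)  # (C//r,): bottleneck FC + ReLU
    scale = sigmoid(w2 @ hidden)           # (C,): per-channel weights in (0, 1)
    return feature * scale[:, None, None]  # reweight each feature map

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4                          # toy dimensions, reduction ratio r = 4
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // 4, C)) * 0.1
w2 = rng.standard_normal((C, C // 4)) * 0.1

out = channel_attention(feat, w1, w2)
# Inside a residual block, the reweighted maps are added back to the identity:
residual_out = feat + out
```

Because each channel is scaled by a learned weight in (0, 1), the network can emphasize feature maps that respond to character strokes and suppress background clutter before the residual addition.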

    Table of Contents
    Abstract (Chinese); Abstract; Table of Contents; List of Figures; List of Tables
    Chapter 1 Introduction
        1.1 Research Motivation
        1.2 Contributions
        1.3 Thesis Organization
    Chapter 2 Related Work
        2.1 Deep Learning Networks
        2.2 Deep Learning for Text Detection and Recognition
        2.3 Synthetic Training Sets
    Chapter 3 Proposed Method
        3.1 Street-View Text Detection Network
            3.1.1 DeepLab V3+ [25]
            3.1.2 Training Data for the Detection Networks
            3.1.3 Detection Results of the Text-Line and Single-Character Networks
        3.2 Traditional Chinese Street-View Recognition Network
            3.2.1 Class Selection
            3.2.2 Synthetic Background Generation
            3.2.3 Data Augmentation
            3.2.4 Network Design
            3.2.5 Implementation Details
    Chapter 4 Experimental Results
        4.1 Development Environment
        4.2 Convolutional Network Settings
        4.3 Real Street-View Test Set
            4.3.1 Results on the Real Test Set
        4.4 Comparison with Commercial Software
            4.4.1 Artistic Fonts
            4.4.2 Skewed Text
            4.4.3 Occlusion
            4.4.4 Complex Street Scenes
            4.4.5 Indoor Skewed Text
    Chapter 5 Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Work
    References

    [1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks”, In Advances in neural information processing systems, 2012.
    [2] Simonyan, Karen and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”, In International Conference on Learning Representations, 2015.
    [3] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the Inception Architecture for Computer Vision”, In IEEE conference on computer vision and pattern recognition, 2016.
    [4] Sergey Ioffe and Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, In International Conference on Machine Learning, 2015.
    [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition”, In IEEE conference on computer vision and pattern recognition, 2016.
    [6] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. “Densely Connected Convolutional Networks”, In IEEE conference on computer vision and pattern recognition, 2017.
    [7] Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation”, In IEEE conference on computer vision and pattern recognition, 2014.
    [8] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. “Selective search for object recognition”, International journal of computer vision, 2013.

    [9] Suykens, Johan AK, and Joos Vandewalle. “Least squares support vector machine classifiers”, Neural Processing Letters, 1999.
    [10] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster r-cnn: Towards real-time object detection with region proposal networks”, In Advances in neural information processing systems, 2015.
    [11] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You only look once: Unified, real-time object detection”, In IEEE conference on computer vision and pattern recognition, 2016.
    [12] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation”, In IEEE conference on computer vision and pattern recognition, 2015.
    [13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. “Mask r-cnn.”, In Proceedings of the IEEE international conference on computer vision, 2017.
    [14] Zhi Tian, Weilin Huang, Tong He, Pan He and Yu Qiao, “Detecting Text in Natural Image with Connectionist Text Proposal Network”, In European Conference on Computer Vision, 2016.
    [15] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory”, In Neural Computation, 1997.
    [16] Minghui Liao, Pengyuan Lyu, Minghang He, Cong Yao, Wenhao Wu and Xiang Bai, “Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes”, In European Conference on Computer Vision, 2018.
    [17] Baoguang Shi, Xiang Bai, and Cong Yao. “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition”, In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

    [18] Alex Graves, Santiago Fernández, Faustino Gomez and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks”, In International Conference on Machine Learning, 2006.
    [19] Fenfen Sheng, Zhineng Chen and Bo Xu, “NRTR: A No-Recurrence Sequence-to-Sequence Model for Scene Text Recognition”, In International Conference on Document Analysis and Recognition, 2019.
    [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin, “Attention is All you Need”, In Advances in neural information processing systems, 2017.
    [21] Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh and Hwalsuk Lee, “What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis”, In Proceedings of the IEEE international conference on computer vision, 2019.
    [21] Max Jaderberg, Karen Simonyan, Andrea Vedaldi and Andrew Zisserman, “Reading Text in the Wild with Convolutional Neural Networks”, International journal of computer vision, 2016.
    [22] Ankush Gupta, Andrea Vedaldi and Andrew Zisserman, “Synthetic Data for Text Localisation in Natural Images” (the SynthText in the Wild dataset), In IEEE conference on computer vision and pattern recognition, 2016.
    [23] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun and Hwalsuk Lee, “Character Region Awareness for Text Detection”, In IEEE conference on computer vision and pattern recognition, 2019.
    [24] Tai-Ling Yuan, Zhe Zhu, Kun Xu, Cheng-Jun Li and Shi-Min Hu, “Chinese Text in the Wild”, In IEEE conference on computer vision and pattern recognition, 2018.

    [25] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff and Hartwig Adam, “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation”, In European Conference on Computer Vision, 2018.
    [26] Yu, Fisher, and Vladlen Koltun. “Multi-scale context aggregation by dilated convolutions”, In International Conference on Learning Representations, 2016.
    [27] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy and Alan L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs”, In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
    [28] Understanding Chinese character encodings (認識中文字元碼), http://idv.sinica.edu.tw/bear/charcodes/Section05.htm
    [29] Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. “Spatial transformer networks”, In Advances in neural information processing systems, 2015.
    [30] Hu, Jie, Li Shen, and Gang Sun. “Squeeze-and-excitation networks”, In IEEE conference on computer vision and pattern recognition, 2018.
    [31] Glorot, Xavier, and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks”, In International Conference on Artificial Intelligence and Statistics, 2010.
    [32] Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization”, In International Conference on Learning Representations, 2015.
    [33] Lee, Junyeop, Sungrae Park, Jeonghun Baek, Seong Joon Oh, Seonghyeon Kim and Hwalsuk Lee. “On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention”, In IEEE conference on computer vision and pattern recognition workshops, 2020.

    [34] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. “Robust scene text recognition with automatic rectification”, In IEEE conference on computer vision and pattern recognition, 2016.
    [35] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. “Aster: An attentional scene text recognizer with flexible rectification”, In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
    [36] Wei Liu, Chaofeng Chen, and Kwan-Yee K. Wong. “Charnet: A character-aware neural network for distorted scene text recognition”, In AAAI Conference on Artificial Intelligence, 2018.
    [37] Wei Liu, Chaofeng Chen, Kwan-Yee K. Wong, Zhizhong Su, and Junyu Han. “Star-net: A spatial attention residue network for scene text recognition”, In British Machine Vision Conference, 2016.
    [38] Yunze Gao, Yingying Chen, Jinqiao Wang, Zhen Lei, XiaoYu Zhang, and Hanqing Lu. “Recurrent calibration network for irregular text recognition”, In IEEE conference on computer vision and pattern recognition, 2018.
    [39] Zhanzhan Cheng, Yangliu Xu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. “AON: Towards arbitrarily-oriented text recognition”, In IEEE conference on computer vision and pattern recognition, 2018.
    [40] Hui Li, Peng Wang, Chunhua Shen, and Guyu Zhang. “Show, attend and read: A simple and strong baseline for irregular text recognition”, In AAAI Conference on Artificial Intelligence, 2019.
    [41] Xiao Yang, Dafang He, Zihan Zhou, Daniel Kifer, and C Lee Giles. “Learning to read irregular text with attention mechanisms”, In International Joint Conferences on Artificial Intelligence, 2017. 
    [42] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel and Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, In International Conference on Machine Learning, 2015.
    [43] Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate”, In International Conference on Learning Representations, 2015.
    [44] Minh-Thang Luong, Hieu Pham, Christopher D. Manning, “Effective Approaches to Attention-based Neural Machine Translation”, In Empirical Methods in Natural Language Processing, 2015.
    [45] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong and R. Young, “ICDAR 2003 Robust Reading Competitions”, In International Conference on Document Analysis and Recognition, 2003.
    [46] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazán and Lluís Pere de las Heras, “ICDAR 2013 Robust Reading Competition”, In International Conference on Document Analysis and Recognition, 2013.
    [47] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida and Ernest Valveny, “ICDAR 2015 competition on Robust Reading”, In International Conference on Document Analysis and Recognition, 2015.
    [48] Raul Gomez, Baoguang Shi, Lluis Gomez, Lukas Neumann, Andreas Veit, Jiri Matas, Serge Belongie and Dimosthenis Karatzas, “ICDAR2017 Robust Reading Challenge on COCO-Text”, In International Conference on Document Analysis and Recognition, 2017.
