| Author: | 董怡廷 (Leslie Tong) |
|---|---|
| Thesis Title: | 繁體中文場景資料集建置暨文字定位與辨識之評估 (Designs of the Traditional Chinese Scene Text Dataset and Performance Evaluation for Text Detection and Recognition) |
| Advisor: | 蘇柏齊 (Po-Chyi Su) |
| Oral Defense Committee: | |
| Degree: | Master |
| Department: | Department of Computer Science & Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 110 (ROC calendar; 2021–2022) |
| Language: | English |
| Number of Pages: | 79 |
| Chinese Keywords: | deep learning, scene text dataset, text detection, text recognition |
| English Keywords: | Deep learning, scene text dataset, text detection, text recognition |
Texts in natural images contain rich information. Extracting and recognizing them, i.e., scene text detection and recognition, can facilitate many promising applications, so scene text analysis has become one of the active research topics in the field of computer vision. Nevertheless, most existing datasets and competitions in this area focus on English or other languages, and the Traditional Chinese used in Taiwan has received comparatively little attention. To promote research on Traditional Chinese scene text analysis, this study collected a large volume of street-view images to build a dataset called "Traditional Chinese Street-View Texts" (TCSVT), containing 20,188 carefully annotated images. Because the characters and strings in these scenes vary widely in orientation, size, and font, we formulated a set of labeling principles for scene text containing Chinese so that the annotations remain consistent: each labeled text line and character carries its location, its content, and its language type. After error checking and cleanup, the dataset was adopted in the 2021 AICUP Traditional Chinese Scene Text Recognition Competition, which consisted of three tasks: 1) Text-line Localization, 2) Traditional Chinese Text-line Recognition, and 3) Text Spotting and Recognition in Complex Streetscapes. This thesis defines the evaluation metrics for each task and presents the final competition results. The competition ran from April 2021 to December 2021. The three tasks attracted 341, 183, and 128 participating teams, with 246, 60, and 91 valid submissions, respectively.
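The abstract states that each labeled text line and character carries a location, a content string, and a language type. A minimal sketch of what one such record might look like follows; the field names and quadrilateral point format are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical illustration of one TCSVT-style annotation record: a text line
# with its quadrilateral location, transcription, language type, and
# per-character sub-annotations. Field names are assumed, not official.
annotation = {
    "points": [[120, 56], [300, 56], [300, 98], [120, 98]],  # corner coordinates
    "transcription": "便利商店",
    "language": "Chinese",
    "characters": [
        {"points": [[120, 56], [165, 56], [165, 98], [120, 98]], "char": "便"},
        # ... one entry per character in the text line
    ],
}

def text_lines_by_language(records, lang):
    """Return transcriptions of all text lines labeled with the given language."""
    return [r["transcription"] for r in records if r["language"] == lang]

print(text_lines_by_language([annotation], "Chinese"))  # ['便利商店']
```

Storing the language type per text line, as described in the abstract, makes it straightforward to filter a mixed Chinese/English/digit dataset for language-specific training or evaluation.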
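The abstract notes that evaluation metrics were defined for each task. For text-line localization, a standard building block is Intersection-over-Union (IoU) between predicted and ground-truth regions. The sketch below shows IoU and a greedy precision/recall matching for axis-aligned boxes; it is an illustration of the general idea only, since the competition's actual protocol (e.g., polygon-based matching) is detailed in the thesis body, and all names here are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def precision_recall(predictions, ground_truths, threshold=0.5):
    """Greedy one-to-one matching: a prediction is a hit if it overlaps a
    still-unmatched ground truth with IoU at or above the threshold."""
    matched = set()
    hits = 0
    for pred in predictions:
        for i, gt in enumerate(ground_truths):
            if i not in matched and iou(pred, gt) >= threshold:
                matched.add(i)
                hits += 1
                break
    precision = hits / len(predictions) if predictions else 0.0
    recall = hits / len(ground_truths) if ground_truths else 0.0
    return precision, recall
```

Recognition tasks are typically scored differently (e.g., exact-match accuracy or edit distance on the transcriptions), so a metric like this would apply only to the localization stage.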