Evaluating and Improving Tesseract for Optical Character Recognition of Color Web Pages


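Graduate student:	Yi-Cheng Chen (陳奕誠)
Thesis title:	Evaluating and Improving Tesseract for Optical Character Recognition of Color Web Pages (評估與改進Tesseract運用於彩色網頁的光學字元辨識)
Advisor:	Yong-Bin Zheng (鄭永斌)
Oral defense committee:
Degree:	Master
Department:	College of Information and Electrical Engineering - Department of Computer Science & Information Engineering
Year of publication:	2019
Academic year of graduation:	107 (ROC calendar)
Language:	Chinese
Number of pages:	55
Chinese keywords:	光學字元辨識 (optical character recognition)
Foreign-language keywords:	Tesseract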

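Abstract (translated from the Chinese): In the past, the success of optical character recognition (OCR) was inseparable from feature extraction: if the important features could not be extracted effectively, the recognition results inevitably fell short of expectations. As hardware and computing power improved, deep learning became a popular field in recent years. Its strength is the ability to extract features automatically, which in theory makes it possible to find good features and improve OCR accuracy.
According to an IBM estimate, about US$2.5 trillion is spent worldwide each year on manually keying documents stored on traditional, non-digital media into digital form. If the OCR recognition rate could be raised to a usable standard, a great deal of time and cost would be saved. No tool yet achieves a 100% recognition rate, because source images vary widely: noise from scanned documents and camera photos, complex layouts, text and background colors, icons of all sizes, and different languages and fonts all strongly affect the results.
The purpose of this study is to find a method that effectively improves the recognition rate of OCR software. The images used for recognition are web page screenshots, that is, images that are free of noise and already rectified. Because computer fonts are TrueType fonts, screenshots of the same page may differ from one screen to another. In our tests Google Vision had the best recognition rate, but Google Vision is a cloud service, and many factory machines are restricted to an internal network and cannot connect to the outside, so the open-source Tesseract 4.0 was chosen instead. The experiments show that applying Tesseract 4.0 directly to color web pages gives a very low recognition rate, but that preprocessing raises the rate substantially. Training separately for each page, on the other hand, does not effectively improve the recognition rate, because web page layouts are complex and font sizes are not fixed. Since Tesseract 4.0 is based on LSTM, its results suffer whenever text of different sizes is judged to be a single line.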

In the past, the success or failure of optical character recognition (OCR) was closely linked to feature extraction. If effective features could not be found, the results fell short of expectations. Improvements in hardware and computing power have since made deep learning a popular field, thanks to its ability to extract features automatically and, in theory, to find good features that improve OCR accuracy.
According to IBM estimates, about $2.5 trillion is spent each year on converting documents stored on non-digital media into digital files by manual typing. If the recognition rate of OCR can be raised to an acceptable standard, time and cost can be saved. No tool today has a recognition rate of 100%, because input images come from many different conditions: noise from scanned files and camera photos, complex typography, text and background colors, large and small icons, and different languages and fonts all greatly affect the recognition results.
The purpose of this thesis is to find a way to effectively improve the recognition rate of OCR software. We used screenshots of web pages, which are noise-free and already rectified. Computer fonts are TrueType fonts, so screenshots of the same page may differ from one screen to another. Our tests indicate that Google Vision, a cloud service, has a better recognition rate than the other software tested. However, many factory machines that need OCR cannot connect to the Internet, so we chose Tesseract 4.0, which is open source. The findings show that Tesseract 4.0 has a low recognition rate on color web pages when used directly, and that preprocessing improves its recognition rate far more than training does. The poor result of training is mainly caused by complex typography and varying text sizes.
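The abstract reports that preprocessing is what lifts Tesseract's recognition rate on color pages, and the thesis outline names Otsu binarization as one of the preprocessing steps. The sketch below is a generic, textbook version of Otsu thresholding in plain Python, not the author's implementation and not the TextFiller method, whose details are not given in this record. The function names `otsu_threshold` and `binarize`, the nested-list image representation, and the "text covers fewer pixels than background" polarity rule are all assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold t (0..255) for 8-bit gray values.

    Pixels <= t form one class and pixels > t the other; t is chosen to
    maximize the between-class variance of the two classes.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the class <= t
    sum0 = 0  # sum of gray values of the class <= t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break             # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray):
    """Binarize a grayscale image given as rows of ints in 0..255.

    Output uses 0 for text and 255 for background. Assumption (ours):
    text covers fewer pixels than background, so if the dark class is
    the majority the page is treated as light-on-dark and inverted.
    """
    flat = [p for row in gray for p in row]
    t = otsu_threshold(flat)
    dark = sum(1 for p in flat if p <= t)
    invert = dark > len(flat) - dark
    return [[0 if (p <= t) != invert else 255 for p in row] for row in gray]


if __name__ == "__main__":
    # Tiny synthetic "screenshot": light text (220) on a dark button (30).
    page = [
        [30, 30, 30, 30],
        [30, 220, 220, 30],
        [30, 30, 220, 30],
        [30, 30, 30, 30],
    ]
    for row in binarize(page):
        print(row)
```

In practice the grayscale values would come from a real screenshot, and the binarized result would be saved as an image and handed to Tesseract. Normalizing the output to dark text on a white background is a design choice of this sketch, made to keep the input to the OCR engine uniform regardless of the page's color scheme.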

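Abstract (Chinese)
Abstract (English)
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Applications of OCR
1-3 TrueType
1-4 Research Motivation
1-4 Research Methods and Contributions
1-5 Thesis Organization
2. Background and Related Work
2-1 Mainstream OCR Approaches
2-1-1 Convolutional Neural Networks
2-1-2 Recurrent Neural Networks
2-1-3 Long Short-Term Memory Networks
2-1-4 Connectionist Temporal Classification
2-2 Tesseract-OCR
2-2-1 Tesseract 4.0 Training
2-3 Google Vision
3. Image Preprocessing
3-1 Otsu's Binarization Method
3-1-1 Evaluation of Otsu's Binarization Method
3-2 TextFiller
3-2-1 TextFiller1
3-2-2 TextFiller2
3-2-3 Evaluation of TextFiller
4. Image Character Recognition
4-1 Google Login Page Recognition (simple, clean layout)
4-1-1 Recognition of the Original Image
4-1-2 Recognition after Otsu Binarization
4-1-3 Recognition after TextFiller1 Binarization
4-1-4 Recognition after TextFiller2 Binarization
4-2 Twitter Login Page Recognition (simple layout)
4-2-1 Recognition of the Original Image
4-2-2 Recognition after Otsu Binarization
4-2-3 Recognition after TextFiller1 Binarization
4-2-4 Recognition after TextFiller2 Binarization
4-3 Stackoverflow Login Page Recognition (complex layout)
4-3-1 Recognition of the Original Image
4-3-2 Recognition after Otsu Binarization
4-3-3 Recognition after TextFiller1 Binarization
4-3-4 Recognition after TextFiller2 Binarization
4-4 Facebook Login Page Recognition (highly varied text, multiple languages)
4-4-1 Recognition of the Original Image
4-4-2 Recognition after Otsu Binarization
4-4-3 Recognition after TextFiller1 Binarization
4-4-4 Recognition after TextFiller2 Binarization
4-5 Yahoo Login Page Recognition (cluttered background)
4-5-1 Recognition of the Original Image
4-5-2 Recognition after Otsu Binarization
4-5-3 Recognition after TextFiller1 Binarization
4-5-4 Recognition after TextFiller2 Binarization
5. Analysis of Recognition Results
5-1 Analysis of Tesseract's Weaknesses
5-1-1 Text Detection Fails Easily
5-1-2 Difficulty Recognizing Light Text on Dark Backgrounds
5-1-3 Recognition Failures Caused by TrueType
5-1-4 Recognition Results Converted to Lowercase
5-1-5 Background and Icons Treated as Text
5-2 Tesseract Training
5-3 Causes of Reduced Recognition Rate
6. Conclusion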

                                

