Monday, April 23, 2018

How to add new patterns to tesseract (create new traineddata)


Environment
jTessBoxEditorFX-2.0-Beta / tesseract 4.0.0-alpha.20180109
Download URL
https://jaist.dl.sourceforge.net/project/vietocr/jTessBoxEditor/jTessBoxEditorFX-2.0-Beta.zip


The zip file already includes tesseract 4.0, so it does not need to be downloaded separately.

Process steps
1. Draw the custom patterns in Paint. The file name must follow the format of the example, otherwise the later steps may not work. The file name used here is temp.font.exp0.tif
2. Use tesseract to build the box file with the following command line.
          tesseract temp.font.exp0.tif temp.font.exp0 batch.nochop makebox
3. Use jTessBoxEditor to adjust the .box file so that each box in the image maps to the correct letter.
4. Create the font_properties.txt file for the bat file to use.
5. Run the bat file to build temp.traineddata for Emgu to use.
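
Steps 1 and 4 can be sketched in Python. This is only an illustration, not part of the original procedure: it assumes the file name follows the usual Tesseract training convention <lang>.<fontname>.exp<num>.tif, and that each line of font_properties.txt is the font name followed by five 0/1 flags (italic, bold, fixed, serif, fraktur). The helper names parse_training_name and font_properties_line are made up for this example.

```python
import re
from pathlib import Path

# Assumed naming convention: <lang>.<fontname>.exp<num>.tif
NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tiff?$"
)


def parse_training_name(filename):
    """Split a training image name into (lang, font, page number)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected training file name: {filename!r}")
    return match.group("lang"), match.group("font"), int(match.group("num"))


def font_properties_line(font, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Build one font_properties.txt line: the font name plus five 0/1 flags."""
    return f"{font} {italic} {bold} {fixed} {serif} {fraktur}"


if __name__ == "__main__":
    lang, font, num = parse_training_name("temp.font.exp0.tif")
    # The font name in font_properties.txt has to match the middle part
    # of the image file name ("font" in this post).
    Path("font_properties.txt").write_text(
        font_properties_line(font) + "\n", encoding="ascii"
    )
```

With the file name from this post, the script writes a single line with the font name "font" and all flags set to 0.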

The following is the content of the bat file.
============START OF BAT===================
echo Run Tesseract for Training..
..\tesseract.exe temp.font.exp0.tif temp.font.exp0 nobatch box.train

echo Compute the Character Set..
..\unicharset_extractor.exe temp.font.exp0.box
..\shapeclustering.exe -F font_properties.txt -U unicharset temp.font.exp0.tr
..\mftraining.exe -F font_properties.txt -U unicharset temp.font.exp0.tr

echo Clustering..
..\cntraining.exe temp.font.exp0.tr

echo Rename Files..
rename normproto temp.normproto
rename inttemp temp.inttemp
rename pffmtable temp.pffmtable
rename shapetable temp.shapetable
rename unicharset temp.unicharset

..\combine_tessdata.exe temp.
==========END OF BAT======================
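
If combine_tessdata complains, it can help to check that the five files renamed by the bat file actually exist. The following is a small sketch for that check, not something the original setup ships with. The function name missing_components is hypothetical, and the component list is simply taken from the rename lines in the bat file above.

```python
from pathlib import Path

# The five files the bat file renames with the "temp." prefix.
COMPONENTS = ("unicharset", "inttemp", "pffmtable", "normproto", "shapetable")


def missing_components(prefix, directory="."):
    """Return the names of <prefix>.<component> files that do not exist."""
    base = Path(directory)
    return [
        f"{prefix}.{name}"
        for name in COMPONENTS
        if not (base / f"{prefix}.{name}").is_file()
    ]


if __name__ == "__main__":
    missing = missing_components("temp")
    if missing:
        print("Missing before combine_tessdata:", ", ".join(missing))
    else:
        print("All component files are present.")
```

Run it from the folder that contains the bat file, after the rename step.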

Sunday, April 22, 2018

How to add your own patterns to Tesseract OCR

Environment
jTessBoxEditorFX-2.0-Beta / tesseract 4.0.0-alpha.20180109

Download URL
https://jaist.dl.sourceforge.net/project/vietocr/jTessBoxEditor/jTessBoxEditorFX-2.0-Beta.zip

The zip file from this link already contains the tesseract runtime, so it does not need to be downloaded separately.

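Process steps
1. Create the pattern image. The file name seems to need a specific structure. Name the file temp.font.exp0.tif; if you are not sure you can get the format right, use exactly this name.
2. Use tesseract to process the pattern image and generate temp.font.exp0.box. The command that generates the .box file is

          tesseract temp.font.exp0.tif temp.font.exp0 batch.nochop makebox

3. Use jTessBoxEditor to adjust the .box file so that each pattern maps to the correct character.
4. Create the font_properties.txt file for the .bat file to use later.
5. Run the commands in the following .bat file to produce the final temp.traineddata that Emgu can use.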
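
Before opening jTessBoxEditor for step 3, the .box file can also be inspected with a few lines of Python. This is a sketch under the assumption that each line has the usual Tesseract box layout: the character, then left, bottom, right, top (pixels, origin at the bottom-left corner of the image) and the page number. The names Box, parse_box_line and suspicious_boxes are made up for this example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one .box line: '<char> <left> <bottom> <right> <top> <page>'."""
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"not a box line: {line!r}")
    char, *numbers = parts
    return Box(char, *(int(n) for n in numbers))


def suspicious_boxes(boxes):
    """Return boxes with zero or negative width or height."""
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]


if __name__ == "__main__":
    with open("temp.font.exp0.box", encoding="utf-8") as handle:
        boxes = [parse_box_line(line) for line in handle if line.strip()]
    for box in suspicious_boxes(boxes):
        print("check this box in jTessBoxEditor:", box)
```

It only flags boxes with obviously broken coordinates; whether a box is labeled with the right character still has to be checked by eye in jTessBoxEditor.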

The following is the content of the bat file.
============START OF BAT===================
echo Run Tesseract for Training..
..\tesseract.exe temp.font.exp0.tif temp.font.exp0 nobatch box.train

echo Compute the Character Set..
..\unicharset_extractor.exe temp.font.exp0.box
..\shapeclustering.exe -F font_properties.txt -U unicharset temp.font.exp0.tr
..\mftraining.exe -F font_properties.txt -U unicharset temp.font.exp0.tr

echo Clustering..
..\cntraining.exe temp.font.exp0.tr

echo Rename Files..
rename normproto temp.normproto
rename inttemp temp.inttemp
rename pffmtable temp.pffmtable
rename shapetable temp.shapetable
rename unicharset temp.unicharset

..\combine_tessdata.exe temp.
==========END OF BAT======================