Monday, April 23, 2018

How to add new patterns to Tesseract (creating a new traineddata file)


Environment
jTessBoxEditorFX-2.0-Beta / tesseract 4.0.0-alpha.20180109
Download URL:
https://jaist.dl.sourceforge.net/project/vietocr/jTessBoxEditor/jTessBoxEditorFX-2.0-Beta.zip


The zip file already includes tesseract 4.0, so there is no need to download it separately.

Process steps
1. Draw the custom patterns in Paint and save them as a .tif image. The file name must follow the pattern of the example, otherwise the process may not work. The file name is temp.font.exp0.tif
2. Use tesseract to generate the box file with the following command line.
          tesseract temp.font.exp0.tif temp.font.exp0 batch.nochop makebox
3. Use jTessBoxEditor to adjust the .box file so that each character is mapped to the correct region of the image.
4. Create the font_properties.txt file, which the bat file uses.
5. Run the bat file to build temp.traineddata for Emgu CV to use.
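
Step 4 does not show what goes into font_properties.txt. In the legacy Tesseract training format, each line of that file is "<fontname> <italic> <bold> <fixed> <serif> <fraktur>", where each flag is 0 or 1 and the font name matches the middle part of the image file name ("font" in temp.font.exp0.tif). The Python sketch below derives that line from the image name. The helper names are hypothetical, and setting every flag to 0 is an assumption that seems reasonable for hand-drawn patterns.

```python
from pathlib import Path
from typing import Tuple


def parse_training_name(filename: str) -> Tuple[str, str, int]:
    """Split '<lang>.<font>.exp<N>.tif' into (lang, font, N)."""
    parts = Path(filename).name.split(".")
    if len(parts) != 4:
        raise ValueError("expected <lang>.<font>.exp<N>.<ext>: " + filename)
    lang, font, exp, _ext = parts
    if not exp.startswith("exp") or not exp[3:].isdigit():
        raise ValueError("third part must look like exp0, exp1, ...: " + filename)
    return lang, font, int(exp[3:])


def font_properties_line(font: str, italic: int = 0, bold: int = 0,
                         fixed: int = 0, serif: int = 0, fraktur: int = 0) -> str:
    """Build one font_properties.txt line; all flags default to 0 (assumption)."""
    return f"{font} {italic} {bold} {fixed} {serif} {fraktur}"


def write_font_properties(image_name: str, directory: str = ".") -> Path:
    """Write font_properties.txt for a single training image into `directory`."""
    _lang, font, _exp = parse_training_name(image_name)
    out = Path(directory) / "font_properties.txt"
    out.write_text(font_properties_line(font) + "\n", encoding="ascii")
    return out
```

Calling write_font_properties("temp.font.exp0.tif") in the training folder would produce a file with the single line "font 0 0 0 0 0".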

The following is the content of the bat file.
============START OF BAT===================
echo Run Tesseract for Training..
..\tesseract.exe temp.font.exp0.tif temp.font.exp0 nobatch box.train

echo Compute the Character Set..
..\unicharset_extractor.exe temp.font.exp0.box
..\shapeclustering.exe -F font_properties.txt -U unicharset temp.font.exp0.tr
..\mftraining.exe -F font_properties.txt -U unicharset temp.font.exp0.tr

echo Clustering..
..\cntraining.exe temp.font.exp0.tr

echo Rename Files..
rename normproto temp.normproto
rename inttemp temp.inttemp
rename pffmtable temp.pffmtable
rename shapetable temp.shapetable
rename unicharset temp.unicharset

..\combine_tessdata.exe temp.
==========END OF BAT======================
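
If one of the earlier tools fails, the rename commands do not produce all of the prefixed files, and the final combine step will not yield a complete temp.traineddata. As a minimal sketch, the Python helper below lists which of the five renamed files from the bat file are missing in a folder. The function name and the idea of running such a check are assumptions added for illustration, not part of the original procedure.

```python
from pathlib import Path
from typing import List

# Suffixes of the files that the bat file renames with the "temp." prefix.
COMPONENT_SUFFIXES = ("normproto", "inttemp", "pffmtable", "shapetable", "unicharset")


def missing_components(lang: str, directory: str = ".") -> List[str]:
    """Return the '<lang>.<suffix>' file names that are not present in `directory`."""
    base = Path(directory)
    return [
        f"{lang}.{suffix}"
        for suffix in COMPONENT_SUFFIXES
        if not (base / f"{lang}.{suffix}").is_file()
    ]


if __name__ == "__main__":
    missing = missing_components("temp")
    if missing:
        print("Missing before combine_tessdata:", ", ".join(missing))
    else:
        print("All component files are present.")
```

Running the script in the training folder after the rename section makes it easier to tell which step went wrong.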
