Tesseract is an open source OCR engine that can recognize more than 100 languages (Chinese, English, Korean, Japanese, German, French, etc.), but its ability to recognize handwriting is poor.
// Install Tesseract together with the training tools
brew install --with-training-tools tesseract
// Install Tesseract together with all languages. The language packs are fairly large; if this takes too long, skip it and install languages as needed
brew install --all-languages tesseract
// Install Tesseract with both all languages and the training tools
brew install --all-languages --with-training-tools tesseract
// Install only Tesseract, without the training tools
brew install tesseract
Reference document: http://khalsa.guru/posts/16
2 Download the language library
Choose the language library you need. Here we use Simplified Chinese, so the library is: chi_sim.traineddata
Copy the file to the /usr/local/Cellar/tesseract/3.04.01_2/share/tessdata directory.
The library names (libname) and their languages are as follows:
|aze_cyrl|Azerbaijani – Cyrillic|
|cat|Catalan; Valencian|
|chi_sim|Chinese – Simplified|
|chi_tra|Chinese – Traditional|
|dan_frak|Danish – Fraktur|
|deu_frak|German – Fraktur|
|ell|Greek, Modern (1453-)|
|enm|English, Middle (1100-1500)|
|equ|Math / equation detection module|
|frm|French, Middle (ca. 1400-1600)|
|grc|Greek, Ancient (to 1453)|
|hat|Haitian; Haitian Creole|
|ita_old|Italian – Old|
|kat_old|Georgian – Old|
|khm|Central Khmer|
|kir|Kirghiz; Kyrgyz|
|mal|Malayalam|
|msa|Malay|
|nld|Dutch; Flemish|
|osd|Orientation and script detection module|
|pan|Panjabi; Punjabi|
|pus|Pushto; Pashto|
|ron|Romanian; Moldavian; Moldovan|
|sin|Sinhala; Sinhalese|
|slk_frak|Slovak – Fraktur|
|spa|Spanish; Castilian|
|spa_old|Spanish; Castilian – Old|
|srp_latn|Serbian – Latin|
|tir|Tigrinya|
|uig|Uighur; Uyghur|
|uzb_cyrl|Uzbek – Cyrillic|
Enter this command in the terminal: tesseract --help
Usage:
  tesseract --help | --help-psm | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  -psm NUM              Specify page segmentation mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.

Single options:
  -h, --help            Show this help message.
  --help-psm            Show page segmentation modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters to stdout.
// Use the default eng library. imgName is the path of the image, result is the file that receives the recognition result
tesseract imgName result
// Specify Simplified Chinese
tesseract -l chi_sim imgName result
// List the language libraries that exist locally
tesseract --list-langs
Specify multiple languages:
// Specify multiple languages, joined with +
tesseract -l chi_sim+eng imgName result
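When the same calls are scripted, the language list and the optional page segmentation mode can be assembled programmatically. The following is a minimal Python sketch, not part of the original workflow: the helper name `build_tesseract_args` is hypothetical, and it uses the `-l` and `-psm` spellings from the 3.04 help text (newer Tesseract releases may spell the PSM flag differently).

```python
def build_tesseract_args(image, output_base, langs=("eng",), psm=None):
    """Build the argument list for one tesseract call.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    if not langs:
        raise ValueError("at least one language is required")
    # Multiple languages are joined with '+', e.g. chi_sim+eng.
    args = ["tesseract", image, output_base, "-l", "+".join(langs)]
    if psm is not None:
        # The 3.04 help text lists page segmentation modes 0 through 10.
        if not 0 <= psm <= 10:
            raise ValueError("psm must be between 0 and 10")
        args += ["-psm", str(psm)]
    return args
```

The returned list can be passed to `subprocess.run` as is; for example, `build_tesseract_args("imgName", "result", ["chi_sim", "eng"])` corresponds to the multi-language command shown here.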
One parameter needs special attention: -psm.
// Enter this command to view the PSM values
tesseract --help-psm
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
It is important to choose the PSM value to suit the image; an unsuitable value causes recognition to fail.
// Without setting the PSM value
tesseract 1234.png 1234 -l chi_sim
// The command prints:
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Info in fopenReadFromMemory: work-around: writing to a temp file
Empty page!!
Empty page!!
// With the PSM value set to 6
tesseract 1234.png 1234 -l chi_sim -psm 6
// Recognition succeeds: 12341234
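The trial and error above (default mode first, then an explicit PSM) could be automated roughly as follows. This is a sketch under assumptions: the function name `ocr_with_psm_fallback` and the candidate order `(None, 6, 7, 10)` are illustrative choices, not something the original procedure prescribes. It writes the result to `stdout`, which the usage text lists as an alternative to an output base name.

```python
import subprocess


def ocr_with_psm_fallback(image, lang, psm_candidates=(None, 6, 7, 10),
                          run=subprocess.run):
    """Return (psm, text) for the first candidate that yields non-empty text.

    None means "do not pass -psm" (the default mode). The candidate list is
    an assumption for illustration; pick values that match your images.
    """
    for psm in psm_candidates:
        args = ["tesseract", image, "stdout", "-l", lang]
        if psm is not None:
            args += ["-psm", str(psm)]  # flag spelling used by Tesseract 3.04
        proc = run(args, capture_output=True, text=True)
        text = proc.stdout.strip()
        if text:
            return psm, text
    # Every candidate produced an empty page.
    return None, ""
```

The `run` parameter exists only so the function can be exercised without Tesseract installed; in normal use it is left at its default.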
4 Language training
Prepare in advance:
1. Training tools (installed together with Tesseract by the command brew install --with-training-tools tesseract)
2. jTessBoxEditor
3. Training material
The material prepared here is as follows:
tesseract hui.png -l chi_sim -psm 10
Recognition result: Look
tesseract yi.png -l chi_sim -psm 10
Recognition result: = Yi
Clearly, the chi_sim library does not recognize the two characters hui and yi well. To recognize them, we have to train Tesseract on these two characters.
1 Merge the material (combine multiple images into one)
Open jTessBoxEditor, choose Tools -> Merge TIFF from the menu bar, select the images to merge, and save the result as huiyi.tif.
2 Generate the box file
// Command
tesseract huiyi.tif huiyi -l chi_sim -psm 10 batch.nochop makebox
Running the command generates a box file, huiyi.box.
Open it with a text editor or Xcode:
Look 31 37 112 119 0
= 51 86 93 106 1
All 31 37 112 119 0
to 51 86 93 106 1
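Each line of a box file holds a character followed by the left, bottom, right, and top pixel coordinates of its bounding box and the page number. Correcting the file means replacing the character while leaving the numbers alone. The Python sketch below shows one way to do that; the names `Box`, `parse_box_line`, `format_box_line`, and `relabel` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one '<char> <left> <bottom> <right> <top> <page>' line."""
    # Split from the right so the character field is taken as whatever remains.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def format_box_line(box):
    """Turn a Box back into the text form used in the box file."""
    return "{} {} {} {} {} {}".format(*box)


def relabel(lines, corrected_chars):
    """Replace the character on each box line, in order, keeping coordinates."""
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    if len(boxes) != len(corrected_chars):
        raise ValueError("need exactly one corrected character per box line")
    return [format_box_line(box._replace(char=char))
            for box, char in zip(boxes, corrected_chars)]
```

For two-character material like the one used here, `relabel` would be called with the two lines of the generated file and the two correct characters, and the result written back to the same box file.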
3 Generate the .tr file
// Command
tesseract huiyi.tif huiyi -psm 10 nobatch box.train
4 Generate the unicharset file
// Command
unicharset_extractor huiyi.box
Note that unicharset_extractor is part of the training tools. If the command cannot be found when you run it, the training tools are not installed.
5 Create the font_properties file
This is the font properties file. For Tesseract-OCR 3.01 and above, a font_properties file must be created before training. The format of the file is as follows:
fontname italic bold fixed serif fraktur
// font name, italic, bold, fixed width, serif, Fraktur (blackletter)
Apart from the font name, every value is a boolean: 0 or 1.
The content of font_properties here is:
echo 'font 0 0 0 0 0' > font_properties
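If several fonts are involved, writing these lines by hand invites mistakes in the flag order. A small Python sketch of a generator follows; the helper name `font_properties_line` is hypothetical, and the field order is the one given in the format description (font name, then italic, bold, fixed, serif, fraktur).

```python
def font_properties_line(fontname, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Return one font_properties line: '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'."""
    if not fontname or any(ch.isspace() for ch in fontname):
        raise ValueError("font name must be non-empty and contain no whitespace")
    flags = (italic, bold, fixed, serif, fraktur)
    # Each flag is written as 0 or 1.
    return " ".join([fontname] + [str(int(bool(flag))) for flag in flags])


def write_font_properties(path, lines):
    """Write the given lines to a font_properties file, one font per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Calling `font_properties_line("font")` yields the same line that the echo command writes.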
shapeclustering -F font_properties -U unicharset huiyi.tr
This generates the shapetable file; rename it to huiyi.shapetable.
mftraining -F font_properties -U unicharset -O huiyi.unicharset huiyi.tr
This generates huiyi.unicharset, inttemp, and pffmtable; rename inttemp and pffmtable to huiyi.inttemp and huiyi.pffmtable.
cntraining huiyi.tr
This generates the normproto file; rename it to huiyi.normproto.
6 Generate the traineddata file
combine_tessdata huiyi.
// The command prints:
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 (huiyi.config) is -1
Offset for type 1 (huiyi.unicharset) is 140
Offset for type 2 (huiyi.unicharambigs) is -1
Offset for type 3 (huiyi.inttemp) is 406
Offset for type 4 (huiyi.pffmtable) is 118222
Offset for type 5 (huiyi.normproto) is 118282
Offset for type 6 (huiyi.punc-dawg) is -1
Offset for type 7 (huiyi.word-dawg) is -1
Offset for type 8 (huiyi.number-dawg) is -1
Offset for type 9 (huiyi.freq-dawg) is -1
Offset for type 10 (huiyi.fixed-length-dawgs) is -1
Offset for type 11 (huiyi.cube-unicharset) is -1
Offset for type 12 (huiyi.cube-word-dawg) is -1
Offset for type 13 (huiyi.shapetable) is 118708
Offset for type 14 (huiyi.bigram-dawg) is -1
Offset for type 15 (huiyi.unambig-dawg) is -1
Offset for type 16 (huiyi.params-model) is -1
Output huiyi.traineddata created successfully.
Copy huiyi.traineddata to the /usr/local/Cellar/tesseract/3.04.01_2/share/tessdata/ directory:
cp huiyi.traineddata /usr/local/Cellar/tesseract/3.04.01_2/share/tessdata/
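For repeated training runs, the commands from step 3 onward can be chained in a script. The Python sketch below is only an outline of that sequence: the function names are hypothetical, it assumes the corrected box file and font_properties already exist in the working directory, and the `cntraining` call is an assumption based on the standard Tesseract 3 training tools (it is the tool that writes normproto).

```python
import os
import subprocess


def training_commands(lang, tif, font_properties="font_properties"):
    """Return the training commands, in order, as argument lists."""
    tr_file = lang + ".tr"
    return [
        ["tesseract", tif, lang, "-psm", "10", "nobatch", "box.train"],
        ["unicharset_extractor", lang + ".box"],
        ["shapeclustering", "-F", font_properties, "-U", "unicharset", tr_file],
        ["mftraining", "-F", font_properties, "-U", "unicharset",
         "-O", lang + ".unicharset", tr_file],
        ["cntraining", tr_file],  # assumption: produces normproto
    ]


# Files that the tools write without a language prefix.
GENERATED = ("shapetable", "inttemp", "pffmtable", "normproto")


def train(lang, tif, run=subprocess.check_call, rename=os.rename):
    """Run every training command, add the language prefix, then combine."""
    for command in training_commands(lang, tif):
        run(command)
    for name in GENERATED:
        rename(name, lang + "." + name)
    # The trailing dot is part of the argument, as in 'combine_tessdata huiyi.'.
    run(["combine_tessdata", lang + "."])
```

`train("huiyi", "huiyi.tif")` would then leave huiyi.traineddata in the working directory, ready to be copied into the tessdata directory. The `run` and `rename` parameters are there so the sequence can be checked without the tools installed.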
tesseract hui.png -l huiyi -psm 10
Recognition result: Look
tesseract yi.png -l huiyi -psm 10
Recognition result: easy to Yi
Conclusion: It has been a long time since I last wrote. This article was written a while ago but never published, because the past few years have been too busy to leave time for writing. In the new year, I wish you all a promising future. I have not yet had time to port Tesseract to iOS. Earlier trials there gave poor results, mainly because recognition was slow and I did not have a good algorithm for grayscale image processing.