Mac text recognition (Tesseract-OCR for Mac)

0 Introduction

Tesseract is an open source OCR engine that can recognize more than 100 languages (Chinese, English, Korean, Japanese, German, French, etc.), but its ability to recognize handwriting is poor.

1 Installation

// Install Tesseract together with the training tools
brew install --with-training-tools tesseract

// Install Tesseract together with all languages. The language packs are large and take a long time to install, so it is better to skip this and download languages as needed.
brew install --all-languages tesseract

// Install Tesseract together with the training tools and all languages
brew install --all-languages --with-training-tools tesseract

// Install Tesseract only, without the training tools
brew install tesseract

Reference document: http://khalsa.guru/posts/16

2 Download the language library

Download: https://github.com/tesseract-ocr/tessdata

Choose the language library you need. Here we use Simplified Chinese, so the library to download is chi_sim.traineddata.
Copy the file to the /usr/local/Cellar/tesseract/3.04.01_2/share/tessdata directory.
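The copy step can be sketched in Python as follows. This is only an illustration: the function name install_language is hypothetical, and because the tessdata path depends on the installed Tesseract version, it is passed in as an argument.

```python
import shutil
from pathlib import Path


def install_language(traineddata, tessdata_dir):
    """Copy a .traineddata file into tessdata_dir and return the new path."""
    src = Path(traineddata)
    if src.suffix != ".traineddata":
        raise ValueError(f"not a traineddata file: {src.name}")
    dest_dir = Path(tessdata_dir)
    if not dest_dir.is_dir():
        raise FileNotFoundError(f"tessdata directory not found: {dest_dir}")
    # shutil.copy2 copies into the directory and returns the copied file's path.
    return Path(shutil.copy2(str(src), str(dest_dir)))
```

For the setup in this article the call would be install_language("chi_sim.traineddata", "/usr/local/Cellar/tesseract/3.04.01_2/share/tessdata").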

The libname-to-language table is as follows:

libname language
afr Afrikaans
amh Amharic
ara Arabic
asm Assamese
aze Azerbaijani
aze_cyrl Azerbaijani – Cyrillic
bel Belarusian
ben Bengali
bod Tibetan
bos Bosnian
bul Bulgarian
cat Catalan; Valencian
ceb Cebuano
ces Czech
chi_sim Chinese – Simplified
chi_tra Chinese – Traditional
chr Cherokee
cym Welsh
dan Danish
dan_frak Danish – Fraktur
deu German
deu_frak German – Fraktur
dzo Dzongkha
ell Greek, Modern (1453-)
eng English
enm English, Middle (1100-1500)
epo Esperanto
equ Math / equation detection module
est Estonian
eus Basque
fas Persian
fin Finnish
fra French
frk Frankish
frm French, Middle (ca. 1400-1600)
gle Irish
glg Galician
grc Greek, Ancient (to 1453)
guj Gujarati
hat Haitian; Haitian Creole
heb Hebrew
hin Hindi
hrv Croatian
hun Hungarian
iku Inuktitut
ind Indonesian
isl Icelandic
ita Italian
ita_old Italian – Old
jav Javanese
jpn Japanese
kan Kannada
kat Georgian
kat_old Georgian – Old
kaz Kazakh
khm Central Khmer
kir Kirghiz; Kyrgyz
kor Korean
kur Kurdish
lao Lao
lat Latin
lav Latvian
lit Lithuanian
mal Malayalam
mar Marathi
mkd Macedonian
mlt Maltese
msa Malay
mya Burmese
nep Nepali
nld Dutch; Flemish
nor Norwegian
ori Oriya
osd Orientation and script detection module
pan Panjabi; Punjabi
pol Polish
por Portuguese
pus Pushto; Pashto
ron Romanian; Moldavian; Moldovan
rus Russian
san Sanskrit
sin Sinhala; Sinhalese
slk Slovak
slk_frak Slovak – Fraktur
slv Slovenian
spa Spanish; Castilian
spa_old Spanish; Castilian – Old
sqi Albanian
srp Serbian
srp_latn Serbian – Latin
swa Swahili
swe Swedish
syr Syriac
tam Tamil
tel Telugu
tgk Tajik
tgl Tagalog
tha Thai
tir Tigrinya
tur Turkish
uig Uighur; Uyghur
ukr Ukrainian
urd Urdu
uzb Uzbek
uzb_cyrl Uzbek – Cyrillic
vie Vietnamese
yid Yiddish

3 Using Tesseract

Enter this command in the terminal: tesseract --help

Usage:
  tesseract --help | --help-psm | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  -psm NUM              Specify page segmentation mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.

Single options:
  -h, --help            Show this help message.
  --help-psm            Show page segmentation modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters to stdout.

General use:

// Uses the default eng language library. imgName is the path to the image, and result is the name of the file that receives the recognition result.
tesseract imgName result

Specifying a language:

// Use Simplified Chinese
tesseract -l chi_sim imgName result

// List the language libraries that exist locally
tesseract --list-langs
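To check from a script whether a language such as chi_sim is installed, the output of tesseract --list-langs can be parsed. The sketch below assumes the output is a header line ending in a colon followed by one language name per line; the function name parse_list_langs and the sample text are illustrative only.

```python
def parse_list_langs(output):
    """Return the language names from `tesseract --list-langs` output.

    Assumes a header line ending in ':' followed by one language per line.
    """
    langs = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.endswith(":"):
            continue  # skip blank lines and the header line
        langs.append(line)
    return langs


# Illustrative sample, not captured from a real run.
sample = "List of available languages (2):\nchi_sim\neng\n"
print("chi_sim" in parse_list_langs(sample))
```

When feeding it real output, it is safest to capture both stdout and stderr, since the stream the list is printed to may differ between Tesseract versions.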

Specifying multiple languages:

// Specify several languages, joined with +
tesseract -l chi_sim+eng imgName result
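The command forms above can be assembled programmatically. The following is a minimal sketch, assuming the 3.04-style -l and -psm options used in this article; build_ocr_command is a hypothetical helper, not part of Tesseract.

```python
def build_ocr_command(image, outputbase, langs=("eng",), psm=None):
    """Build the argument list for one tesseract run (3.04-style options)."""
    if not langs:
        raise ValueError("at least one language is required")
    # Several languages are joined with '+', as in chi_sim+eng.
    cmd = ["tesseract", str(image), str(outputbase), "-l", "+".join(langs)]
    if psm is not None:
        if not 0 <= psm <= 10:
            raise ValueError("psm must be between 0 and 10")
        cmd += ["-psm", str(psm)]
    return cmd


print(build_ocr_command("imgName", "result", ["chi_sim", "eng"]))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with image paths that contain spaces.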

One parameter deserves special attention: PSM (page segmentation mode).

// Enter this command to view the PSM values
tesseract --help-psm

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.

It is important to choose the PSM value to suit the situation. An inappropriate value can cause recognition to fail.
Example:

1234.png

Commands used:

// Without setting the PSM value
Command: tesseract 1234.png 1234 -l chi_sim
Output:
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Info in fopenReadFromMemory: work-around: writing to a temp file
Empty page!!
Empty page!!

// With the PSM value set
Command: tesseract 1234.png 1234 -l chi_sim -psm 6
Recognition succeeds: 12341234
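Since the right PSM value depends on the image, one option is to try several values in turn and keep the first one that returns any text. The sketch below is an assumption-laden illustration: first_working_psm and run_tesseract are hypothetical helpers, the candidate list is arbitrary, and stdout is used as the output base so the text can be captured directly.

```python
import subprocess


def run_tesseract(image, lang, psm):
    """Run tesseract with output on stdout and return the recognized text."""
    cmd = ["tesseract", image, "stdout", "-l", lang, "-psm", str(psm)]
    done = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL, universal_newlines=True)
    return done.stdout.strip()


def first_working_psm(image, lang, psm_values=(3, 6, 7, 8, 10), run=run_tesseract):
    """Return (psm, text) for the first PSM value that yields any text."""
    for psm in psm_values:
        text = run(image, lang, psm)
        if text:
            return psm, text
    return None, ""
```

Non-empty output only means the page was not reported as empty. The text still has to be checked against the image.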

4 Language training

Prepare in advance:
1. Training tools (installed together with Tesseract by running brew install --with-training-tools tesseract).
2. jTessBoxEditor.
3. Training material.

The material prepared here is as follows:

hui.png
yi.png

Execute the commands:

tesseract hui.png -l chi_sim -psm 10
Recognition result: Look

tesseract yi.png -l chi_sim -psm 10
Recognition result: = Yi

Clearly, the chi_sim library does not recognize the two characters hui and yi well. To recognize them, we have to train Tesseract on these two characters.

1 Merge the material (combine multiple images into one)
Open jTessBoxEditor, choose Tools -> Merge TIFF from the menu bar, select the images to merge, and save the result as huiyi.tif.

2 Generate the box file

tesseract huiyi.tif huiyi -l chi_sim -psm 10 batch.nochop makebox

Running this command generates the box file huiyi.box.

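Open it with a text editor or Xcode. Each line holds the recognized character followed by its left, bottom, right, and top coordinates and the page number:

Look 31 37 112 119 0
= 51 86 93 106 1

Replace the misrecognized character at the start of each line with the correct one (the hui character on the first line, the yi character on the second), leaving the numbers unchanged:

hui 31 37 112 119 0
yi 51 86 93 106 1

Save the file.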
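For more than a handful of characters, the same correction can be scripted. The sketch below assumes the usual box line layout of a symbol followed by five numbers (left, bottom, right, top, page); relabel_box_lines is a hypothetical helper and the coordinates in the example are made up.

```python
def relabel_box_lines(lines, symbols):
    """Replace the symbol of each box line, keeping the coordinates."""
    lines = [ln for ln in lines if ln.strip()]
    if len(lines) != len(symbols):
        raise ValueError("need exactly one symbol per box line")
    fixed = []
    for line, symbol in zip(lines, symbols):
        # Split from the right: the last five fields are always numbers.
        _, left, bottom, right, top, page = line.rsplit(None, 5)
        fixed.append(" ".join([symbol, left, bottom, right, top, page]))
    return fixed


# Made-up coordinates, for illustration only.
print(relabel_box_lines(["X 10 20 50 60 0", "= 5 8 40 30 1"], ["hui", "yi"]))
```

In real use the new symbols would be the actual characters, and the result would be written back to huiyi.box in UTF-8.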

3 Generate the .tr file

tesseract huiyi.tif huiyi -psm 10 nobatch box.train

4 Generate the unicharset file

unicharset_extractor huiyi.box

Note that unicharset_extractor is one of the commands bundled with the training tools. If the command cannot be found when you run it, the training tools are not installed.
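Before starting a training run, it can help to check that every training command used in this article is on the PATH. This is a small sketch; missing_training_tools is a hypothetical helper, and the tool list simply mirrors the commands used in the steps of this section.

```python
import shutil

# Training commands used in this walkthrough.
TRAINING_TOOLS = ("unicharset_extractor", "shapeclustering", "mftraining",
                  "cntraining", "combine_tessdata")


def missing_training_tools(which=shutil.which):
    """Return the training commands that cannot be found on PATH."""
    return [tool for tool in TRAINING_TOOLS if which(tool) is None]


print(missing_training_tools())
```

An empty list means all of the commands were found. Anything else suggests Tesseract was installed without the training tools.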

5 Create the font_properties file
This is the font properties file. Tesseract-OCR 3.01 and later require a font_properties file to be created before training. Each line of the file has the following format:

fontname italic bold fixed serif fraktur
// font name, italic, bold, fixed width, serif, Fraktur (blackletter)

Every field other than the font name is a boolean, 0 or 1.

The content of font_properties here is:

font 0 0 0 0 0

Execute the command:

echo 'font 0 0 0 0 0' > font_properties
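The line format can also be produced from named flags, which makes the meaning of each 0 or 1 harder to mix up. A minimal sketch, with font_properties_line as a hypothetical helper:

```python
def font_properties_line(fontname, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Format one font_properties entry: a name followed by five 0/1 flags."""
    if not fontname or any(ch.isspace() for ch in fontname):
        raise ValueError("font name must be non-empty and contain no whitespace")
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(bool(flag))) for flag in flags])


print(font_properties_line("font"))
```

Writing the returned string plus a newline to a file named font_properties gives the same result as the echo command above.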

5 Training

Execute the command:

shapeclustering -F font_properties -U unicharset huiyi.tr

This generates a shapetable file. Rename it to huiyi.shapetable.

Execute the command:

mftraining -F font_properties -U unicharset -O huiyi.unicharset huiyi.tr

This generates huiyi.unicharset, inttemp, and pffmtable. Rename inttemp and pffmtable to huiyi.inttemp and huiyi.pffmtable.

Execute the command:

cntraining huiyi.tr

This generates a normproto file. Rename it to huiyi.normproto.
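The renaming in this step can be done in one pass once all three training commands have run. The sketch below assumes the four output files sit in a single working directory; prefix_training_outputs is a hypothetical helper.

```python
from pathlib import Path

# Files produced by shapeclustering, mftraining, and cntraining.
TRAINING_OUTPUTS = ("shapetable", "inttemp", "pffmtable", "normproto")


def prefix_training_outputs(lang, workdir="."):
    """Rename each training output NAME to LANG.NAME and return the new paths."""
    workdir = Path(workdir)
    # Check everything first so a missing file does not leave a partial rename.
    missing = [n for n in TRAINING_OUTPUTS if not (workdir / n).exists()]
    if missing:
        raise FileNotFoundError("missing training outputs: " + ", ".join(missing))
    renamed = []
    for name in TRAINING_OUTPUTS:
        dest = workdir / f"{lang}.{name}"
        (workdir / name).rename(dest)
        renamed.append(dest)
    return renamed
```

For this article the call would be prefix_training_outputs("huiyi"), run in the directory that holds the training files.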

6 Get the traineddata file

Execute the command:

combine_tessdata huiyi.

// Output
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type  0 (huiyi.config) is -1
Offset for type  1 (huiyi.unicharset) is 140
Offset for type  2 (huiyi.unicharambigs) is -1
Offset for type  3 (huiyi.inttemp) is 406
Offset for type  4 (huiyi.pffmtable) is 118222
Offset for type  5 (huiyi.normproto) is 118282
Offset for type  6 (huiyi.punc-dawg) is -1
Offset for type  7 (huiyi.word-dawg) is -1
Offset for type  8 (huiyi.number-dawg) is -1
Offset for type  9 (huiyi.freq-dawg) is -1
Offset for type 10 (huiyi.fixed-length-dawgs) is -1
Offset for type 11 (huiyi.cube-unicharset) is -1
Offset for type 12 (huiyi.cube-word-dawg) is -1
Offset for type 13 (huiyi.shapetable) is 118708
Offset for type 14 (huiyi.bigram-dawg) is -1
Offset for type 15 (huiyi.unambig-dawg) is -1
Offset for type 16 (huiyi.params-model) is -1
Output huiyi.traineddata created successfully.
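In this output an offset of -1 marks a component that was not packed into the traineddata file, so the lines can be scanned to confirm that the renamed training files were picked up. A small sketch, with included_components as a hypothetical helper and an abbreviated sample:

```python
import re

# Matches lines such as "Offset for type  1 (huiyi.unicharset) is 140".
OFFSET_LINE = re.compile(r"Offset for type\s+(\d+)\s+\((\S+)\) is (-?\d+)")


def included_components(output):
    """Return the component file names whose offset is not -1."""
    return [name for _type, name, offset in OFFSET_LINE.findall(output)
            if int(offset) != -1]


sample = (
    "Offset for type  0 (huiyi.config) is -1\n"
    "Offset for type  1 (huiyi.unicharset) is 140\n"
    "Offset for type  3 (huiyi.inttemp) is 406\n"
)
print(included_components(sample))
```

For the run shown above, the expected components are unicharset, inttemp, pffmtable, normproto, and shapetable. If one of them is missing from the list, the corresponding rename was probably skipped.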

Copy huiyi.traineddata to the /usr/local/Cellar/tesseract/3.04.01_2/share/tessdata/ directory.

Execute the command:

cp huiyi.traineddata /usr/local/Cellar/tesseract/3.04.01_2/share/tessdata/

7 Verify
Execute the commands:

tesseract hui.png -l huiyi -psm 10
Recognition result: Look

tesseract yi.png -l huiyi -psm 10
Recognition result: easy to Yi

Recognition succeeds.

Conclusion: I have not written anything for a long time. This article was written a while ago but never published, because I have been very busy over the past few years and had no time to write. In the new year, I wish you all a promising future. I have not yet had time to port Tesseract to iOS. My earlier trials there did not give good results, mainly because recognition was slow and I had no good algorithm for grayscale image processing.