Handwritten Chinese character recognition using 300 lines of code

Originally published on my personal blog: Technology - Handwritten Chinese Character Recognition. Please credit the source when reposting.

This post describes how to implement handwritten Chinese character recognition with machine learning. The core algorithm is written in C++ in fewer than 300 lines; the source code is on GitHub: HandWriteRecognition.

Main ideas:

  • Extract the stroke features of each Chinese character and save them into a character library;
  • Capture the user's handwriting trajectory coordinates from a tablet or touchpad;
  • Preprocess the coordinates;
  • Compare the input against every character in the library with a KNN-style algorithm;
  • Sort the candidates by distance and output the results (the data structures assumed by the code below are sketched right after this list).
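The code excerpts in this post refer to Point, Stroke, and Character types. Below is a minimal sketch of what they might look like, inferred from how the code uses them; the exact definitions live in the GitHub repository, so treat any field not mentioned in the post as an assumption.

    #include <string>
    #include <vector>

    // A single point of a stroke. "diretion" (spelled as in the original code)
    // stores the angle of the line from the previous feature point, as computed
    // by the diretion() function shown later in the post.
    struct Point {
        float x = 0;
        float y = 0;
        double diretion = 0;
    };

    // One pen stroke: an ordered list of points.
    struct Stroke {
        std::vector<Point> points;
    };

    // One character, either from the library or from the user's handwriting.
    struct Character {
        std::string value;            // e.g. "丁"
        int strokeCount = 0;
        std::vector<Stroke> strokes;
    };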

Character Library

The character library here is built by post-processing the stroke data of the Tomoe handwriting dictionary into a model file.

The Tomoe data records, for each Chinese character, the start, end, and turning points of every stroke, which can be treated as feature points. For example, the entry below is the character “丁” (ding), written in two strokes: the first stroke has only a start point and an end point, while the second stroke also contains a turning point:

    <character>
      <utf8>丁</utf8>
      <strokes>
        <stroke>
          <point x="93" y="198"/>
          <point x="913" y="205"/>
        </stroke>
        <stroke>
          <point x="495" y="203"/>
          <point x="470" y="847"/>
          <point x="405" y="784"/>
        </stroke>
      </strokes>
    </character>

The drawbacks of the Tomoe data are obvious, though. First, its coordinates are based on a fixed 1000*1000 writing board, while in handwriting recognition we cannot know the size of the touch-screen or tablet region in advance, so the input data has to be normalized to the same panel size. Second, it uses XML, which carries a lot of redundant fields, so the file is large (8.8 MB), wastes space, and is very slow to load.
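For illustration, mapping raw input coordinates into the same 1000*1000 space could look like the sketch below; normalizeStrokes and the uniform-scaling strategy are assumptions for this post, not code from the repository.

    #include <algorithm>
    #include <vector>
    // (Point and Stroke as sketched above; only x and y are used here.)

    // Scale and translate the strokes so that their bounding box fits into the
    // 1000x1000 space used by the Tomoe data (uniform scale, no distortion).
    void normalizeStrokes(std::vector<Stroke> &strokes) {
        float minX = 1e9f, minY = 1e9f, maxX = -1e9f, maxY = -1e9f;
        for (const Stroke &s : strokes)
            for (const Point &p : s.points) {
                minX = std::min(minX, p.x); maxX = std::max(maxX, p.x);
                minY = std::min(minY, p.y); maxY = std::max(maxY, p.y);
            }
        const float span = std::max(maxX - minX, maxY - minY);
        if (span <= 0) return;                    // degenerate input, e.g. a single dot
        const float scale = 1000.0f / span;
        for (Stroke &s : strokes)
            for (Point &p : s.points) {
                p.x = (p.x - minX) * scale;
                p.y = (p.y - minY) * scale;
            }
    }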

The normalization problem is handled as part of the algorithm described below. As for the oversized file, a custom format is used instead of XML; taking “丁” as an example:

D: "(93198), (913205) [] (495203) (470847) (405784)"

This shrinks the original 8.8 MB file down to only about 1.5 MB, and it will be compressed further below to suit the algorithm.
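A sketch of how such a compact entry could be produced from the stroke data; the exact separators and quoting of the real format are not fully recoverable from this post, so the details here (including the "[]" stroke separator) are assumptions.

    #include <sstream>
    #include <string>
    // (Point, Stroke and Character as sketched above.)

    // Serialize one character into a single compact line, e.g.
    //   丁: "(93,198)(913,205)[](495,203)(470,847)(405,784)"
    std::string serializeCharacter(const Character &c) {
        std::ostringstream out;
        out << c.value << ": \"";
        for (size_t s = 0; s < c.strokes.size(); ++s) {
            if (s > 0) out << "[]";               // separator between strokes
            for (const Point &p : c.strokes[s].points)
                out << "(" << p.x << "," << p.y << ")";
        }
        out << "\"";
        return out.str();
    }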

Feature Points

The user's handwriting coordinates form a continuous and very dense sequence, so we must first extract feature points from it before comparing against the feature points in the library. Feature points are extracted using the point-to-line distance. Take “丁” as an example: the feature points of the second stroke are the start point, the end point, and the turning point, and the turning point is clearly the point farthest from the straight line through the start and end points. So, with a suitable threshold, the turning points can be found from the point-to-line distance, and adding the start and end points gives the feature points.

Since a stroke may contain more than one turning point, we find them recursively:

    #include <cmath>
    #include <iostream>
    #include <vector>

    // Recursively collect the turning points of a stroke between two indices,
    // using the (unnormalized) point-to-line distance and a fixed threshold.
    void turnPoints(Stroke *stroke, std::vector<Point> *points,
                    int pointIndex1, int pointIndex2) {
        if (pointIndex1 < 0 || pointIndex2 < 0 || pointIndex2 - pointIndex1 <= 1) return;
        // Line through the two end points: a*y - b*x + c == 0 (up to sign).
        const float a = stroke->points[pointIndex2].x - stroke->points[pointIndex1].x;
        const float b = stroke->points[pointIndex2].y - stroke->points[pointIndex1].y;
        const float c = stroke->points[pointIndex1].x * stroke->points[pointIndex2].y
                      - stroke->points[pointIndex2].x * stroke->points[pointIndex1].y;
        float max = 3000;              // threshold below which no turning point is reported
        int maxDistPointIndex = -1;
        for (int i = pointIndex1 + 1; i < pointIndex2; ++i) {
            Point point = stroke->points[i];
            const float dist = fabs(a * point.y - b * point.x + c);
            std::cout << dist << std::endl;   // debug output
            if (dist > max) {
                max = dist;
                maxDistPointIndex = i;
            }
        }
        if (maxDistPointIndex != -1) {
            // Recurse on both sides so every turning point is found, in order.
            turnPoints(stroke, points, pointIndex1, maxDistPointIndex);
            points->push_back(stroke->points[maxDistPointIndex]);
            turnPoints(stroke, points, maxDistPointIndex, pointIndex2);
        }
    }
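One way this might be called for a single stroke, keeping the start point, the detected turning points, and the end point as that stroke's feature points; extractFeaturePoints is an illustrative wrapper, not a function from the repository.

    // (Point, Stroke and turnPoints() as above.)
    std::vector<Point> extractFeaturePoints(Stroke *stroke) {
        std::vector<Point> featurePoints;
        if (stroke->points.empty()) return featurePoints;
        const int last = (int)stroke->points.size() - 1;
        featurePoints.push_back(stroke->points[0]);                   // start point
        turnPoints(stroke, &featurePoints, 0, last);                  // turning points, in order
        if (last > 0) featurePoints.push_back(stroke->points[last]);  // end point
        return featurePoints;
    }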

Algorithm

Frechet

The first approach tried here was to judge similarity with the Fréchet Distance.

The distance is computed as follows: for every point on curve A, find the shortest distance to curve B and keep the largest of these values; then do the same from curve B to curve A; the larger of the two results is taken as the similarity of the two curves.
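A minimal sketch of that computation over discrete sample points; Curve, pointDist, and directedDist are illustrative names for this post.

    #include <algorithm>
    #include <cmath>
    #include <vector>
    // (Point as sketched above.)

    using Curve = std::vector<Point>;

    static double pointDist(const Point &a, const Point &b) {
        return std::hypot(a.x - b.x, a.y - b.y);
    }

    // For every point of `from`, take the distance to the closest point of `to`,
    // and return the largest of these shortest distances.
    static double directedDist(const Curve &from, const Curve &to) {
        double worst = 0;
        for (const Point &p : from) {
            double best = 1e30;
            for (const Point &q : to) best = std::min(best, pointDist(p, q));
            worst = std::max(worst, best);
        }
        return worst;
    }

    // Symmetric curve distance as described above: the larger of the two directions.
    double curveDist(const Curve &a, const Curve &b) {
        return std::max(directedDist(a, b), directedDist(b, a));
    }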

In practice, however, this distance has obvious defects. First, treating a whole character as a single curve and comparing it with another character is not enough: some characters are very complex, such as the character for “ellipse”, with strokes that cross each other, so the computed distance has little reference value. Second, if we instead compare stroke by stroke, the region the user writes in and the region of the library character usually do not coincide, so the computed distance is also problematic. For example:

The word "one" word coordinates: (0, 50) -> (20, 50); handwritten character "one" coordinates: (0, 80) -> (20, 80), distance: 30 handwritten word: "|" coordinates (10, 40) -> (10, 60); distance: 14.

So after these experiments, the Fréchet-style distance was dropped for judging character similarity, and the comparison is done instead with a KNN algorithm over the angles of the lines between feature points.

KNN

First, we compute the similarity of a single stroke. For each feature point, we take the angle of the straight line from the previous feature point:

    #include <cmath>

    // Angle (in radians, scaled by 10) of the line from lastPoint to startPoint.
    double diretion(const Point &lastPoint, const Point &startPoint) {
        double result = -1;
        result = atan2(startPoint.y - lastPoint.y, startPoint.x - lastPoint.x) * 10;
        return result;
    }
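As a quick check, feeding the feature points of “丁” from the Tomoe entry above through diretion() gives values whose magnitudes match the streamlined library entry shown later in this post; the main() wrapper and the convention that a stroke's first point takes its angle from the previous stroke's end point are inferred, not stated in the repository.

    #include <iostream>
    // (Point and diretion() as above.)

    int main() {
        Point h1{93, 198},  h2{913, 205};               // first stroke of 丁
        Point v1{495, 203}, v2{470, 847}, v3{405, 784}; // second stroke of 丁
        std::cout << diretion(h1, h2) << std::endl;     // ~0.08, almost horizontal
        std::cout << diretion(h2, v1) << std::endl;     // ~-31.4, from the previous stroke's end point
        std::cout << diretion(v1, v2) << std::endl;     // ~16.1, downward
        std::cout << diretion(v2, v3) << std::endl;     // ~-23.7, the final hook
        return 0;
    }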

The number of feature points in the two strokes may not match. There are two ways to handle this, interpolation or sampling, and sampling is used here. In addition, for each stroke we also need the angle of the line from the previous stroke's end point to this stroke's start point, to avoid, for example, recognizing “丁” as “十” (ten). The calculation is as follows:

    #include <cmath>

    // Distance between two strokes: the average absolute difference of the
    // feature-point angles, plus the difference of the first-point angles
    // (which encodes this stroke's position relative to the previous stroke).
    double distBetweenStrokes(const Stroke &stroke1, const Stroke &stroke2) {
        double strokeDist = MAXFLOAT;
        double dist = 0.0f;
        int minLength = fmin(stroke1.points.size(), stroke2.points.size());
        Stroke largeStroke = stroke1.points.size() > minLength ? stroke1 : stroke2;
        Stroke smallStroke = stroke1.points.size() > minLength ? stroke2 : stroke1;
        for (int j = 1; j < minLength; ++j) {
            double diretion1 = largeStroke.points[j].diretion;
            double diretion2 = smallStroke.points[j].diretion;
            dist += fabs(diretion1 - diretion2);
        }
        // Difference in position between this stroke and the previous one.
        dist += fabs(largeStroke.points[0].diretion - smallStroke.points[0].diretion);
        strokeDist = dist / minLength;
        return strokeDist;
    }
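Note that distBetweenStrokes above simply walks the first minLength feature points of each stroke. The sampling mentioned earlier, which thins the longer stroke's feature points down to the shorter one's count, could look roughly like the sketch below; sampleStroke is an illustrative helper, not a function from the repository.

    // (Point and Stroke as sketched above.)
    // Pick `count` feature points from `stroke`, spread evenly along it,
    // always keeping the first and last point.
    Stroke sampleStroke(const Stroke &stroke, int count) {
        Stroke sampled;
        const int n = (int)stroke.points.size();
        if (count <= 0 || n == 0) return sampled;
        if (count >= n) return stroke;            // nothing to thin out
        if (count == 1) { sampled.points.push_back(stroke.points[0]); return sampled; }
        for (int i = 0; i < count; ++i) {
            int index = (int)((long long)i * (n - 1) / (count - 1));
            sampled.points.push_back(stroke.points[index]);
        }
        return sampled;
    }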

At this point we can compute how similar the user's handwriting is to a character in the library: summing the single-stroke distances gives the similarity of the whole character. Because of cursive writing (for example, two strokes joined into one), we allow the stroke counts to differ, as long as the difference is less than 2; when sorting, however, the closer the stroke counts, the higher the priority. The calculation is as follows:

    // Distance between the handwritten character and a library character.
    // MAX_DIFF_PER_STROKE is the per-stroke rejection threshold defined elsewhere in the project.
    double dist(const Character &character1, const Character &character2) {
        double dist = MAXFLOAT;
        if (character2.strokeCount >= character1.strokeCount &&
            character2.strokeCount < character1.strokeCount + 2) {
            double allStrokeDist = 0.0f;
            for (int i = 0; i < character1.strokeCount; ++i) {
                Stroke stroke1 = character1.strokes[i];
                Stroke stroke2 = character2.strokes[i];
                double strokeDist = distBetweenStrokes(stroke1, stroke2);
                allStrokeDist += strokeDist;
                if (strokeDist > MAX_DIFF_PER_STROKE) {
                    allStrokeDist = MAXFLOAT;
                    return allStrokeDist;
                }
            }
            // The closer the stroke counts, the higher the priority.
            return allStrokeDist / character1.strokeCount
                 + character2.strokeCount - character1.strokeCount;
        }
        return dist;
    }
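Recognition is then a nearest-neighbour search over the whole library: compute dist() from the user's input to every library character, sort by distance, and keep the best candidates. A sketch of that last step; recognize, Candidate, and the argument order passed to dist() are assumptions for this post.

    #include <algorithm>
    #include <string>
    #include <vector>
    // (Character, dist() and MAXFLOAT as above.)

    struct Candidate {
        std::string value;
        double distance;
    };

    // Rank the library characters by their distance to the handwritten input
    // and return the k closest ones.
    std::vector<Candidate> recognize(const Character &input,
                                     const std::vector<Character> &library,
                                     size_t k) {
        std::vector<Candidate> candidates;
        for (const Character &c : library) {
            double d = dist(input, c);            // handwriting first, library character second
            if (d < MAXFLOAT) candidates.push_back({c.value, d});
        }
        std::sort(candidates.begin(), candidates.end(),
                  [](const Candidate &a, const Candidate &b) { return a.distance < b.distance; });
        if (candidates.size() > k) candidates.resize(k);
        return candidates;
    }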

Note that only the angles between straight lines are used here, which do not depend on the actual coordinate size, so the character library can be streamlined even further:

    丁: [[0, 0.08][31.3, 16.09, 23.71]]

This reduces the size of the character library further still.

Effect

With the character library, the feature points, and the matching algorithm in place, the final effect of the handwriting recognition looks like this:

[Screenshots of the recognition results]

As long as the handwriting is not particularly sloppy, the correct character is usually the first candidate. However, the approach still depends on stroke order.

See also:

How to build a Pinyin input method within a week
iOS thread synchronization