CharacTell's Unique Approach to Handwriting Recognition

The recognition kernels of all the leading ICR technologies rely on feature extraction from the image of a character. Generally, hundreds or thousands of features are extracted from each character. These features are then analyzed using a variety of methods, such as neural networks. All these methods need a large learning set: tens of thousands of samples to 'learn' each character.
Until now, the recognition of lowercase letters in the form-processing market was considered almost unsolvable. Human handwriting includes many letters that are ambiguous: some people write the letter "e" or "r" in exactly the same way that others write the letter "c" or "v" respectively. Only CharacTell's approach can recognize and distinguish such problematic characters.

Currently, handwriting recognition products achieve a recognition rate of about 85% for mixed uppercase and lowercase letters (non-cursive) and about 90% for uppercase letters only. This means that almost every second word requires human correction. Meanwhile, research has shown that people find handwriting recognition software useful only when it correctly recognizes over 97% of the characters.
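The step from an 85% character rate to "almost every second word" can be checked with a little arithmetic. The sketch below assumes that character errors are independent and that a typical word has four letters; both are simplifying assumptions made for illustration, not figures from CharacTell.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character of a word is recognized correctly.

    Assumes character errors are independent of one another (an
    illustrative simplification).
    """
    return char_accuracy ** word_length


# Word length 4 is an assumed typical value, chosen for illustration.
for rate in (0.85, 0.90, 0.97):
    print(f"{rate:.0%} per character -> {word_accuracy(rate, 4):.0%} per 4-letter word")
```

Under these assumptions, an 85% character rate leaves only about 52% of four-letter words fully correct, while a 97% character rate raises that to about 89%.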

CharacTell's JustICR uses a new approach, which can be compared to the way DNA works in biology. JustICR first creates a 'string' from the character's image. This string can be regarded as a "DNA chain". The DNA chain is made of sub-strings that can be called "genes".

JustICR then tries to identify the character. The recognition process is analogous to identifying a baby's father from DNA chains. Each gene is matched against a database of genes that were produced from the learning set. Each gene is given a certain weight for each possible recognition result. Each DNA chain contains at most 28 genes.
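The matching step described above can be pictured as a weighted vote. The following is a minimal sketch only: the gene strings, the weights, the `GENE_WEIGHTS` table and the rule of summing weights are all hypothetical, since the actual JustICR gene format and scoring rule are not published here. For simplicity the sketch also ignores where a gene sits in the chain.

```python
from collections import defaultdict

# Hypothetical gene database built from a learning set:
# gene -> {candidate character: weight}. Values are invented for illustration.
GENE_WEIGHTS = {
    "ab": {"e": 0.7, "c": 0.3},
    "bc": {"e": 0.6, "c": 0.4},
    "cd": {"c": 0.9, "e": 0.1},
}

MAX_GENES = 28  # the text states that a chain holds at most 28 genes


def recognize(chain):
    """Return the candidate character with the highest summed weight.

    Summing the weights is an assumed scoring rule. Returns None when no
    gene in the chain is known to the database.
    """
    if len(chain) > MAX_GENES:
        raise ValueError("a DNA chain holds at most 28 genes")
    scores = defaultdict(float)
    for gene in chain:
        # Unknown genes contribute nothing to any candidate.
        for char, weight in GENE_WEIGHTS.get(gene, {}).items():
            scores[char] += weight
    if not scores:
        return None
    return max(scores, key=scores.get)


print(recognize(["ab", "bc"]))  # both genes favor "e"
print(recognize(["cd"]))        # this gene favors "c"
```

A gene either appears in the chain or it does not, so the only inputs to the vote are which genes are present; no numeric feature values are involved.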

ICR experts tend to feel that nothing new is possible in the field of handwriting recognition. Can the character genes be represented as features? Well, no. The number of possible genes is huge, while the number of features is fixed by the algorithm. Each gene either exists or does not, while features are generally integers and not Booleans. The only information the genes carry is whether or not they exist, and their location in the DNA chain.

Now comes the interesting part. The number of samples required to teach JustICR a new handwriting style is extremely low. In fact, after teaching it a single sample of each character, the recognition rate is already reasonable.

CharacTell's new product, SoftWriting™, makes good use of JustICR's short learning curve. When SoftWriting recognizes an additional document, it uses the learning data from the previous document. Thanks to the short learning curve, SoftWriting achieves very high recognition quality even on the first document it recognizes. For the first document, the algorithm works as follows: after a small fraction of the document has been recognized, all the recognized words that appear in the dictionary are used as the training set for the rest. This method can improve the recognition rate from 50% per word to 90% per word on the first document submitted.
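The first-document procedure is a form of self-training: results that the dictionary confirms are fed back as training data. A minimal sketch of the selection step follows; the function name, the `(image_id, text)` pair format and the sample words are assumptions made for illustration, not SoftWriting's actual interfaces.

```python
def bootstrap_training_set(first_pass, dictionary):
    """Select first-pass results that can serve as training samples.

    first_pass: list of (word_image_id, recognized_text) pairs from the
    small fraction of the document recognized so far (hypothetical format).
    dictionary: set of known lowercase words.
    A word is trusted only if its recognized text is a dictionary word.
    """
    training = []
    for image_id, text in first_pass:
        if text.lower() in dictionary:
            training.append((image_id, text))
    return training


dictionary = {"the", "quick", "brown"}
first_pass = [("w1", "the"), ("w2", "qvick"), ("w3", "Brown")]

# "qvick" is not a dictionary word, so its image is not used for training.
print(bootstrap_training_set(first_pass, dictionary))
```

The selected word images, now labeled with trusted text, would then be used to teach the recognizer the writer's own letter shapes before the rest of the document is processed.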

SoftWriting uses several proprietary technologies in addition to JustICR's recognition capabilities. Unlike most other recognition engines, which turn the images into black and white at scan time, SoftWriting scans the images as gray/color bitmaps. SoftWriting includes a special algorithm that converts the gray/color images to black and white images. This algorithm is extremely important because scanning pads written on with blue pens generally produces poor-quality images that are difficult to recognize. After conversion, a proprietary algorithm that analyzes the lines, words and connected characters is applied. The recognition kernel also uses a dictionary in order to achieve the best results.
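SoftWriting's gray-to-black-and-white conversion is proprietary and is not described here. To show what such a step does in general, the sketch below uses Otsu's method, a well-known public thresholding technique, as a stand-in. It is not CharacTell's algorithm, and the pixel values are invented.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that best separates dark from light.

    Otsu's method: choose t to maximize the between-class variance of the
    two groups (pixels <= t and pixels > t).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each gray pixel to 0 (ink) or 1 (paper)."""
    return [0 if p <= threshold else 1 for p in pixels]


# Invented sample: three dark ink pixels and three light paper pixels.
gray = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(gray)
print(t, binarize(gray, t))
```

A single global threshold like this is the simplest possible approach. Faint ink, such as blue pen on a gray scan, is exactly the case where it tends to struggle, which suggests why a more specialized conversion would be worth developing.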