CharacTell's Unique Approach to Handwriting Recognition
The recognition kernels of all the leading ICR technologies extract features from the image
of a character. Generally, hundreds or thousands of features are extracted from each
character. These features are then analyzed using various methods, such as neural networks.
All of these methods require a large learning set: tens of thousands of samples are needed
to 'learn' each character.
Until now, the recognition of lowercase letters in the form-processing market has been
considered almost unsolvable. Human handwriting includes many ambiguous letters: some people
write the letter "e" or "r" in exactly the same way that others write the letter "c" or "v",
respectively. Only CharacTell's approach can recognize and distinguish such problematic
characters.
Currently, handwriting recognition products achieve a recognition rate of about 85% for mixed
uppercase and lowercase letters (non-cursive) and about 90% for uppercase letters only. This
means that almost every second word requires human correction.
Meanwhile, research has shown that people find handwriting recognition software to be useful
only when it succeeds in
recognizing over 97% of the characters.
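
The link between per-character and per-word rates can be checked with a short calculation.
The sketch below assumes that character errors are independent and that a typical word has
five letters; neither assumption comes from the text above, and the function name is
illustrative only.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character in a word is recognized correctly.

    Assumes character errors are independent (a simplifying assumption).
    """
    return char_accuracy ** word_length


# Compare the character rates quoted in the text for a five-letter word.
for rate in (0.85, 0.90, 0.97):
    per_word = word_accuracy(rate, 5)
    print(f"{rate:.0%} per character -> {per_word:.0%} per five-letter word")
```

Under these assumptions, an 85% character rate leaves fewer than half of all five-letter
words fully correct, which is roughly in line with the "every second word" estimate, while a
97% character rate leaves most words untouched.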
CharacTell's JustICR uses a new approach, which can be compared to the world of DNA in
microbiology. JustICR first creates a 'string' from the character's image. This string can be
regarded as a "DNA chain". The DNA chain is made of sub-strings that can be called "genes".
JustICR then tries to identify the character. The recognition of a character is analogous
to the problem of identifying a baby's father from DNA chains. Each gene is matched against
a database of genes that were produced from the learning set. Each gene is given a certain
weight for each possible recognition result. Each DNA chain contains at most 28 genes.
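
The matching step described above can be pictured as weighted voting. The following is a
minimal sketch, not JustICR's actual algorithm: the gene names, the weights, the `GENE_DB`
table and the rule of summing weights per candidate character are all assumptions made for
illustration. Only the limit of 28 genes per chain comes from the text.

```python
from collections import defaultdict

MAX_GENES = 28  # the text states that a DNA chain holds at most 28 genes

# Hypothetical gene database built from a learning set:
# gene -> {candidate character: weight}
GENE_DB = {
    "g1": {"e": 0.9, "c": 0.4},
    "g2": {"e": 0.7, "c": 0.1},
    "g3": {"c": 0.8},
}


def recognize(chain, gene_db=GENE_DB):
    """Return the best-scoring character for a chain of genes, or None.

    Combining weights by summation is an assumption; the source does not
    say how the per-gene weights are combined.
    """
    if len(chain) > MAX_GENES:
        raise ValueError("a DNA chain holds at most 28 genes")
    scores = defaultdict(float)
    for gene in chain:
        # Genes missing from the database contribute nothing.
        for char, weight in gene_db.get(gene, {}).items():
            scores[char] += weight
    return max(scores, key=scores.get) if scores else None


print(recognize(["g1", "g2"]))  # both genes favor "e"
print(recognize(["g1", "g3"]))  # "g3" tips the balance toward "c"
```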
ICR experts may feel that nothing new is possible in the field of handwriting recognition.
Can the character genes be represented as features? Well, no. The number of possible genes is
huge, while the number of features is fixed by the algorithm. Each gene either exists or does
not, while features are generally integers and not Booleans. The only information the genes
carry is whether or not they exist, and their location in the DNA chain.
Now comes the interesting part. The number of samples required to teach JustICR a new
handwriting style is extremely low. In fact, after it has been taught one sample of each
character, it already achieves a reasonable recognition rate.
CharacTell's new product, SoftWriting, makes good use of JustICR's short learning curve. When
SoftWriting tries to recognize an additional document, it uses the learning data from the
previous document. Thanks to the short learning curve, SoftWriting achieves very high
recognition quality from the first document it recognizes. For the first document, the
algorithm works as follows: after a small fraction of the document has been recognized, all
the recognized words that appear in the dictionary are used as the training set for the
remaining words. This method can improve the recognition rate from 50% per word to 90% per
word from the first document submitted.
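
The first-document bootstrapping step can be sketched as follows. This is only an outline
of the idea under stated assumptions: the data layout (a word as a list of character images
plus its recognized text), the function name `bootstrap_training_set` and the placeholder
image identifiers are hypothetical and not taken from SoftWriting.

```python
def bootstrap_training_set(recognized_words, dictionary):
    """Collect (character image, label) pairs from trusted words.

    A word is trusted when its recognized text appears in the dictionary.
    recognized_words is a list of (character_images, recognized_text) pairs.
    """
    training = []
    for char_images, text in recognized_words:
        # Skip words whose segmentation does not line up with the text.
        if text.lower() in dictionary and len(char_images) == len(text):
            training.extend(zip(char_images, text))
    return training


dictionary = {"the", "cat"}

# First pass over a small fraction of the document; the strings stand in
# for character images.
first_pass = [
    (["img_t", "img_h", "img_e"], "the"),   # in the dictionary: trusted
    (["img_c", "img_a", "img_t2"], "cat"),  # in the dictionary: trusted
    (["img_x", "img_y"], "qz"),             # not a word: left out
]

samples = bootstrap_training_set(first_pass, dictionary)
print(len(samples), "labeled character samples collected")
```

The collected samples would then be fed back to the recognizer as learning data for the
writer's own handwriting before the rest of the document is processed.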
SoftWriting uses several proprietary technologies in addition to JustICR's recognition
capabilities. Unlike most other recognition engines, which turn images into black and white
at the outset, SoftWriting scans images as gray or color bitmaps. SoftWriting includes a
special algorithm that converts the gray/color images to black and white images. This
algorithm is extremely important because scanning pads written on with blue pens generally
produces images of poor quality that are difficult to recognize. After conversion, a
proprietary algorithm that analyzes the lines, words and connected characters is applied. The
recognition kernel also uses a dictionary in order to achieve the best results.
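
SoftWriting's conversion algorithm is proprietary and is not described here. To illustrate
what converting a gray image to black and white involves, the sketch below uses Otsu's
method, a standard global thresholding technique that is swapped in purely as an example.
The pixel values are made up, and the function names are illustrative.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that best separates dark from light.

    Pixels <= t form the dark class. The threshold maximizes the
    between-class variance (Otsu's method).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map ink (dark, <= threshold) to 1 and paper (light) to 0."""
    return [1 if p <= threshold else 0 for p in pixels]


# Made-up gray levels: four dark ink pixels and four light paper pixels.
gray = [30, 40, 35, 200, 210, 220, 205, 45]
t = otsu_threshold(gray)
print(t, binarize(gray, t))
```

A single global threshold like this tends to struggle with faint ink, which is one reason
a purpose-built conversion step matters for pages written with a blue pen.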