Basic OmniPage OCR Technologies
each character can be infinitely tuned and re-tuned as new fonts or new
problems come up.
If there is a problem with “c”s and “e”s, those two experts are tuned
further until that one problem is resolved. To recognize a foreign
language that has an “ä” as well as an “a,” another expert is added to
identify the new character. This expert approach to recognition is what
allows OmniPage to recognize more languages (13) than any other OCR
package, and it is what gives OmniPage its remarkably low rate of
substitution errors.
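To make the expert idea concrete, the sketch below models a pool of
per-character experts, where supporting a new character simply means
adding one more expert. All of the names here (Glyph, CharacterExpert,
ExpertPool, the confidence threshold) are hypothetical illustrations; the
actual AnyFont internals are not published.

    # Hypothetical sketch of per-character experts; none of these names
    # come from OmniPage itself.
    from dataclasses import dataclass
    from typing import Callable, Optional

    Glyph = list[list[int]]  # a small bitmap of one character

    @dataclass
    class CharacterExpert:
        """One independently tunable recognizer for a single character."""
        char: str
        score: Callable[[Glyph], float]  # confidence from 0.0 to 1.0

    class ExpertPool:
        def __init__(self) -> None:
            self.experts: list[CharacterExpert] = []

        def add_expert(self, expert: CharacterExpert) -> None:
            # Supporting a new character such as "ä" means adding one
            # more expert; the existing experts are left untouched.
            self.experts.append(expert)

        def classify(self, glyph: Glyph, threshold: float = 0.8) -> Optional[str]:
            # Ask every expert and accept the most confident answer, but
            # only if it clears the threshold; otherwise report failure.
            best_char, best_score = None, 0.0
            for expert in self.experts:
                s = expert.score(glyph)
                if s > best_score:
                    best_char, best_score = expert.char, s
            return best_char if best_score >= threshold else None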
The inherent accuracy of the algorithm has always been the most
important design criterion at Caere. Experts provide that accuracy;
however, they have two downsides. One is that they are remarkably
difficult and time-intensive to program. Such an approach would be
incredibly accurate for Kanji, but programming the 5,000 character
experts needed would take several hundred man-years! Machine-learned
database probability pools or neural nets are the most practical
approaches for such a language.
Self-Learning OCR
The other downside of experts is that they are very compute-intensive,
and therefore somewhat slow. One of Caere’s pending patents covers an
accelerating, self-learning routine that allows each unique character
shape to be recognized only once. From then on, the system identifies it
as another “a” or “b” without having to reanalyze it with the experts
each time. This accelerator technique makes OmniPage actually speed up
as it reads a document. This technology, operating in true 32-bit mode,
makes AnyFont the fastest omnifont OCR algorithm in the world, with
speeds of up to 4,000 words per minute attainable on faster PCs.
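A hedged sketch of how such a self-learning accelerator could work is
shown below: the verdict for each unique glyph bitmap is cached, so
repeated occurrences of the same shape are answered by a lookup instead
of a full reanalysis. The class name and the exact-bitmap cache key are
assumptions for illustration only, not Caere’s patented routine.

    # Illustrative cache-based accelerator; not Caere's patented routine.
    from typing import Optional

    class SelfLearningRecognizer:
        def __init__(self, expert_pool: "ExpertPool") -> None:
            self.expert_pool = expert_pool   # e.g. the pool sketched earlier
            self.cache: dict[bytes, Optional[str]] = {}

        def recognize(self, glyph: list[list[int]]) -> Optional[str]:
            # Key the cache on the exact bitmap; identical shapes later in
            # the document are answered without re-running the experts.
            key = bytes(bit for row in glyph for bit in row)
            if key in self.cache:
                return self.cache[key]
            result = self.expert_pool.classify(glyph)
            self.cache[key] = result         # learn this shape from now on
            return result

Because the cache fills as the page is read, recognition in this sketch
naturally gets faster toward the end of a document, which matches the
speed-up behavior described above.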
Sometimes none of the experts can identify a character. This can happen
with broken or overlapping characters, and it is solved by AnyFont’s
second pass, which can be seen on screen as the light blue areas of the
document image are painted a darker blue. The characters, or pieces of
characters, that the experts cannot identify are placed in a separate
buffer to be dealt with later. A series of very sophisticated routines
then comes into play for splitting, combining, fragment analysis,
fattening, thinning, and context checking. The quality and sophistication
of these second-pass routines provide greater recognition accuracy, even
for very difficult problem characters and character fragments. A third
pass allows the Language Analyst to refine accuracy further.
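The overall flow of the three passes can be summarized in a rough
sketch. The repair steps and the lexicon check below are trivial
placeholders standing in for the splitting, combining, fragment-analysis,
fattening, thinning, and Language Analyst routines the text describes;
they are not OmniPage’s actual algorithms.

    # Structural sketch of the three-pass flow; the repair and lexicon
    # functions are placeholders, not OmniPage's actual routines.
    def fatten(glyph):
        return glyph  # a real routine would thicken broken strokes

    def thin(glyph):
        return glyph  # a real routine would thin merged strokes

    def recognize_page(glyphs, recognizer, lexicon):
        # Pass 1: the experts (with the self-learning cache) take what
        # they can; unidentified pieces go into a separate buffer.
        text, deferred = [], []
        for glyph in glyphs:
            char = recognizer.recognize(glyph)
            if char is not None:
                text.append(char)
            else:
                deferred.append(glyph)

        # Pass 2: apply repair strategies to the deferred pieces and
        # re-run the experts on the repaired shapes (character ordering
        # is ignored in this sketch).
        for glyph in deferred:
            for repair in (fatten, thin):
                char = recognizer.recognize(repair(glyph))
                if char is not None:
                    text.append(char)
                    break

        # Pass 3: a stand-in for the Language Analyst; flag text the
        # lexicon does not confirm so context checking can refine it.
        word = "".join(text)
        return word if word in lexicon else "?" + word + "?"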