Basic OmniPage OCR Technologies
Understanding OCR
challenged his development team to create an OCR system that could read
anything that may be in a typical office: even a magazine page with its
mixture of fonts, text, and graphics.
Page Analysis
An 8 1/2" by 11" document scanned at 300 dots per inch creates a 1.2 MB file.
AnyFont first puts that entire file into RAM and looks at the complete
page. From there, the algorithms look for and separate areas which may
contain graphics by using image-processing techniques such as texture
and density analysis and edge detection.
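
As a rough check on the size of a scanned page, the arithmetic can be sketched as follows. This assumes a bilevel scan at 1 bit per pixel with each row padded to a whole byte; the text does not state the bit depth or file format, and the function name bitmap_bytes is purely illustrative. Under these assumptions the raw bitmap comes to a little over one million bytes, the same order of magnitude as the figure quoted above. File headers and other overhead are not modeled.

```python
def bitmap_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Size in bytes of an uncompressed scan of a page.

    Assumption (not from the manual): rows are padded to whole bytes
    and there is no file header.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8  # round each row up to a byte
    return row_bytes * height_px


size = bitmap_bytes(8.5, 11, 300)
print(size, "bytes")  # 1052700 bytes
```

Raising bits_per_pixel to 8 for a grayscale scan multiplies the result by roughly eight, which is why holding the whole page in RAM was a notable design decision.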
Next, the algorithm looks for columns of text and determines their size
and shape. It finds the white lines between lines of text with no
preconceived rules regarding leading (the space between lines). This step
allows OmniPage to recognize drop caps, because the font size of each letter
is not predetermined. Finally, the spaces between the letters are found, and
within each variable-sized zone lies a character of somewhere between 6
and 72 points. This technique of zooming in from the entire page allows
OmniPage to recognize the widest variety of font sizes in the industry.
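
The idea of finding white gaps without assuming any fixed leading or character width can be illustrated with projection profiles, a common textbook technique. This is only a sketch under that assumption; the manual does not say that AnyFont works this way, and the names find_runs, text_lines, and character_boxes are hypothetical. The page is modeled as a list of rows of 0 (white) and 1 (ink).

```python
def find_runs(profile):
    """Return (start, end) pairs for each run of non-zero entries; end is exclusive."""
    runs, start = [], None
    for i, ink in enumerate(profile):
        if ink and start is None:
            start = i                      # a run of ink begins
        elif not ink and start is not None:
            runs.append((start, i))        # a white gap ends the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def text_lines(page):
    """Rows containing ink, grouped into lines separated by all-white rows."""
    return find_runs([sum(row) for row in page])


def character_boxes(page, top, bottom):
    """Within one text line, columns containing ink, split at all-white columns."""
    width = len(page[0])
    col_ink = [sum(page[y][x] for y in range(top, bottom)) for x in range(width)]
    return find_runs(col_ink)


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0],   # a tall line, two rows high
    [1, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],   # a shorter line, one row high
    [0, 0, 0, 0, 0, 0, 0],
]

for top, bottom in text_lines(page):
    print((top, bottom), character_boxes(page, top, bottom))
# (1, 3) [(0, 2), (3, 4)]
# (4, 5) [(1, 2), (4, 6)]
```

Because line heights and box widths are measured from the image rather than assumed, lines of different sizes on the same page are handled by the same code.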
Character Experts
Each character box is sent to a team of 100 expert systems with each
subsystem responsible for the identification of a single character. This is in
marked contrast to other OCR technologies which are based upon a
probability analysis of dots within a matrix field.
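
A team of single-character experts can be pictured with the toy sketch below. It assumes the experts are consulted one after another and that each gives a yes-or-no answer for its own character. The Expert class, the recognize function, and the "loops" and "stem" features are all hypothetical stand-ins; the manual does not describe what the real experts examine.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical description of a character box: a few named shape features.
Box = Mapping[str, object]


@dataclass
class Expert:
    char: str                        # the single character this expert knows
    is_mine: Callable[[Box], bool]   # True only if the expert is certain


def recognize(box: Box, experts: Sequence[Expert]) -> Optional[str]:
    """Offer the box to each expert in turn; the first certain expert wins."""
    for expert in experts:
        if expert.is_mine(box):
            return expert.char
    return None  # no expert claimed the character


experts = [
    Expert("o", lambda b: b["loops"] == 1 and not b["stem"]),
    Expert("l", lambda b: b["loops"] == 0 and b["stem"]),
    Expert("b", lambda b: b["loops"] == 1 and b["stem"]),
]

print(recognize({"loops": 1, "stem": True}, experts))   # b
print(recognize({"loops": 2, "stem": False}, experts))  # None
```

Each expert returns a definite answer rather than a score, so no ranking of competing candidates is involved.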
With the probability approach, the program is never certain whether the character
is an “a” or a “c”; it knows only the probability of each. If the
probability of it being an “a” is 89% and of being a “c” is 69%, then the “a”
is reported. However, had the probability been 69% for a “c” and only 58%
for an “a,” then the “c” would have been reported. In either case, it is a
guess and suffers from an inherent weakness: substitution errors. Products
based on this technology rely heavily on dictionaries and language-usage
probabilities to help determine which of their guesses belongs in a known
word. This doesn’t help much with proper nouns, new terminology, or
abbreviations.
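
The probability approach and its dictionary fallback can be sketched as follows. This is a toy model under stated assumptions: the per-character scores are treated as independent confidence values, candidate words are ranked by the product of those scores, and the names best_guess and read_word are hypothetical. It does not represent any particular product.

```python
from itertools import product


def best_guess(scores):
    """Report whichever character has the highest score."""
    return max(scores, key=scores.get)


def read_word(per_char_scores, dictionary):
    """Take the top guess per character; fall back to a known word if one fits."""
    raw = "".join(best_guess(s) for s in per_char_scores)
    if raw in dictionary:
        return raw
    candidates = []
    # Try every combination of candidate characters and keep the known words.
    for combo in product(*(s.items() for s in per_char_scores)):
        word = "".join(ch for ch, _ in combo)
        if word in dictionary:
            score = 1.0
            for _, p in combo:
                score *= p
            candidates.append((score, word))
    return max(candidates)[1] if candidates else raw


print(best_guess({"a": 0.89, "c": 0.69}))  # a
print(best_guess({"a": 0.58, "c": 0.69}))  # c

scores = [{"a": 0.60, "c": 0.50}, {"a": 0.90}, {"t": 0.90}]
print(read_word(scores, {"cat"}))  # cat (the dictionary corrects the guess)
print(read_word(scores, set()))    # aat (no dictionary entry to fall back on)
```

The last call shows the weakness described above: when the word is not in the dictionary, the substitution error passes through uncorrected.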
With AnyFont’s 100 experts, each expert is responsible for identifying a
character image. The first expert evaluates the image and decides if it is the
character for which it is responsible. If it is not certain, it passes the
character to the next expert. This continues until the character is
recognized. No probabilities or guesses are used. This individual expert
technology is theoretically the most accurate method possible of doing
character recognition. It does not depend upon the averaging of a
database and the variations or guesswork of probability pools. Each expert for