OCR Software - Optical Character Recognition or Optical Crud Recognition?
by James Eglin
Is it really possible to get high OCR accuracy from poor quality documents?
Optical Character Recognition (OCR) refers to the software technologies and
processes that translate printed text into computer-searchable text.
Done correctly, OCR enables users to search for and retrieve individual words
contained within a file or page. In addition, when a set of files is indexed,
users are able to search for keywords across an entire document library and
retrieve each page with exact precision. OCR enables users to execute searches
in seconds, searches that once could take several hours or days to complete.
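As a rough illustration of how indexed OCR text makes library-wide keyword search fast, here is a minimal sketch of an inverted index. The file names, page texts, and the function names build_index and search are hypothetical and chosen only for this example; a production document system would add stemming, phrase search, and ranking.

```python
from collections import defaultdict
import re

def build_index(pages):
    """Map each lowercase word to the set of (document, page) pairs containing it."""
    index = defaultdict(set)
    for (doc, page_no), text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add((doc, page_no))
    return index

def search(index, keyword):
    """Return the sorted (document, page) locations where the keyword occurs."""
    return sorted(index.get(keyword.lower(), set()))

# Hypothetical OCR output keyed by (file name, page number).
pages = {
    ("deed.pdf", 1): "Warranty deed recorded in the county archive",
    ("deed.pdf", 2): "Signed before a notary",
    ("lease.pdf", 1): "Lease recorded by the county clerk",
}
index = build_index(pages)
print(search(index, "county"))  # [('deed.pdf', 1), ('lease.pdf', 1)]
```

The index is built once, so each later lookup is a single dictionary access rather than a scan of every page.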
Historically, however, this technology did not work well on older or poor
quality documents that contained mixed fonts or combinations of text and
graphics. Until now.
Due to several recent technology advances, it is now possible to obtain
six-sigma level character accuracy from these types of document collections.
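To put that figure in perspective, six sigma is conventionally quoted as 3.4 defects per million opportunities. The sketch below treats each character as one opportunity, which is an assumption made for illustration, as are the page and character counts; the article itself does not state how the accuracy is measured.

```python
# Conventional six-sigma figure: defects per million opportunities.
SIX_SIGMA_DPMO = 3.4

def expected_errors(char_count, errors_per_million=SIX_SIGMA_DPMO):
    """Expected number of misrecognized characters in char_count characters."""
    return char_count * errors_per_million / 1_000_000

def accuracy_percent(errors_per_million=SIX_SIGMA_DPMO):
    """Character accuracy, in percent, implied by an error rate per million."""
    return 100.0 * (1 - errors_per_million / 1_000_000)

# Assumed collection: 500 pages of roughly 2,000 characters each.
total_chars = 500 * 2_000
print(f"{accuracy_percent():.5f}% accuracy")
print(f"{expected_errors(total_chars):.1f} expected errors in {total_chars:,} characters")
```

Under these assumptions, a 500-page collection would contain only three or four wrong characters in total.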
Although it is important to keep in mind that the quality and condition of the
paper documents are still key factors in the successful OCR conversion,
dramatically improved results can be obtained by enhancing the quality of the
scanned image prior to processing.
Removal of borders and speckles, along with correction of skew, is now common
on the more advanced document scanners.
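Speckle removal can be pictured with a very small sketch. The version below assumes the page has already been reduced to a binary grid (1 for ink, 0 for background) and simply clears any ink pixel that has no ink neighbor. It illustrates the idea only and is not how any particular scanner implements it.

```python
def despeckle(image):
    """Return a copy of a binary image with isolated ink pixels cleared."""
    rows, cols = len(image), len(image[0])
    cleaned = [row[:] for row in image]  # leave the input untouched
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1:
                continue
            # Look at the (up to) eight surrounding pixels, clipped at the edges.
            has_neighbor = any(
                image[nr][nc] == 1
                for nr in range(max(r - 1, 0), min(r + 2, rows))
                for nc in range(max(c - 1, 0), min(c + 2, cols))
                if (nr, nc) != (r, c)
            )
            if not has_neighbor:
                cleaned[r][c] = 0
    return cleaned

# Hypothetical 4x5 scan: a small stroke plus one stray speck at bottom right.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
for row in despeckle(page):
    print(row)
```

The connected stroke survives because each of its pixels touches another ink pixel, while the lone speck is dropped.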
Furthermore, advanced color filter technologies may be used to reduce any page
background colors, in conjunction with multi-light image capture technologies to
remove any shadows cast by page creases that could impact image quality or
recognition accuracy.
Once document scanning and image processing are complete, an OCR text layer can
be added and hidden behind each image. An additional orientation filter can be
used to ensure that the best image is presented to the OCR engines.
To achieve the highest conversion accuracy possible, the characters in the image
can be processed using multi-engine OCR voting technologies that rank each
character to determine the best text recognition fit. Then, once a word is
generated, it is filtered through a proprietary lexicon to ensure the highest
quality results.
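The voting and lexicon steps can be sketched in a few lines. This is a simplified illustration, not the proprietary method described above: it assumes every engine returns a reading of the same length that is already aligned character by character, uses a plain majority vote instead of a ranking, and stands in for the lexicon with a short hypothetical word list matched through the standard library's difflib.

```python
from collections import Counter
from difflib import get_close_matches

def vote_characters(readings):
    """Pick the most common character at each position across engine outputs."""
    if len({len(r) for r in readings}) != 1:
        raise ValueError("this sketch assumes equal-length, aligned readings")
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*readings)
    )

def lexicon_filter(word, lexicon, cutoff=0.8):
    """Return the word if known, else the closest lexicon entry, else the word."""
    if word in lexicon:
        return word
    matches = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Hypothetical outputs from three OCR engines for the same word image.
readings = ["rec0gnition", "recognitlon", "recognition"]
print(vote_characters(readings))  # recognition

# Hypothetical lexicon; a real one would hold many thousands of entries.
lexicon = ["document", "scanner", "image"]
print(lexicon_filter("docunent", lexicon))  # document
```

Each engine makes a different mistake here, so the vote recovers the correct word, and the lexicon step then catches errors that all engines share.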
Finally, this text can be processed utilizing sophisticated layout retention
technologies to represent the image text layout, to provide the best possible
text representation for precise search and retrieval. After all, isn't that why
they call it Optical Character Recognition?
About the Author
EVP, Global Sales and Marketing
http://www.DigitalDocumentsLLC.com - The Leading Provider of Document
Scanning and Imaging Services