In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR '95), pp. 462-465, Montréal, August 1995.
Abstract
Automatic derivation of logical document structure from generic layout
would enable a range of electronic document manipulation
tools of a kind that is becoming crucial to users who wish
to browse the Internet.
This problem can be divided into segmentation (dividing the text
into a hierarchy of pieces) and classification (categorizing
these pieces as particular logical structures).
This paper proposes an approach to classifying
logical document structures according to their
distance from prototypes that are primarily geometric. The
prototypes make minimal use of linguistic information,
which reduces both reliance on the accuracy of OCR and
language dependence. Different classes of logical
structures are presented, along with the differences in the
information required to classify them.
A prototype format is proposed,
existing prototypes and a distance measurement are described, and
performance results are provided.
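
As a rough illustration of classification by distance from geometric prototypes, the following minimal sketch assigns a text block to the class of its nearest prototype. It is not the prototype format or distance measurement proposed in this paper: the feature set (indent, width, font ratio, space above), the prototype values, the class names, and the unweighted Euclidean distance are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import astuple, dataclass
from math import sqrt


@dataclass(frozen=True)
class Block:
    """Geometric features of one text block (hypothetical feature set)."""
    indent: float       # left indent as a fraction of page width
    width: float        # block width as a fraction of page width
    font_ratio: float   # font height relative to body text
    space_above: float  # blank space above, in body-text line heights


# Hypothetical prototypes, one per logical class; the values are illustrative only.
PROTOTYPES = {
    "heading": Block(indent=0.0, width=0.4, font_ratio=1.5, space_above=2.0),
    "paragraph": Block(indent=0.05, width=1.0, font_ratio=1.0, space_above=1.0),
    "list_item": Block(indent=0.1, width=0.85, font_ratio=1.0, space_above=0.5),
}


def distance(a: Block, b: Block) -> float:
    """Unweighted Euclidean distance between two feature vectors (an assumption)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(astuple(a), astuple(b))))


def classify(block: Block) -> str:
    """Return the name of the prototype closest to the given block."""
    return min(PROTOTYPES, key=lambda name: distance(block, PROTOTYPES[name]))


if __name__ == "__main__":
    # A short, large-font block with generous space above it.
    print(classify(Block(indent=0.0, width=0.35, font_ratio=1.6, space_above=2.0)))
```

Because every feature here is geometric, the sketch needs no recognized text at all, which is the property that keeps such a classifier largely independent of OCR accuracy and of the document's language.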