LABLER (LAyout-Based Logical Entity Recognizer) discovers logical structure hierarchies in documents. It takes lightly cleaned OCR output as input and performs a bottom-up analysis, relying primarily on layout-based geometric cues, to form a logical hierarchy.
The reliance on layout is inspired by the observation that even when the text of a document is illegible, much of its logical structure is readily apparent. Consider, for instance, a zoomed-out view of a document, in which the pages are seen as though from too great a distance to read the characters. Much of the structure is still clear, such as the divisions into paragraphs and sections.
LABLER relies on the following observations.
LABLER represents shapes at each level in the hierarchy in terms of structures at a lower level. It first seeks repeated shapes, identifying these as structures that belong together. Then it repeatedly seeks to merge structures around interruptions (such as a paragraph interrupted by an equation) to form higher hierarchy levels. When no more such merges are found, it groups together all repeated shapes at the current granularity level (in terms of the shapes found in the first step), then finds the top levels of the hierarchy based on vertical distance. As each level is found, its structures are classified: each structure is compared to prototypes for the known types and assigned the classification of the prototype to which it is most similar. Finally, the resulting hierarchy is coordinated to remove repetitions and provide a coherent whole.
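The classification step can be illustrated with a minimal sketch of nearest-prototype matching. This is not LABLER's actual code: the feature tuple (relative indent, line count, relative width), the prototype values, the type names, and the use of Euclidean distance as the similarity measure are all assumptions made for illustration.

```python
import math

# Hypothetical prototypes, one per known structure type. Each is an
# assumed feature tuple: (relative indent, line count, relative width).
PROTOTYPES = {
    "paragraph": (0.05, 6.0, 1.0),
    "heading": (0.0, 1.0, 0.4),
    "list_item": (0.10, 2.0, 0.9),
}


def classify(features, prototypes=PROTOTYPES):
    """Return the label of the prototype closest to `features`.

    Similarity is taken here to be inverse Euclidean distance, which is
    an assumption; any other distance function could be substituted.
    """
    return min(
        prototypes,
        key=lambda label: math.dist(features, prototypes[label]),
    )


# A short, narrow, unindented block lands nearest the heading prototype.
label = classify((0.0, 1.0, 0.5))
```

In practice the features would likely need scaling so that no single dimension (here, line count) dominates the distance.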
If you're interested in additional details, you can contact me at summers@cs.cornell.edu, or read my thesis.