I work on document analysis. My long-term goal is to provide support for sophisticated electronic document manipulation tools for indexing, browsing, linking, etc.
My primary interest is in discovering logical structure in arbitrary electronic documents. The goal is to take an electronic document representation as input and return a hierarchy of logical pieces of the document as output. For example, given a scanned-in or PostScript version of a technical report, I would like to be able to divide it into sections, paragraphs, etc. Similarly, in a business letter, the address headings, body, and closing should be identifiable.
This problem has two primary components: segmentation (dividing the document into logical pieces) and classification (categorizing the pieces). It also raises the questions of evaluation (previous work differs in descriptions of the correct hierarchy), types of logical structures, and theoretical limitations.
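As a purely illustrative sketch of how the two components fit together (this is not LABLER's method), the Python below segments plain text at blank lines, classifies each block with a toy rule, and assembles a small hierarchy. The names Node, segment, classify, build_hierarchy, and outline, the blank-line segmentation, and the "short block without end punctuation is a heading" rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One logical piece of a document: a label plus its text and children."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def segment(lines: List[str]) -> List[str]:
    """Segmentation: split the input into blocks at blank lines."""
    blocks, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            blocks.append(" ".join(current))
            current = []
    if current:
        blocks.append(" ".join(current))
    return blocks


def classify(block: str) -> str:
    """Classification: a toy rule, assumed only for this example.

    A short block with no sentence-ending punctuation is called a heading;
    everything else is called a paragraph.
    """
    is_short = len(block.split()) <= 8
    ends_sentence = block.endswith((".", "?", "!", ":"))
    return "heading" if is_short and not ends_sentence else "paragraph"


def build_hierarchy(lines: List[str]) -> Node:
    """Combine both steps into a document -> section -> paragraph tree."""
    root = Node("document")
    section = None
    for block in segment(lines):
        if classify(block) == "heading":
            section = Node("section", block)
            root.children.append(section)
        else:
            # Paragraphs before any heading attach directly to the document.
            parent = section if section is not None else root
            parent.children.append(Node("paragraph", block))
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as indented lines, for inspection."""
    suffix = ": " + node.text if node.text else ""
    rows = ["  " * depth + node.label + suffix]
    for child in node.children:
        rows.extend(outline(child, depth + 1))
    return rows
```

Calling outline(build_hierarchy(lines)) on a list of text lines gives an indented view of the recovered tree. Real input such as OCR output would call for layout cues like position and font size rather than blank lines alone, and the evaluation question above (what counts as the correct hierarchy) applies to even this small example.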
The task is relevant to two of Bruce Croft's top 10 research issues for information retrieval (in the November 1995 issue of D-Lib Magazine): number 5, "interfaces and browsing," and number 3, "efficient, flexible, indexing and retrieval." Determining logical structure enables flexible, hierarchical browsing; doing so in a general way supports system flexibility and handling of multiple document types.
As my thesis project, I have implemented a system called LABLER (LAyout-Based Logical Entity Recognizer), which takes as input the (slightly cleaned) results of OCR and finds a logical structure hierarchy for the given document.
Using Non-Textual Cues for Electronic Document Browsing
Co-authored with Daniela Rus.
In Digital Libraries: Current Issues,
Nabil R. Adam, Bharat K. Bhargava, and Yelena Yesha, editors.
Chapter 9, pp. 129 - 162. Lecture Notes in Computer Science series.
Springer-Verlag, 1995.
Toward a Taxonomy of Logical Document Structures
Electronic Publishing and the Information Superhighway:
Proceedings of the Dartmouth Institute for Advanced Graduate Studies,
pp. 124 - 133, Boston, May 1995.
Donald B. Johnson Memorial DAGS Scholar
award for the best student paper, co-recipient.
Near-Wordless Document Structure Classification
Proceedings of the International Conference on Document Analysis
and Recognition, pp. 426 - 456, Montréal, August 1995.