The use of document structure analysis to retrieve information from documents in digital libraries

Debashish Niyogi and Sargur N. Srihari
Center of Excellence for Document Analysis and Recognition
State University of New York at Buffalo, Buffalo, NY 14228-2567
{niyogi,srihari}@cedar.buffalo.edu

ABSTRACT

This paper describes an approach to retrieving information from document images stored in a digital library
by means of knowledge-based layout analysis and logical structure derivation techniques. Queries on document
image content are categorized in terms of the type of information that is desired (e.g., articles on a given topic),
and are parsed to determine the type of document from which information is desired, the syntactic level of the
information desired, and the level of analysis required to extract the information. Using these clauses in the
query, a set of salient documents is retrieved, layout analysis and logical structure derivation are performed on
the retrieved documents (using DeLoS, a document logical structure derivation system developed at CEDAR),
and the documents are then analyzed in detail to extract the relevant logical components. A "document browser"
application, being developed based on this approach, allows a user to interactively specify queries on the documents
in the digital library using a graphical user interface, provides feedback about the candidate documents at
each stage of the retrieval process, and allows refinements of the query based on the intermediate results of the
search. Results of a query are displayed either as an image or as formatted text.

Keywords: Document image understanding, logical structure analysis, layout analysis, information retrieval,
digital libraries.

1. INTRODUCTION

In document image understanding, syntactic and semantic interpretations of the structure and contents of
document images can be obtained using layout analysis and logical structure analysis. This paper describes an
approach to retrieving information from document images stored in a digital library by means of knowledge-based
document layout analysis and logical structure derivation techniques.

Digital document libraries have become an increasingly important means of storing critical information within
organizations. The need for accurate and up-to-date information has led to the practice of storing important
information in digital documents that can be made accessible to a wide variety of potential users. This is
particularly useful when the information needs to be efficiently represented, updated, and instantly retrievable
on demand.

SPIE Vol. 3027 • 0277-786X/97/$10.00

A digital document library typically contains a set of document images that have been indexed according to
document type. Additional indexing on documents can be achieved by performing OCR on each document and
extracting textual information, but maintaining such a detailed index is often neither efficient nor desirable. One
alternative solution is to create a detailed index for a document only when required by a user query, so that only
relevant documents are indexed and retrieved. Also, typically a query on a document library yields complete
documents that satisfy the query criteria; it is often desirable to extract only relevant parts of a document, a
process that is much harder and requires more detailed knowledge of the document structure.

Such detailed knowledge of document structure can be obtained by performing layout analysis and logical
structure derivation on the document. Document layout analysis and logical structure derivation enable us to
determine the relationship between the physical layout of a document page (consisting of the geometric structure
and spatial relationships of the different blocks of printed matter) and its logical layout (consisting of the logical
groupings of related blocks into composite units). This can then be used to actually "translate" a physical
document into its logical symbolic representation, and thereby enable the extraction of information contained in
a specified part of a document.

Document layout analysis and logical structure derivation (together called document structure analysis) are
thus crucial steps in the development of digital document libraries. Extraction of the logical structure of a
document enables individual logical components to be identified and indexed separately in digital libraries, thus
making their access much faster and easier.

The following sections describe an approach that uses document structure analysis to identify and retrieve
information from documents in digital libraries.

2. BACKGROUND

Some ideas that we have developed with respect to the retrieval of data from raster images of documents stored
in a digital library have been previously described by Srihari et al.11 Related work in the analysis and retrieval
of information from document images has been ongoing at CEDAR for several years. For example, retrieval of
the identities of human faces in a photograph with the help of its caption has been described by R. Srihari.10
The determination of the logical structure of a document (i.e., labeling all identifiable blocks, grouping them into
logical units, and determining the reading order of the text blocks within a given unit), has been described by
Niyogi & Srihari (,58). Multi-domain document layout understanding has been described by Lam & Srihari.1 An
integrated approach to document decomposition and structural analysis has been described by Niyogi & Srihari.6
Techniques for analyzing printed forms, which are examples of complex documents, are described by Niyogi et
al.9 Each of these techniques is being used as part of our approach to analyze and recognize selected components
of a raster document image and present the information to a user.

3. DOCUMENT CONTENT RETRIEVAL

We have developed an approach by which document content is retrieved from a set of document images by
means of progressively refined levels of document structure analysis. Documents are indexed by performing layout
analysis (to extract the syntactic structure of the document) and then logical structure derivation (to extract the
semantic structure of the document). The above process yields a hierarchical index of the document structure.
This indexing is done at a level that identifies logical units, their spatial relationships, and their components.
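
To make the idea of a hierarchical index concrete, the following is a minimal sketch of one possible representation. The class names, field names, and the sample page are assumptions made for illustration; they are not taken from DeLoS.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """A printed block found by layout analysis (physical layout)."""
    label: str                       # e.g. "headline", "text", "photograph"
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class LogicalUnit:
    """A group of related blocks (logical layout), e.g. one story."""
    kind: str
    blocks: List[Block] = field(default_factory=list)


@dataclass
class PageIndex:
    """Hierarchical index: page -> logical units -> blocks."""
    doc_type: str
    units: List[LogicalUnit] = field(default_factory=list)

    def find_blocks(self, label: str) -> List[Block]:
        """Return every block on the page that carries the given label."""
        return [b for u in self.units for b in u.blocks if b.label == label]


# Hypothetical page with one story and one captioned photograph.
page = PageIndex("newspaper page", [
    LogicalUnit("story", [Block("headline", (40, 30, 600, 80)),
                          Block("text", (40, 90, 300, 700))]),
    LogicalUnit("figure", [Block("photograph", (320, 90, 600, 400)),
                           Block("caption", (320, 410, 600, 440))]),
])
captions = page.find_blocks("caption")
```

Keeping blocks nested under their logical units means that a query for a component (here, captions) can be answered from the index without touching the image again.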

4. LOGICAL STRUCTURE ANALYSIS

At CEDAR we have developed and implemented a knowledge-based system for document decomposition and
structural analysis. This system, called DeLoS ("DErivation of LOgical Structure"), takes as input the digitized
image of a document page and produces as output a symbolic description of the logical structure of the page,
including a labeling of each of the printed blocks on the page, a grouping of these blocks into logical units, and
the reading order of text blocks within each of these units. Various image processing operations are performed to
extract basic properties of the components of the image, and this data is analyzed under the control of a rule-based
system, which operates in conjunction with a global data structure to monitor the entire classification, grouping,
and block-ordering process. Some details of the design and development of this system have been described in
previously published papers (,75).

4.1. System Organization

Figure 1 shows the components of the DeLoS system. The unbroken arrows between the boxes indicate the
flow of data between the various components of the system, and the dotted arrows represent the control flow in
the system.

Figure 1: The DeLoS system.

The DeLoS system consists of a multi-level, rule-based reasoning system, an image processing sub-system,
and a partitioned global data structure. The rule-based system utilizes a top-down, backward-chaining structure.
An inference engine within the rule-based system makes deductions about the document using a hierarchical
knowledge base that contains rules describing all the identifiable characteristics of document images. The rules
are classified into three levels: knowledge rules, control rules, and strategy rules. The global data structure
facilitates the transfer of information between the image processing modules and the rule-based system. A
common data area stores all intermediate computation results and other control information.

The image-processing modules directly access the document image to extract various kinds of information
about the document. Intrinsic properties of the different printed blocks as well as the spatial relationships between
the different blocks constitute the information that is passed back to the control structure through the global
data structure. Some image-processing modules perform the basic tasks of binarization, connected-component
analysis, etc., so as to prepare the image for other modules (e.g., block segmentation) that can extract information
from it.

4.2. Rule-Based Control Structure

The rule-based system consists of three levels of rules. Using a hierarchical structure of three progressively
abstract levels of rules provides a large amount of flexibility in the inference mechanism, and allows a modular
formulation of the solution within the image analysis problem domain. The three levels into which the rules
in this system are classified are: Knowledge Rules (level 1), Control Rules (level 2), and Strategy Rules (level
3). The domain knowledge in the system is maintained in the Knowledge rules, and the control decisions in the
system are made by the Control and Strategy rules.

4.2.1. The Knowledge Base

The Knowledge rules comprise the knowledge base that contains all the domain knowledge for the system,
expressed in terms of first-order predicates. These rules define the general characteristics expected of the usual
components of a document image and the usual relationships between such components in the image. Thus, all
common characteristics of different types of document blocks (e.g., text blocks, photographs, etc.), as well as spa-
tial constraints commonly followed in document layout (e.g., the positioning of captions relative to photographs,
etc.), are encoded into the knowledge base. These knowledge rules in the knowledge base contain all the prop-
erties and spatial relations for different types of document blocks, and can be used for block classification, block
grouping, or text block ordering as and when required according to the control strategy. Knowledge rules can be
further categorized as unary rules, simple binary rules, and complex binary rules. The hierarchical structure of
the knowledge base mentioned above is thus created by these different categories of rules.
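
As an illustration of how unary and binary knowledge rules might be written as predicates, consider the sketch below. It assumes that a unary rule tests a property of one block and a binary rule tests a spatial relation between two blocks. The `Block` fields, the three-line limit, and the 30-pixel gap are hypothetical values chosen for the example, not the rules used in DeLoS.

```python
from typing import NamedTuple


class Block(NamedTuple):
    kind: str         # "text" or "photograph"
    left: int
    top: int
    right: int
    bottom: int
    n_lines: int = 0  # number of text lines (0 for non-text blocks)


def is_short_text(b: Block) -> bool:
    """Unary rule: a property of a single block."""
    return b.kind == "text" and 0 < b.n_lines <= 3


def is_directly_below(a: Block, b: Block, max_gap: int = 30) -> bool:
    """Binary rule: a starts at most max_gap pixels below b and overlaps it horizontally."""
    overlaps = a.left < b.right and b.left < a.right
    return overlaps and 0 <= a.top - b.bottom <= max_gap


def is_caption_of(a: Block, b: Block) -> bool:
    """Composite rule built from the unary and binary rules above."""
    return b.kind == "photograph" and is_short_text(a) and is_directly_below(a, b)


photo = Block("photograph", 320, 90, 600, 400)
caption = Block("text", 320, 410, 600, 440, n_lines=2)
body = Block("text", 40, 90, 300, 700, n_lines=45)
```

Here `is_caption_of(caption, photo)` holds, while the long body text fails the unary test and so is never considered a caption.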

4.2.2. The Inference Engine

The control structure for the rule-based system contains an inference engine which is also rule-based, and
contains two levels of rules: control rules and strategy rules. These rules regulate the analysis of the document
image, and decide when a consistent interpretation of the image has been obtained. The rules comprising the
inference engine are also formulated in terms of first-order predicates. The control structure determines the
order in which these rules are executed in order to test various conditions effectively. Control rules regulate the
invocation of the knowledge rules, based on appropriate data configurations or processing states. Control rules
can be further categorized as focus-of-attention rules and meta-rules. Strategy rules guide the search in a more
general way, i.e., they determine what control strategy is to be followed at any given time for analyzing the image.
This means that the strategy rules regulate the invocation of, and determine the execution order of, the control
rules. Strategy rules also decide on the stopping criteria for the system, i.e., whether a consistent interpretation
and grouping of the blocks in the document image has been achieved (as determined by the absence of incomplete
block or unit data in the global data structure, and by the completeness of the logical structure tree). Therefore,
there is a set of strategy rules for block classification, another set for block grouping, and yet another for text
block ordering.
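
The division of labour among the three rule levels can be sketched for the block classification phase as follows. This is a deliberately simplified forward loop and does not reproduce the backward-chaining inference described above. All function names and block features (`halftone`, `lines`) are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List, Optional

Block = Dict[str, Any]


# Level 1, knowledge rules: each maps one block's features to a label or None.
def rule_photograph(b: Block) -> Optional[str]:
    return "photograph" if b["halftone"] else None


def rule_caption(b: Block) -> Optional[str]:
    return "caption" if not b["halftone"] and b["lines"] <= 3 else None


def rule_text(b: Block) -> Optional[str]:
    return "text" if not b["halftone"] and b["lines"] > 3 else None


KNOWLEDGE_RULES: List[Callable[[Block], Optional[str]]] = [
    rule_photograph, rule_caption, rule_text]


# Level 2, control rule: choose the next block to work on (focus of attention).
def focus_of_attention(blocks: List[Block]) -> Optional[Block]:
    for b in blocks:
        if "label" not in b:
            return b
    return None


# Level 3, strategy rule: drive the phase and decide when to stop.
def classify_blocks(blocks: List[Block]) -> List[Block]:
    while True:
        focus = focus_of_attention(blocks)
        if focus is None:  # stopping criterion: no incomplete block data remains
            return blocks
        for rule in KNOWLEDGE_RULES:
            label = rule(focus)
            if label is not None:
                focus["label"] = label
                break
        else:
            focus["label"] = "unknown"  # no knowledge rule applied


blocks = [{"halftone": True, "lines": 0},
          {"halftone": False, "lines": 2},
          {"halftone": False, "lines": 40}]
labels = [b["label"] for b in classify_blocks(blocks)]
```

Analogous loops for block grouping and text block ordering would reuse the same structure with their own strategy rules and stopping tests.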

4.3. The Global Data Structure

The global data structure stores the physical structure and logical structure information for the document being
processed. It also facilitates the transfer of information between the rule-based system and the image processing
modules. A common data area stores all intermediate computation results and other control information, and
provides the framework for the construction of the trees representing the document structures. In this system,
the global data structure is divided into the domain data partition and the control data partition.
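
A minimal sketch of such a partitioned structure is given below, assuming that the domain partition holds block and unit data and the control partition holds processing state. The field and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DomainData:
    """Domain partition: physical and logical structure of the page."""
    blocks: Dict[int, Dict[str, Any]] = field(default_factory=dict)  # block id -> properties
    units: Dict[str, List[int]] = field(default_factory=dict)        # unit name -> block ids


@dataclass
class ControlData:
    """Control partition: state used to steer the analysis."""
    phase: str = "classification"
    pending: List[int] = field(default_factory=list)  # blocks still lacking a label


@dataclass
class GlobalData:
    domain: DomainData = field(default_factory=DomainData)
    control: ControlData = field(default_factory=ControlData)

    def post_block(self, block_id: int, **props: Any) -> None:
        """Called by an image-processing module to report a new block."""
        self.domain.blocks[block_id] = dict(props)
        self.control.pending.append(block_id)

    def label_block(self, block_id: int, label: str) -> None:
        """Called by the rule-based system once a block is classified."""
        self.domain.blocks[block_id]["label"] = label
        self.control.pending.remove(block_id)

    def is_complete(self) -> bool:
        return not self.control.pending


gds = GlobalData()
gds.post_block(1, bbox=(40, 30, 600, 80))
gds.label_block(1, "headline")
```

Both subsystems read and write through the same object, which is what lets the rule-based system react to results produced by the image-processing modules.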

5. QUERY PROCESSING

Queries on document image content are categorized in terms of the type of information that is desired (e.g.,
articles on a given topic, the citation list in a journal article, a photograph with a given caption, etc.). A specific
query is first parsed to determine the following clauses:

1. the type of document from which information is desired (e.g., journal page, newspaper page, form, etc.),
2. the syntactic level of the information desired (e.g., a complete article, a text block, a photograph/diagram,
a list, etc.),
3. the level of analysis that is required to extract the information (e.g., reading a title/headline, reading the
entire text of an article, extracting the contents of a form, etc.).

Using the first clause in the query (i.e., the type of document), the standard indexes for the various documents
in the document library are searched, and a set of salient documents is retrieved in the first stage. For example,
a query that refers to a journal article will result in the retrieval of only the journal document images from the
library.

Using the second clause of the query (i.e., the type of document subset to be retrieved), layout analysis and
logical structure derivation are performed on the retrieved set of documents. All documents that do not contain
the relevant component are then eliminated from the set (e.g., if the query refers to a photograph, then all
documents not containing photographs are eliminated).

Finally, using the third clause in the query (i.e., the level of analysis required), the documents in the retrieved
set are then analyzed in more detail to extract the text in the titles, captions, etc., and text understanding is
performed if necessary on the extracted text to generate a keyword index. Using this information, the query is
refined and the retrieved set of documents searched so as to retrieve the relevant logical unit(s).
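
The three-stage narrowing described above can be sketched as a chain of filters. In this sketch the results of layout analysis and text extraction are assumed to be precomputed and stored in a `components` dictionary, which stands in for the image analysis itself. The `Query` and `Document` types and the sample library are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Query:
    doc_type: str   # clause 1: type of document
    component: str  # clause 2: syntactic level of the desired information
    keyword: str    # used in stage 3 to match the extracted text


@dataclass
class Document:
    name: str
    doc_type: str
    components: Dict[str, List[str]]  # component label -> extracted text


def retrieve(query: Query, library: List[Document]) -> List[Tuple[str, str]]:
    # Stage 1: keep only documents of the requested type.
    stage1 = [d for d in library if d.doc_type == query.doc_type]
    # Stage 2: drop documents that lack the requested component.
    stage2 = [d for d in stage1 if query.component in d.components]
    # Stage 3: match the extracted text of that component against the keyword.
    hits = []
    for d in stage2:
        for text in d.components[query.component]:
            if query.keyword.lower() in text.lower():
                hits.append((d.name, text))
    return hits


library = [
    Document("j1", "journal page", {"title": ["Layout analysis of journals"]}),
    Document("j2", "journal page", {"title": ["Handwritten word recognition"]}),
    Document("n1", "newspaper page", {"caption": ["Layout of the new library"]}),
    Document("j3", "journal page", {"photograph": [""]}),
]
results = retrieve(Query("journal page", "title", "layout"), library)
```

Each stage works on a smaller set than the one before it, so the most expensive analysis is applied to the fewest documents.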

6. DIGITAL LIBRARY ARCHITECTURE

At CEDAR, we have integrated multiple information sources to build a digital library of research information
on document analysis and recognition. For our digital library, paper documents are scanned for conversion to
a form that enables text-based search. The digital library also includes project specific material such as slide
presentations, source code, data and video.

Taxila, the CEDAR Digital Library, collects the results of research and development on various aspects of
document analysis and recognition into a unified form for access by all CEDAR researchers. This digital library
contains information in several modes, including document images, text files, executable object code, images,
videos, and audio files. The information is organized so as to be retrievable through a World Wide Web interface
on a server that is accessible only within CEDAR.

6.1. Organization of Taxila

The CEDAR digital library consists of a research information repository, a set of information sources that
classify each piece of information for retrieval, and an information retrieval mechanism through which information
is searched for and presented in response to user queries.

The CEDAR digital library consists of the following information sources, described as modules:

• Image databases used by CEDAR researchers
• Interactive computer demonstrations of systems developed at CEDAR
• Directory archives of information about projects at CEDAR
• Scanned versions of papers written by CEDAR researchers
• Slide shows of presentations given by CEDAR staff members
• Source code developed for various projects at CEDAR
• Descriptions of techniques / methodologies developed at CEDAR
• Tools that operate on the CEDAR image databases
• Videos related to CEDAR research

Each of these modules represents an information source that contains data in multiple modalities. The individual
modules are described in more detail below.

6.1.1. Techniques

Innovative techniques and methodologies developed at CEDAR for the solution of various problems in doc-
ument analysis and recognition constitute a very important part of the digital library. Major areas of research
within CEDAR, such as Word Recognition, Japanese Character Recognition, etc., are maintained in this module
of the library. For each area, detailed information is maintained, including descriptions of techniques, demon-
strations, and reference sources. All of this information is accessible by topic/area. The Techniques module is
composed of several sub-modules, corresponding to the major categories into which CEDAR research can be
divided. Each sub-module is further divided into specific research topics. Some research techniques are indexed
under various sub-modules because of their applicability to different research areas.

6.1.2. Databases

Various document image databases used in ongoing research at CEDAR are maintained within the digital
library. These images are in many formats, including Sun Raster, TIFF, GIF, PostScript, HIPS,3 etc. The
databases included in the digital library are standard databases of images that have been generated for specific
projects at CEDAR. These include databases of handwritten and printed characters & words, images of journal
and newspaper pages, etc., and are indexed in terms of their type, source and resolution. The images that are
in formats supported by common WWW browsers (viz., TIFF, GIF, and PostScript) can be directly displayed;
some images are inlined within the browser window (e.g., GIF) while the browser generates a new window to
display others (e.g., PostScript). Images in other formats have to be displayed outside of the browser with the
appropriate tools (e.g., the "chips" utility to display/manipulate HIPS format images).
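
The display behaviour described above amounts to a lookup from image format to display mode, which can be sketched as follows. The table name and mode strings are hypothetical. The entry for TIFF is an assumption based on the statement in Section 6.3 that browsers normally allow GIF/TIFF files as inlined images.

```python
# Display mode per image format; anything not listed needs an external tool.
BROWSER_DISPLAY = {
    "GIF": "inline",
    "TIFF": "inline",            # assumption, see Section 6.3
    "PostScript": "new window",
}


def display_mode(fmt: str) -> str:
    """Return how an image of the given format is shown to the user."""
    return BROWSER_DISPLAY.get(fmt, "external tool")


modes = {fmt: display_mode(fmt) for fmt in ("GIF", "PostScript", "HIPS", "Sun Raster")}
```

With this table, HIPS and Sun Raster images fall through to the external-tool case.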

6.1.3. Demonstrations

Demonstrations of systems developed at CEDAR have been incorporated into the digital library. These demon-
strations illustrate various document analysis and character recognition techniques that have been developed by
CEDAR researchers. Demonstrations in the digital library can be static or dynamic. Static demonstrations show
pre-processed results from "canned" demonstrations; this is particularly useful when the individual steps involved
in the demonstration are time-intensive, as is the case with many complex image processing operations. Dynamic
demonstrations execute a sequence of programs based on inputs given by a user; such inputs are generated from
a form that is displayed to the user through an HTML page and filled in and "submitted" by the user to the server
which then executes the appropriate programs and displays the results to the user on another HTML page after
suitable format conversions.

6.1.4. Directory Archives

Directory archives of information sources for current and past projects at CEDAR are easily retrievable through
the digital library. The directory archives are indexed by topic, and each directory archive contains a variety of
files in different formats and modalities. For example, a given directory archive may contain a set of program
source codes, the corresponding executable object codes, some test images, a text description of the technique
used in each of the programs, etc.

6.1.5. Scanned Papers

Scanned versions of research papers are maintained in the CEDAR digital library. The papers are scanned
on an Apunix scanner at a resolution of 300 ppi, and the Sun raster files are then converted to GIF format. Users
have the capability to display a paper sequentially by page, or to access a specific page of the paper. Also, papers
can be printed out on demand for hardcopy circulation if needed. Scanned papers are indexed by topic, and more
sophisticated indexing by keywords and concepts is being developed. In addition, other information about each
paper is maintained, such as the bibliographic information, including the name of the publication in which the
paper appeared, list of authors, date of publication, page numbers, etc.

6.1.6. Slide Shows

Slide shows of presentations are maintained and catalogued in the digital library. Each slide show contains a
set of linked slides which constitute a complete presentation. Users of the digital library can either access the slide
shows in a sequential manner, or select specific slides from within each slide show and "compose" their own slide
show from the selected slides. The latter is particularly useful for researchers who give overview presentations of
ongoing research that includes descriptions of techniques developed by various project group members.

6.1.7. Source Code

Source code developed for various projects at CEDAR is an intrinsic part of the digital library. Our objective
has been to collect source code from various programs that have been developed to solve certain basic problems,
and present them to the researcher who can then modify the code according to the project needs, or simply read
the code to gain insights into modular coding techniques. We have collected source code for image processing
operations such as connected component analysis, image rotation, image segmentation, layout analysis, etc. The
authorship of each piece of code is specified, and since the use of the code is strictly internal within CEDAR, no
code copyrights are violated. The code is made available by the author after extensive testing, and is thus only
added to the digital library when it has performed the specified task on a large sample data set. Source code
included in the digital library can be changed only by the author, and this change is transparent to the digital
library retrieval system as long as the file names remain the same.

6.1.8. Tools

Executable tools that operate on the CEDAR image databases are an integral component of the digital library.
Many commonly used image processing, vision, and statistical toolboxes and packages already exist on CEDAR
machines, and our objective is to make these tools available to all CEDAR researchers through the digital library.
By providing these tools to the researchers, we make it possible for them to conduct research in an efficient and
productive manner. System tools are tools that enable us to perform basic system functions, such as file viewers,
compilers, editors, debuggers, etc.; these are generally available. The digital library contains application
tools, i.e., tools that perform specific operations, such as image processing, vision, neural nets, matrix & vector
operations packages, statistical packages, etc., as well as tools developed by CEDAR researchers for solving specific
problems in document analysis and recognition. Application tools provide modularity and versatility to the digital
library since they execute on data in the document archives. This also results in a "bootstrapping" effect, since
the application tools that manipulate the data, e.g., for document analysis and optical character recognition, are
derived from contents of the data archives.

6.1.9. Videos

Videos related to CEDAR research are included in the digital library. The objective is to make available
multimedia information about different research techniques, as well as their impact on the state-of-the-art in
current research and the development of products based on the research. Multimedia presentations about CEDAR
research from such diverse sources as scientific television programs and animated illustrations of techniques are
stored in the digital library in MPEG or QuickTime format, and when accessed are acted on by the appropriate
viewing software that "plays" the video on the user's workstation. We are also investigating techniques to "index"
video sequences, so that the user can go to a specific part of a video based on topic-based annotations created for
sequences of interest.

6.2. System Implementation

The server for the CEDAR digital library is a dedicated Sun SPARCstation 2 running the NCSA HTTP
daemon version 1.5 under Solaris 5.5, and is connected to a local CEDAR research network of over 150 Sun
workstations. Because of this local network connectivity, all information that is stored in the CEDAR shared file
systems is directly accessible through the digital library. Since the information contained in this digital library
includes results from ongoing sponsored research, the digital library is built as an "intranet" and is not accessible
outside of CEDAR.

New information is incorporated into the digital library by first determining the basic information source,
i.e., to which section of the library it belongs. This could be considered equivalent to "cataloguing" a new book
that is received into a physical library. A primary link is made to the new information through the chosen
section, and secondary links are created from other sections that are relevant. For example, when information
related to Japanese Optical Character Recognition (JOCR) was added to the library, the primary link was to the
Techniques section, where a page was created which gave a brief description of the project, and had links to a
JOCR demonstration as well as a link to a description of the CEDAR JOCR database. Secondary links from the
Demonstration and Database sections of the library were then established.
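
The primary and secondary links in the JOCR example can be sketched as a small cross-index. The `Catalogue` class and its methods are hypothetical and serve only to illustrate the cataloguing step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class Catalogue:
    """Records one primary section per item plus any number of secondary links."""

    def __init__(self) -> None:
        self.primary: Dict[str, str] = {}                      # item -> primary section
        self.links: Dict[str, Set[str]] = defaultdict(set)     # section -> linked items

    def publish(self, item: str, primary_section: str,
                secondary_sections: Iterable[str] = ()) -> None:
        self.primary[item] = primary_section
        self.links[primary_section].add(item)
        for section in secondary_sections:
            self.links[section].add(item)

    def sections_for(self, item: str) -> List[str]:
        """All sections from which the item is reachable, in sorted order."""
        return sorted(s for s, items in self.links.items() if item in items)


catalogue = Catalogue()
catalogue.publish("JOCR", "Techniques", ["Demonstrations", "Databases"])
reachable = catalogue.sections_for("JOCR")
```

Recording the primary section separately preserves the "catalogued under" relationship even after many secondary links have been added.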

A system has been set up within CEDAR whereby every time a research project, or a logically distinct piece of
research, is completed, the researchers involved in that effort use a standard set of specified procedures to convert
their research material into the appropriate formats required for the different sections of the digital library, into
which the material is then included. Similarly, ongoing research is included in the digital library when it is deemed
to be of interest to a significant number of researchers spanning project groups. Each researcher performs the
necessary format conversions according to the specified procedures. The researcher is marked as the author of
the material when it is converted for HTTP access.

The actual "publishing" of the material, i.e., the incorporation of converted material into the digital library, is
done by the CEDAR Digital Librarian, or "cybrarian", who determines the sections/sub-sections into which each
homogeneous unit of information should be placed. This is done in consultation with the appropriate Project
Manager who supervised the research. Once the information is placed in the digital library, extensive cross-
indexing is performed from every relevant section and sub-section, so that the material is directly accessible from
each relevant portion of the digital library.

6.3. Representation of Information

As previously mentioned, each section in the CEDAR digital library represents a specific information source,
and there is extensive cross-referencing between the different sections. Each information source is represented
by its own "home page", which contains a brief description of the information source as well as links to all the
information accessible in the library that pertains to that information source. For example, the home page for
Source Codes contains a brief description of the types of source codes available through the library, and contains
links (with descriptions) to individually complete source codes that perform functions such as journal image
segmentation, connected component analysis, holistic off-line handwritten word recognition, etc.

Currently, each piece of information is a file in its native format. For example, images are stored as GIF,
TIFF, or HIPS format files, research papers are stored as HTML documents or PostScript files, etc. The full
range of data formats supported in the digital library is shown in Table 1.

Type of document             Supported formats

Description of techniques    HTML, ASCII text
Scanned papers               GIF, TIFF, HIPS, PostScript, Sun Raster
Source code                  ASCII text
Videos                       QuickTime, MPEG
Demonstrations               Perl, C, CGI Forms, Java, JavaScript
Directory archives           FTP-able Solaris file structures
Databases                    GIF, TIFF, HIPS, PostScript, Sun Raster, ASCII
Slide shows                  PostScript, GIF, HTML, Java
Tools                        Solaris object code

Table 1: Data formats supported by Taxila.

File format conversion is a very important consideration for maintaining the versatility of the digital library.
WWW browsers such as Netscape Navigator, which display HTML documents, will normally allow GIF/TIFF files
as inlined images. Other formats such as PostScript cause a new window to be opened, with the contents of the
file displayed in the new window. Therefore, in order to store a research paper that contains embedded figures
and tables, the following conversions are performed: (a) documents originally written in LaTeX2 are converted
into HTML using the LaTeX2HTML conversion program, (b) figures included in such documents are converted
from PostScript format into GIF format using a pstogif converter, and then made "transparent", if required,
using a transgif converter which also translates the image from GIF87a format to GIF89a format, (c) tables are
converted into HTML Table format (which can be displayed by Netscape Navigator version 3.0).

Among the representation schemes that we are investigating for future enhancements to the digital library
is an object-oriented database model (using a database system such as Illustra) in which the information in the
digital library will be stored as a hierarchical set of objects. Each object will contain the data associated with
that object as well as functions that determine the category of the information and the tools required to present
the information to the user. All the data conversion and interpretation operations for an object will therefore be
done by the functions associated with that object, making the interface transparent not only to the user but
also to all the higher levels of objects in the library.

7. DOCUMENT BROWSER APPLICATION

Being a research center that conducts research in document analysis and recognition, we naturally incorporate
document analysis and recognition capabilities into our digital library. We are conducting research into extracting
relevant information from the scanned raster image of a document in response to user needs. The digital library
contains many scanned papers that are stored in digitized raster form. The objective is to do some or all of
the following on a given document image: (a) perform layout analysis on the document to find all the text and
graphics regions; (b) label all the text regions according to their identities such as "title", "footnote", etc.; (c)
identify basic logical linkages such as that between a photograph and its caption; (d) group related regions into
logical units such as a "newspaper story"; (e) determine the reading order of the text blocks within a logical unit;
and (f) selectively read the text in a given logical unit. We are building a document browser capability into
the digital library which will perform the above functions. This would enable the user to extract any relevant
information while browsing a document.
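
To make step (e) concrete, the sketch below orders the text blocks of one logical unit with a naive column-then-row sort. This is a simplified stand-in and not the knowledge-based method used by DeLoS; the `Block` class, its fields, and the `column_tolerance` parameter are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    label: str   # e.g. "title", "paragraph", "caption"
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels


def reading_order(blocks, column_tolerance=20):
    """Order blocks column by column (left to right), top to bottom in each.

    A block joins the current column when its left edge lies within
    `column_tolerance` pixels of the left edge of that column's first block.
    """
    columns = []  # each entry is the list of blocks in one column
    for block in sorted(blocks, key=lambda b: b.x):
        if columns and abs(block.x - columns[-1][0].x) <= column_tolerance:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y))
    return ordered


if __name__ == "__main__":
    unit = [
        Block("para-2", x=10, y=300),
        Block("para-1", x=12, y=50),
        Block("para-4", x=400, y=60),
        Block("para-3", x=405, y=20),
    ]
    print([b.label for b in reading_order(unit)])
```

A sort like this fails on layouts where a headline spans several columns, which is one reason a rule-based approach is attractive for newspaper pages.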

A "document browser" application is being developed based on this approach. The browser allows a user
to interactively specify queries on the documents in the digital library using a graphical user interface, provides
feedback on the candidate documents and their properties at each stage of the query and retrieval process, and
allows refinements of the query based on the intermediate results of the search. Results of a query can be
displayed either as an image or as formatted text, and the browser allows the user to specify the level of detail in
the displayed results.

8. RESULTS

Experimental results for this approach have been very encouraging. A large proportion of CEDAR researchers
utilize the digital library to obtain information on techniques and related material relevant to their research.
For example, researchers from various project groups have been using the digital library for information on
handwritten word recognition, which is used in several of our current projects, as well as for archived material
on OCR (optical character recognition) algorithms, which are used in CEDAR's character recognition engines
to read addresses on envelopes, names on forms, etc. As the material in the digital library has become more
comprehensive, and its organization and retrieval capabilities have become more extensive, usage has grown to
the point that the digital library has become the one centralized resource to which all CEDAR researchers can
turn for a variety of comprehensive, accurate and up-to-date information related to past and current CEDAR
research.

The DeLoS system, which is used to identify parts of documents to be retrieved, has been trained using data
from page images of various newspaper and journal pages, and extensively tested on data from digitized images
of newspaper pages.

The DeLoS system has been tested on a variety of newspaper pages (e.g., The Buffalo News, USA Today).
Overall, the DeLoS system performed fairly well for images from newspapers. Figure 2 shows the performance of
the system for the USA Today newspaper pages, in terms of percentages of the original blocks correctly classified,
grouped and read-ordered. It also shows the percentages of correctly segmented blocks that are correctly classified,
grouped and read-ordered. Performance results for The Buffalo News followed a similar pattern (and have been
described in detail in [5]). As we can see from the table in Figure 2, block segmentation and block type-categorization
in the original images proved to be a deciding factor in the performance of the system.

Document ID   Block Classif.   Block Grouping   Read-Ordering
Page01        96.1 %           73 %             72.7 %
Page02        96.7 %           80.6 %           92.3 %
Page03        85.1 %           100 %            100 %
Page04        84.2 %           73.6 %           87.5 %
Page05        91.3 %           82.6 %           71.4 %
Page06        85.1 %           66.6 %           80 %
Page07        88.4 %           84.6 %           100 %
Page08        100 %            94.4 %           100 %
Page09        70.3 %           85.1 %           57.1 %
Page11        100 %            97.1 %           86.6 %
Page12        82.6 %           69.5 %           83.3 %
Page13        84 %             48 %             75 %

BLOCK CLASSIF. = % of blocks correctly classified
BLOCK GROUPING = % of blocks correctly grouped
READ-ORDERING = % of text blocks correctly read-ordered

Figure 2: Performance of DeLoS on pages of USA Today.
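
The per-page figures in Figure 2 can be summarized with a few lines of arithmetic. The sketch below computes an unweighted mean for each column and finds the weakest page for a given measure. These summaries are not reported in the paper; because the number of blocks per page is not given, the means weight every page equally and may differ from an overall per-block rate. The names `FIGURE_2`, `column_means`, and `weakest_page` are introduced here for illustration.

```python
from statistics import mean

# Per-page percentages transcribed from Figure 2, in the order
# (block classification, block grouping, read-ordering).
# Page10 does not appear in the figure.
FIGURE_2 = {
    "Page01": (96.1, 73.0, 72.7),
    "Page02": (96.7, 80.6, 92.3),
    "Page03": (85.1, 100.0, 100.0),
    "Page04": (84.2, 73.6, 87.5),
    "Page05": (91.3, 82.6, 71.4),
    "Page06": (85.1, 66.6, 80.0),
    "Page07": (88.4, 84.6, 100.0),
    "Page08": (100.0, 94.4, 100.0),
    "Page09": (70.3, 85.1, 57.1),
    "Page11": (100.0, 97.1, 86.6),
    "Page12": (82.6, 69.5, 83.3),
    "Page13": (84.0, 48.0, 75.0),
}
COLUMNS = ("block classification", "block grouping", "read-ordering")


def column_means(table):
    """Unweighted mean of each column across all pages."""
    return tuple(mean(row[i] for row in table.values())
                 for i in range(len(COLUMNS)))


def weakest_page(table, column):
    """Page ID with the lowest value in the given column index."""
    return min(table, key=lambda page: table[page][column])


if __name__ == "__main__":
    for name, value in zip(COLUMNS, column_means(FIGURE_2)):
        print(f"{name}: {value:.1f} %")
    print("lowest grouping score:", weakest_page(FIGURE_2, 1))
```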

9. CONCLUSIONS

We have presented an approach to retrieving desired components of a document in response to user queries.
The structure of a document structure analysis system (DeLoS) that enables this was also presented. This
approach allows a user to retrieve logical components of a document (e.g., a newspaper article, a citation list, a
photograph, a chart, etc.). It also provides the ability to perform multi-stage index building based on the query,
so that only those documents in the library that are candidates for satisfying a given query have to be completely
analyzed, thus saving considerable search time.
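
The saving from multi-stage processing can be sketched as a two-stage filter: a cheap test on stored metadata selects candidates, and only those candidates receive full structure analysis. The function `two_stage_retrieve`, the dictionary keys `id` and `type`, and the `analyze` callback are hypothetical stand-ins; they do not reproduce the actual indexing scheme or the DeLoS interface.

```python
def two_stage_retrieve(documents, doc_type, component, analyze):
    """Return {document id: matching components} for one query.

    documents: iterable of dicts with at least "id" and "type" keys
    doc_type:  document type named in the query, e.g. "newspaper"
    component: logical component wanted, e.g. "photograph"
    analyze:   expensive function mapping a document to
               {component label: list of extracted components}
    """
    # Stage 1: cheap filter on metadata; no image analysis is performed.
    candidates = [doc for doc in documents if doc["type"] == doc_type]

    # Stage 2: full analysis, run only on the surviving candidates.
    results = {}
    for doc in candidates:
        components = analyze(doc)
        if component in components:
            results[doc["id"]] = components[component]
    return results


if __name__ == "__main__":
    library = [
        {"id": "n1", "type": "newspaper",
         "parts": {"photograph": ["p1"], "article": ["a1"]}},
        {"id": "j1", "type": "journal",
         "parts": {"citation list": ["c1"]}},
        {"id": "n2", "type": "newspaper",
         "parts": {"article": ["a2"]}},
    ]
    # Stand-in for structure analysis: read precomputed parts.
    print(two_stage_retrieve(library, "newspaper", "photograph",
                             lambda doc: doc["parts"]))
```

In this toy library the journal document is never passed to `analyze`, which is where the search-time saving comes from.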

We have also presented the design and implementation of Taxila, the CEDAR digital library for research
on document analysis and recognition. This digital library is being used extensively by CEDAR researchers to
retrieve material of interest.

The CEDAR digital library is a prototype library for multimedia research information, and can be used as a
model for WWW-based information repositories. It is possible for heterogeneous research information sources in
non-local Web sites to be linked to the Taxila structure, thereby contributing to more efficient and effective sharing
of research information among researchers at distributed locations. In addition, Taxila acts as an interactive
document analysis and recognition tool for researchers, so that parts of a document can be selectively analyzed,
and this not only furthers the state-of-the-art in this dynamic research area, but also provides a logical linkage
between the areas of document understanding and digital library technology. Thus, Taxila should provide a
resource for those interested in the processes of creating digital document libraries as well as automating data
entry from paper.

10. REFERENCES
1. S.W. Lam and S.N. Srihari. Multi-domain document layout understanding. In Proceedings of ICDAR-91,
pages 112–120, Saint-Malo, France, Sept. 30–Oct. 2, 1991.
2. L. Lamport. LaTeX: A Document Preparation System. Addison-Wesley, 1986.
3. M. Landy, Y. Cohen, and G. Sperling. HIPS: A Unix-based image processing system. Computer Vision,
Graphics, and Image Processing, 25:331–347, 1984.
4. A.M. Nazif and M.D. Levine. Low level image segmentation: An expert system. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI-6(5):555–577, September 1984.
5. D. Niyogi. A Knowledge-Based Approach to Deriving Logical Structure from Document Images. PhD thesis,
State University of New York at Buffalo, 1994.
6. D. Niyogi and S.N. Srihari. An integrated approach to document decomposition and structural analysis.
International Journal of Imaging Systems and Technology, 7:330–342, 1996.
7. D. Niyogi and S.N. Srihari. A rule-based system for document understanding. In Proceedings of AAAI-86,
volume 2, pages 789–793, Philadelphia, PA, August 15–22, 1986.
8. D. Niyogi and S.N. Srihari. Knowledge-based derivation of document logical structure. In Proceedings of
ICDAR '95 (Third International Conference on Document Analysis and Recognition), Montreal, Canada,
August 1995.
9. D. Niyogi, S.N. Srihari, and V. Govindaraju. Analysis of printed forms. In H. Bunke and P.S.P. Wang, editors,
Handbook on Optical Character Recognition and Document Image Analysis. World Scientific Publishing Co.,
Singapore, 1996.
10. R.K. Srihari. PICTION: A system that uses captions to label human faces in newspaper photographs. In
Proceedings of AAAI-91, pages 80–85, Anaheim, CA, 1991.
11. S.N. Srihari, S.W. Lam, J.J. Hull, R.K. Srihari, and V. Govindaraju. Intelligent data retrieval from raster
images of documents. In Proceedings of Digital Libraries '94 (The First Annual Conference on the Theory
and Practice of Digital Libraries), College Station, TX, June 1994.

SPIE Vol. 3027 • 0277-786X/97/$10.00
