Historical documents as monuments and as sources

Panos Constantopoulos, Martin Doerr, Maria Theodoridou, Manolis Tzobanakis

Institute of Computer Science
Foundation for Research and Technology Hellas
{panos, martin, maria, tzoban}@ics.forth.gr

Abstract. The main functions of a digital library of historical documents are the digital recording of documents, transcriptions and
translations of documents, subject indexing, annotation and retrieval. Using such a system, scholars can efficiently study the
documents without involvement of the originals, thus ensuring the preservation of documents and the protection of researchers from
exposure to potential health hazards. An important part of the study of historical documents consists of classifying the material and
annotating it in such a way that retrieval is facilitated in the future.
We present the development of a historical document management system that supports both digital library functionality and
archival management of the original documents. This system includes semantic indexing and multifaceted classification of
historical documents with the use of a built-in thesaurus aimed at attaining satisfactory levels of efficiency of the classification
process, completeness and precision of the retrieved information, and user-friendliness.

Key words: digital libraries, historical documents, subject classification

1 Introduction

Along with the expansion of computer systems, networks and of the Internet in particular, has come a rapid expansion of digital libraries. With the term digital libraries we refer to collections of documents which have been stored and are accessible electronically together with a set of relevant services. The management of historical documents which have significant cultural and historical value and may have undergone varying degrees of deterioration is attracting increasing interest in the area of digital libraries [1].

Historical documents can be viewed from two perspectives, that of monuments and that of source material. On the one hand, viewing the documents as monuments, the objective is to support the creation of a digital archive that will ensure preservation of the originals and facilitate the management of the physical archive using consistent, appropriate archival catalogue entries. Such a digital library should provide the entire spectrum of functions for the various stages in the treatment of historical documents: document acquisition, archival and management, cataloguing and annotation, processing and transmission, as in museum information systems. These functions serve a variety of purposes, such as supporting the preservation, documentation, study and management of historical documents. They also protect people from exposure to potential health hazards and assist in the production and dissemination of electronic versions of publications and exhibits, which promote cultural education.

On the other hand, we can view documents as source material and as such we need to manage content descriptions, transcriptions and translations, and provide thematic indexing. An important part of the study of historical manuscripts consists of classifying the material and annotating it in such a way that retrieval is facilitated in the future.

The documentation and management of source material and monuments are commensurate and of the same importance. In an information management approach the two views are both notionally and functionally interconnected. In this spirit, we developed a historical document management system that comprises a digital library of historical documents and supports the main functions of recording digital historical documents, transcriptions and translations of documents, subject indexing, annotation and retrieval. Using such a system, scholars can efficiently use and study the documents without involvement of the originals, thus ensuring the preservation of documents and the protection of researchers from exposure to potential health hazards. The system supports semantic indexing and multifaceted classification of historical documents with the use of a built-in thesaurus. A main feature of the system is its rich and extensible document model. Our aim during the design and development of the system was to attain satisfactory levels of efficiency of the classification process, completeness and precision of the retrieved information, and user-friendliness. The system supports remote access to the digital library through the Web, allowing users to work in a familiar way, using their own preferred environment independently of their platform. The implementation of the system is based on Java for the client side and on the Semantic Index System [2] for the server side.

Two important collections with significant cultural and historical value have provided material for the development of the historical document and archive management system. In the context of project ARCHON - A Multimedia System for Archival, Annotation and Retrieval of Historical Documents we investigated the classification and archival of the Turkish Archive of Heraklion, the Municipal Archive of Heraklion and the Venetian Archive, which comprise the historical archives of the Vikelea Municipal Library of Heraklion, dated from the late 1600s to early 1900s. The other project concerned the Turkish Archive of Chania.

2 Subject classification

An important part of the study of historical documents consists of classifying the material and annotating it in such a way that retrieval is facilitated. State-of-the-art OCR software is inaccurate on all but the most uniformly printed documents (let alone manuscripts), requiring proofreading and error correction. Thus, the automatic transcription of historical manuscripts is not possible and we can only rely on manual techniques. An obvious approach to the document classification problem would be the implementation of a keyword system. Manual keyword assignment is a time-consuming process, and historians, scholars and other researchers that study the documents are not willing to spend too much time to input and classify the documents. Moreover, keywords are not necessarily unique and we may easily end up using keywords from at least three different vocabularies: the vocabulary of the author(s) of the documents, the vocabulary of the cataloguer, indexer, or classifier of the document and the vocabulary of the searcher. As these vocabularies have evolved over different time periods, it is generally not easy to create satisfactory mappings between them. In addition, the risk of mismatch in transitions from one vocabulary to another is quite high. An approach to this problem is to identify concepts, rather than words, in a given knowledge domain, which, organized in term thesauri, can provide consistent classification, better retrieval precision, preservation of the identity of concepts and a domain knowledge base.

Historical documents may be classified through different procedures. One scenario is that an indexer is assigned to classify a specific corpus of material. A second scenario is that a researcher studying a specific topic assigns classification information to the documents that he encounters during his study. Scholars classify documents according to their interests of study and we can never assume that the classification of a document or of an entire archive is complete. We essentially have to deal with an open world. Thus, it is important to provide an easy to use, straightforward and efficient user interface that will enable precise and fast classification of the documents. The process of classification itself should be intuitive and the system should be able to fill in automatically information that is already stored in, or can be derived from, the knowledge base of the system. Additionally, the system should follow the context and the user profile and automatically offer the most probable options.

3 Building a historical document classification and management system

The archival and management of documents in the digital library are driven by a model of documents, archive organization and processes, which takes into account the ISAD (G) General International Standard Archival Description [6], the EAD Encoded Archival Description Document Type Definition [7] and the Dublin Core Metadata Elements Set [8].

The model distinguishes between the historical (original) and the current organization of the material and keeps the correlations between the two organizations (Figure 1). This distinction allows an integrated view of material that is copied or scattered in different physical locations. For example, the Venetian archives are physically located in Italy, while the Vikelea Municipal Library maintains microfilm copies of the same material.

The organization of the archives comprises Fonds, Subfonds and Units of description (series, books, files, pages, sheets etc.), according to the ISAD (G) terminology, and we distinguish between the physical, the conceptual and the electronic material, providing the correlations among them (Figure 2). For example, a page is a physical unit that may contain more than one document (conceptual unit). Finally, we have modeled actions affecting the material, such as editing or scanning, and states during material processing that will facilitate the streamlining of some parts of the document input, classification and retrieval user interface (Figure 3).

Subject classification of historical documents is based on five facets which represent distinct concept categories related to information about, as well as information contained in, the documents. Important elements about or in a document are actors, dates, places, purposes, objects and their names. There exists significant information regarding document creation, such as who wrote the document, to whom it was addressed, when and where it was written, for what purpose it was written, and what it quotes. Moreover, scholars are interested in information quoted in the document. This includes significant actions or activities mentioned in the document, such as what activity is described, who is involved, where and when it took place, and what objects were involved. In other words, the questions that have to be answered are Who?, Where?, When?, What? and How?.
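The distinctions drawn in this section (historical versus current organization, physical versus conceptual units, and copies derived from originals) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the class names, field names and sample labels are hypothetical, and the actual system is built on the Semantic Index System rather than on Python objects.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    """Conceptual unit: the text a scholar classifies and retrieves."""
    title: str


@dataclass
class Page:
    """Physical unit: one page may carry more than one document."""
    number: int
    documents: list = field(default_factory=list)


@dataclass
class Unit:
    """Unit of description with its historical and current placement."""
    label: str
    initially_kept_in: str            # historical organization
    now_kept_in: str                  # current organization
    copy_of: Optional["Unit"] = None  # "derived from" link to the original
    pages: list = field(default_factory=list)

    def source(self) -> "Unit":
        # Follow copy links back to the original unit.
        unit = self
        while unit.copy_of is not None:
            unit = unit.copy_of
        return unit


def integrated_view(units):
    """Group units by their original, wherever the copies are kept now."""
    view = {}
    for unit in units:
        view.setdefault(unit.source().label, []).append(unit.now_kept_in)
    return view


# Hypothetical sample data: an original kept in Italy and its microfilm copy.
original = Unit("register A", "Venetian Archive", "Italy",
                pages=[Page(1, [Document("deed"), Document("petition")])])
microfilm = Unit("microfilm of register A", "Venetian Archive",
                 "Vikelea Municipal Library", copy_of=original)

print(integrated_view([original, microfilm]))
```

Keeping the copy link separate from the two placement fields is what lets a single query see an original and its copies together, even when they are held in different locations.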
[Diagram. Legend: link kinds derived from (d), structural (s), corresponding (c); arrow kinds classification, generalization, attribute. Nodes: Archival Description; Archival Type, Physical Archival Type, Conceptual Archival Type; Fonds, Subfonds, Item, Unit of Description; Film Archive, Current Fonds, Historical Fonds, Current Subfonds, Historical Subfonds. Link labels: belongs to (s), subfonds of (s), copy of (d), now kept in (s), initially kept in (s), kept in (s), part of (c), copy of part (d), originates from (c).]

Figure 1: Modeling collections of historical documents

[Diagram. Legend: link kinds derived from (d), structural (s), corresponding (c); arrow kinds classification, generalization, attribute. Nodes: Archival Description; Archival Type, Physical Archival Type, Conceptual Archival Type; Item, Unit of Description, Series, Item Unit; Document, Picture, Page, Sheet, Photograph, Shot, Microfilm, Series of Binded Documents, Series of Loose Documents. Link labels: now kept in (s), initially kept in (s), corresponds to (c), copy of (d), contains (s).]

Figure 2: Modeling objects versus contents

[Diagram. Legend: link kinds derived_from, structural, corresponding; arrow kinds classification, generalization, attribute. Nodes: Occurence history; DescriptionType, EventType, ActionType; ArchivalDescription, ArchivalType, PhysicalArchivalType, ConceptualArchivalType, ElectronicDocumentType, ElectronicProcessingType; Fonds, UnitOfDescription, Item, ItemUnit, ElectronicDocument, ElectronicProcessing; SheetPage, Document, Picture; ScannedPage, Transcription, Translation; Editing, Scanning. Link labels: result, product, corresponds_to, produced_from.]

Figure 3: Modeling the electronic documents


[Diagram. Three layers: Faceted schema; Fact data base; Document Digital Library.]

Figure 4: Concept-based faceted classification of historical documents
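To make the idea of facet-typed terms in a polyhierarchy concrete, here is a minimal Python sketch. It assumes a tiny hand-written thesaurus in which each term records its facet and its broader terms; the term "religious site", the dictionary layout and the function names are hypothetical and not part of the system described here. Selected terms are combined with AND, and a broader term matches documents classified under any of its narrower terms.

```python
# Hypothetical thesaurus fragment: term -> (facet, broader terms).
# A term with two broader terms makes the hierarchy a polyhierarchy.
THESAURUS = {
    "agent": ("actors", []),
    "administrative role": ("actors", ["agent"]),
    "pasha": ("actors", ["administrative role"]),
    "place": ("places", []),
    "building": ("places", ["place"]),
    "religious site": ("places", ["place"]),
    "monastery": ("places", ["building", "religious site"]),
}


def ancestors(term):
    """Return the term together with all of its broader terms."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(THESAURUS[current][1])
    return seen


def matches(doc, query):
    """AND semantics: every selected term must cover some document term
    of the same facet, either directly or as a broader term."""
    for facet, wanted in query.items():
        expanded = set()
        for term in doc.get(facet, ()):
            expanded |= ancestors(term)
        if not wanted <= expanded:
            return False
    return True


# A document classified at the most specific level available.
doc = {"actors": {"pasha"}, "places": {"monastery"}}

print(matches(doc, {"actors": {"administrative role"},
                    "places": {"religious site"}}))
print(matches(doc, {"places": {"building", "religious site"}}))
print(matches(doc, {"actors": {"monastery"}}))
```

Because terms are looked up per facet, a term selected under the wrong facet never matches, which is one way a typed index avoids the ambiguity of flat keywords.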

[Diagram. Faceted schema layer: facets Actors (persons and organizations), Activities, Objects, linked by "invokes" and "refers to"; terms Pasha, Purchase, House, Fire. Fact data base layer: Omer Pasha, Velis House, Purchase, Fire 1658. Document Digital Library layer: Deed 43.]

Figure 5: Concept-based faceted classification of historical documents: An example
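The knowledge-building example of Figure 5 can be read as a small graph: one document about the purchase of Velis House by Omer Pasha, another about the fire of 1658 in which Velis House burned. The following Python sketch shows one way such a link could be followed at retrieval time. It is an assumption-laden illustration: the document identifiers, the flat sets of instances and the one-hop linking rule are hypothetical and do not describe how the Semantic Index System stores or traverses facts.

```python
from collections import defaultdict

# Instance-level classification of two documents (identifiers hypothetical).
CLASSIFICATION = {
    "doc_purchase": {"Omer Pasha", "Purchase", "Velis House"},
    "doc_fire": {"Fire", "1658", "Velis House"},
}


def build_index(classification):
    """Map each instance to the documents that refer to it."""
    index = defaultdict(set)
    for doc, instances in classification.items():
        for instance in instances:
            index[instance].add(doc)
    return index


def retrieve(instance, classification):
    """Return direct hits, plus documents sharing an instance with them."""
    index = build_index(classification)
    direct = set(index.get(instance, ()))
    linked = set()
    for doc in direct:
        for other in classification[doc]:
            linked |= index[other]
    return direct, linked - direct


direct, linked = retrieve("Omer Pasha", CLASSIFICATION)
print(sorted(direct))
print(sorted(linked))
```

Here the fire document is reached only through the shared instance Velis House. A practical system would need to decide which shared instances justify a link, since following every shared term would relate far too many documents.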

It is common practice for scholars to assign an annotation to documents while they study or translate them. In our approach, we base classification on the analysis of the original (prototype) text or of its annotation, description or translation. We provide a formal syntactic structure to assign a combination of terms from different facets to each document, which results in a precise characterization of the document according to the set of criteria represented by the facets. The facets constitute an extensible structure of hierarchical catalogues. In contrast with keyword systems, the index terms are typed by their facet and interrelated, forming a semantic network with unique identity and associated meaning, thus avoiding ambiguities (Figure 4). Hierarchical structures and especially polyhierarchies, as in our case, facilitate fast search and recognition. Classification takes place at type/class level or at "instance" level, where we have references to correlated real things. Simultaneous classification from independent aspects is also possible. The system builds up knowledge about the domain, which may be helpful during retrieval. The maintenance of such a semantic net can be assisted by a full text indexing facility. In Figure 5, we present an example of document classification and the knowledge building process: A document refers to the purchase of Velis house by Omer Pasha. Another document refers to a fire that took place in 1658 during which Velis house was burned. A scholar interested in documents concerning Omer Pasha might consider the document about the fire as relevant. Our system has built a link between the two documents and thus a query on Omer Pasha would retrieve both, although there is no direct reference to Omer Pasha in the second document, nor has it been classified with Omer Pasha as actor.

We now briefly present the five facets defined in the system:

Persons and Organizations

This facet groups the agents - either individuals, groups of people or organizations - that are involved in the creation of the document or referred to in it. The facet is organized according to types of agents, i.e. professions, social/religious castes, administrative roles, origin etc. For efficient retrieval based on names we keep a separate index of names with references to documents and additional information such as name of father, mother, origin, date.

The identification of a person may be difficult in several cases. We make the assumption that two persons are different unless proven otherwise.

Activities and actions

This facet groups the purposes and kinds of documents. For activities and actions it is better to do the classification according to their types and not the instances themselves. We investigate the possibility of using an existing thesaurus, such as SHIC [5], for this purpose.

Places

This facet groups the places referred to in a document or the place where it was created. Types under this facet include natural division (e.g. mountain, lake, river, valley), administrative division (e.g. prefecture, city, village) or buildings (e.g. monastery, church).

An interesting issue concerning this facet is how to support associative search, for example how to identify "Martin's house".

Time

This facet groups the chronological references made in or attributed to documents.

In the case of dates, recall is often considered more important than precision in document retrieval, thus an interesting issue is how to provide an intelligent browser based on a dynamic temporal index for events and a historical clock.

Objects

This facet groups the objects referenced in documents. Types include movable/fixed objects, monetary systems and payment transactions. In several cases an object may be identified through an activity or action.

4 Example of use

In this section we will give a brief presentation of the historical document management system that we developed. The system supports semantic indexing and multifaceted classification of historical documents with the use of a built-in thesaurus. The subject classification functions are built using the Semantic Index System (SIS) [2], a general purpose semantic network information management system, developed by the Information Systems Laboratory of ICS-FORTH. Document querying and retrieval are done through a Java/HTML User Interface, which communicates with the SIS through a Java Database Connectivity Driver (JDBC) [3].

Two important collections with significant cultural and historical value have provided test material for the development of the historical document and archive management system. The first consists of the Turkish Archive of Heraklion, the Municipal Archive of Heraklion and the Venetian Archive, which comprise the historical archives of the Vikelea Municipal Library of Heraklion, dated from the late 1600s to early 1900s. The second consists of the Turkish Archive of Chania.

The three above-mentioned historical archives of the Vikelea Municipal Library comprise approximately 1,500,000 pages of manuscripts. As a test for the digital archive of our historical document management system, we proceeded with the digitization of about 80,000 pages of the archives. These documents were scanned into electronic form, processed for image correction [4] and archived in our digital library. The users access the digital library through client programs that have been developed using Java and WWW technology. Java allows the same code to be executed on different platforms. To ease installation and maintenance we have built client programs to execute as Java applets inside standard Internet browsers (Netscape, Internet Explorer). To present the information on the user's monitor, a graphical user interface similar to the well-known Microsoft Windows Explorer GUI was created.

The user interface caters for both the monument and the source nature of documents during retrieval. The first is through the use of the archival catalogue. The archival catalogue entries used by the library to identify the documents in their physical location are preserved in the digital library and can be used for fast retrieval of documents by those users who are familiar with the physical organization of the archives.

The second and most interesting way of retrieving documents is by formulating document queries regarding the document content. Query formulation is achieved by selecting and combining terms from the five facets that have been defined. The terms of each facet are displayed to the user as expandable tree structures similar to the trees of Microsoft Windows Explorer (Figure 6).

Figure 6: Document Management System Query Formulation

For each facet the user may select one or more terms which are automatically combined with the AND operator. The query is thus formulated in steps and the user has an overall view of his selections at any point of the formulation process. The result of the query execution appears as a set of thumbnails that represent the digitized documents that matched the query. By clicking on the thumbnails the user can retrieve the image of the original document as well as its translation, if one is available (Figure 7). Additionally, the user may ask to view the classification information of a retrieved document, which might be useful in selecting new appropriate terms for query reformulation.

Figure 7: Document Management System - Retrieval

5 Conclusions

We have presented a management system for historical documents that supports semantic indexing and multifaceted classification of historical archives with the use of a built-in thesaurus. Documents are treated both as monuments, to be preserved and managed, and as sources of information content, collected, inter-linked and managed in a digital library. This system has been tested using archives of historical documents of the Vikelea Municipal Library of Heraklion and of the Turkish Archive of Chania.

Future work includes the development of an annotation system, the support for the development of specialized vocabularies per application, and the provision of Web accessibility.

References

[1] Digital Libraries: Future Research Directions for a European Research Programme, June 13-15, 2001, San Cassiano (Dolomites), Italy, http://delos-noe.iei.pi.cnr.it/activities/researchforum/Brainstorming/brainstorming-report.pdf
[2] http://www.ics.forth.gr/isl/r-d-activities/sis.html
[3] http://www.ics.forth.gr/isl/manuals/api.doc
[4] A Visual Tagging Technique for Annotating Large-Volume Multimedia Databases. K.V. Chandrinos, J. Immerkaer, Martin Doerr, P.E. Trahanias, 5th DELOS Workshop on Filtering and Collaborative Filtering
[5] http://www.holm.demon.co.uk/shic.htm
[6] ISAD (G) General International Standard Archival Description: http://www.ica.org/
[7] EAD Encoded Archival Description: http://lcweb.loc.gov/ead/
[8] Dublin Core Metadata Elements Set: http://dublincore.org
