USER'S GUIDE
ASIAN OCR ADD-ON
FOR READIRIS
asian.PMD
20/02/2006, 10:36
1990-2006 I.R.I.S. All rights reserved
OCR technology by I.R.I.S.
Connectionist, AutoFormat and Linguistic technology by I.R.I.S.
INTRODUCING THE ASIAN OCR ADD-ON
Congratulations on acquiring the Asian OCR add-on!
Once it is installed, Readiris recognizes four extra Asian languages:
Japanese, Simplified Chinese, Traditional Chinese and Korean. (It goes without saying
that many Asian languages, such as Malay and Tagalog, are supported by the standard Readiris software because they use the Latin alphabet.)
Similar software options are available to recognize Hebrew and Arabic documents. The Arabic OCR add-on is exclusively available to users of Readiris
Corporate!
The extra "Asian OCR Add-On" CD-ROM complements your installed
Readiris license; the add-on only works with version 11 and later of the Readiris
software. This technical note explains all you need to know to recognize Asian documents successfully.
A WORD ABOUT THESE ASIAN LANGUAGES
Contrary to the Latin, Greek and Cyrillic alphabets, these languages use small
icons (ideograms), not letters, to represent reality.
Japanese
The Japanese writing system actually uses a mix of alphabets: the phonetic
character sets are called Hiragana and Katakana, and the ideograms, Chinese in
origin, are called Kanji. In normal Japanese writing, Hiragana and Kanji are used,
while Katakana is used for words borrowed from (non-Chinese) foreign languages.
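
The mix of character sets described above can be observed in any piece of Japanese text. The following is a minimal sketch that assumes the text is available as a Unicode string and uses the basic Unicode block ranges for Hiragana, Katakana and the CJK ideographs. The function names are made up for this illustration, and the sketch says nothing about how Readiris itself classifies characters.

```python
def script_of(ch: str) -> str:
    """Classify one character by its Unicode block (basic ranges only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        # CJK Unified Ideographs: Kanji in a Japanese text
        return "kanji"
    return "other"


def script_counts(text: str) -> dict:
    """Count how many characters of each script occur in the text."""
    counts = {}
    for ch in text:
        name = script_of(ch)
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    # "I drink coffee": Kanji and Hiragana, with Katakana for the loanword.
    print(script_counts("私はコーヒーを飲む"))
```

Rare ideographs outside the basic block would be reported as "other" by this sketch.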
An educated person can read about 10,000 Kanji symbols; the government
has published a list of some 2,000 symbols that it considers basic.
Japanese is generally written vertically beginning on the right, but many texts
today are written horizontally to permit the inclusion of English words, Arabic
numerals and mathematical and chemical formulae.
Simplified and Traditional Chinese
As a script, Chinese is derived from picture writing. It is written with thousands of distinctive characters called ideograms or pictograms which have
no relation to the sound of a word. The earliest Chinese characters were pictographs, such as a crescent for the moon, or a circle with a dot in the center to
represent the sun. Gradually, these gave way to non-pictorial ideographs which,
in addition to standing for tangible objects, also represented abstract concepts.
The majority of Chinese characters consist of two elements: a signific, which
indicates the meaning of a word, and a phonetic, which indicates the sound.
In a large dictionary there are 40,000 to 50,000 characters (many of which are
archaic or obscure), while the telegraphic code book contains nearly 10,000 symbols. Some 3,000 symbols are used on a daily basis.
Simplified Chinese is a simplified version of Traditional Chinese; the
500 most common symbols were simplified. Simplified Chinese is used in mainland
China and in Singapore; Traditional Chinese is used in Hong Kong, Taiwan, Macau and the overseas Chinese communities.
Every character has exactly the same amount of space, no matter what its
shape may be. There are no spaces between characters; the characters which
make up multi-syllable words are not grouped together. When reading Chinese,
you have to work out which characters belong together!
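
Working out which characters belong together is what software calls word segmentation. One simple, well-known approach is forward maximum matching: at each position, take the longest run of characters that appears in a word list. The sketch below illustrates that idea only. The tiny word list and the function name are invented for the example, and nothing here describes how Readiris processes Chinese text.

```python
def segment(text: str, lexicon: set, max_len: int = 4) -> list:
    """Split text into words by greedy longest-match against a lexicon."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    # Hypothetical three-word lexicon: "we", "like", "Chinese".
    lexicon = {"我们", "喜欢", "中文"}
    print(segment("我们喜欢中文", lexicon))
```

Characters that are not covered by the word list come out as single-character words, which keeps the sketch from ever getting stuck.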
Chinese can be written vertically and from right to left or horizontally from left
to right.
Korean
Korean is not related to Chinese, although it has used the Chinese characters,
together with the Korean alphabet, for many centuries. (70% of the vocabulary
comes from Chinese.)
The Korean alphabet, the Hangul script, invented in the years 1443-46, is
the only true alphabet native to the Far East.
The shapes of some consonants are graphical representations of the speech
organs used to pronounce them! Other consonants were created by adding extra
lines to the basic shapes. The shapes of the vowels are based on three elements:
man (a vertical line), earth (a horizontal line) and heaven (a dot). The dot became
a short line.
The letters are combined into syllable blocks. As in the languages written with the Latin alphabet, words can contain several syllables and spaces are inserted between the
words.
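
The way letters combine into syllable blocks is mirrored in how the Unicode standard encodes Korean: every precomposed syllable has a code point computed from the index of its initial consonant (19 possibilities), its vowel (21) and its optional final consonant (27, plus "none"). The sketch below shows that arithmetic. It is an illustration of the Unicode encoding, not of the Readiris recognition engine, and the function names are made up for the example.

```python
HANGUL_BASE = 0xAC00   # code point of the first precomposed syllable block
LEAD_COUNT = 19        # initial consonants
VOWEL_COUNT = 21       # vowels
TAIL_COUNT = 28        # 27 final consonants plus "no final consonant"


def compose(lead: int, vowel: int, tail: int = 0) -> str:
    """Build one syllable block from the indices of its letters."""
    return chr(HANGUL_BASE + (lead * VOWEL_COUNT + vowel) * TAIL_COUNT + tail)


def decompose(syllable: str) -> tuple:
    """Return the (lead, vowel, tail) indices of a syllable block."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < LEAD_COUNT * VOWEL_COUNT * TAIL_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(offset, VOWEL_COUNT * TAIL_COUNT)
    vowel, tail = divmod(rest, TAIL_COUNT)
    return lead, vowel, tail


if __name__ == "__main__":
    # The two syllables of the word "Hangul".
    print(compose(18, 0, 4), compose(0, 18, 8))
```

In this indexing, the first syllable of "Hangul" is built from initial consonant 18, vowel 0 and final consonant 4.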
Korean is generally written horizontally from left to right but can also be written vertically from right to left.
SYSTEM REQUIREMENTS
The Asian OCR add-on does not modify the system requirements of Readiris.
It complements version 11 and later of the Readiris software - it does not
work on earlier versions.
This extra module takes about 50 MB of hard disk space.
However, you will need a localized Asian version of the Windows operating
system to make good use of the Asian texts. Alternatively, you can use Word
2003, Word 2002 and Word 2000 to view and edit such documents: Microsoft
Office 2003 System, Office XP and Office 2000 were specifically designed to
cope with documents in many different languages.
None of this is needed to view Asian PDF documents: the Adobe Acrobat
and Adobe Reader software automatically update when documents in exotic languages - Japanese, Korean, Russian, Hebrew etc. - must be read!
INSTALLING THE ASIAN OCR ADD-ON
Like the basic Readiris software, the Asian OCR add-on
is delivered exclusively on an autorunning CD-ROM. To install it, simply insert the CD-ROM in your CD-ROM drive and wait for the installation
program to start running.
Should the installation not begin to run when the CD-ROM is inserted in your
CD-ROM drive, run the setup program SETUP.EXE to install the software.
Users of Windows XP, Windows 2000 and Windows NT 4.0 must ensure that
they have the necessary access rights - contact the system administrator if
necessary.
Follow the on-screen instructions. The installer automatically detects where
your Readiris software is installed (the software will only install if Readiris 11 is
present on your PC), but some installation options are offered: you can limit the
installation to specific languages. A sample image is installed automatically for
the selected languages.
As this add-on software smoothly complements your installed Readiris software, no new submenu or application is added by the installation program. But
you will find a shortcut to this electronic document added to the submenu "I.R.I.S.
Applications - Readiris".
UNINSTALLING THE ASIAN OCR ADD-ON
There's only one correct way of uninstalling Readiris: by using the Windows
(un)install wizard. You are strongly advised not to uninstall this Readiris
software module by manually erasing the program files.
Execute the following steps to make use of the Windows (un)install wizard.
Click "Settings" in the Windows "Start" menu and go to the "Control Panel".
Click the "Add/Remove Programs" icon in the Control Panel.
Follow the on-screen instructions to remove the Asian OCR add-on
module.
CONFIGURING AN ASIAN WORKING ENVIRONMENT
So much for the installation of the Asian OCR software. But we must also
ensure that your computer system handles the ideograms of these Asian languages correctly.
It is not necessary to install a localized Asian version of the Windows operating system to make good use of such recognized texts. You can also use Word
2003, Word 2002 and Word 2000 to view and edit such documents: Microsoft
Office 2003 System, Office XP and Office 2000 were specifically designed to cope with
documents in many different languages.
Refer to the documentation supplied with your Windows or Office software to
learn how to set up and use your Asian-enabled environment.
Use an Asian version of the Adobe Reader or Adobe Acrobat software to
view and edit Asian PDF documents. (The Asian versions of Adobe Reader
can be found on the Readiris CD-ROM!)
Taking these steps ensures that your computer system copes with the symbols
(ideograms) of these Asian languages. If your operating system is not Asian-enabled, you will inevitably generate illegible output whenever you try to display
text in one of these languages. This phenomenon is not caused by the Readiris
software, but by the setup of Windows: Readiris does recognize the ideograms of
these Asian languages, but when you open the text file with your word processor,
your computer system does not display them correctly on screen.
None of this is needed to view Asian PDF documents: the Adobe Acrobat
and Adobe Reader software automatically update when documents in exotic languages - Japanese, Korean, Russian, Hebrew etc. - must be read! (Several versions of Adobe Reader can be found on the Readiris CD-ROM.)
RECOGNIZING ASIAN DOCUMENTS
Assuming that your environment is set up correctly, we will now turn to the
recognition of these languages.
When you start up your Readiris software, there are only a few new elements in
the user interface. We'll study these in detail.
Evidently, you can now select the languages Japanese, Simplified Chinese,
Traditional Chinese and Korean with the "Language" button on the main toolbar.
The language setting influences the page analysis. In other words, indicate
the language before you execute the page analysis! (Should you have forgotten
this, select the appropriate language afterwards and the page analysis is re-executed automatically...)
Specialized analysis routines are used for these languages. The interline spacing is in most cases bigger than with Western texts, and the text is less dense:
the words are made up of small icons (ideograms) that could be seen as graphic
zones in Western documents. (The maximal character size of 72 points also holds
for the Asian languages.)
Thirdly, the text orientation may be different: the text may run from top to
bottom, from right to left. Readiris adapts itself by sorting the text blocks from
right to left!
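
To picture what sorting text blocks from right to left means, consider the sketch below. It assumes each text block is described by the position of its top-left corner in pixels. The `TextBlock` type and the `reading_order` function are hypothetical names chosen for this illustration, and the ordering rule is a simplification, not the actual Readiris page analysis.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    label: str
    left: int   # x coordinate of the top-left corner
    top: int    # y coordinate of the top-left corner


def reading_order(blocks: list, vertical: bool) -> list:
    """Sort blocks for vertical (right-to-left) or horizontal text."""
    if vertical:
        # Rightmost block first; within one column, top to bottom.
        return sorted(blocks, key=lambda b: (-b.left, b.top))
    # Western order: top to bottom, then left to right.
    return sorted(blocks, key=lambda b: (b.top, b.left))


if __name__ == "__main__":
    page = [
        TextBlock("A", left=400, top=50),
        TextBlock("B", left=50, top=50),
        TextBlock("C", left=400, top=300),
    ]
    print([b.label for b in reading_order(page, vertical=True)])
    print([b.label for b in reading_order(page, vertical=False)])
```

With vertical text, the two blocks on the right-hand side of the page come before the block on the left; with horizontal text, the top row is read first.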
Proceed as usual by clicking the button "Recognize-Save" on the main toolbar.
Don't try to set the document characteristics such as font type or character pitch
- these don't apply to Asian documents anyway.
Indicate the formatting options before you execute the OCR. As with
Western documents, you can apply autoformatting and recognize tables! However, the number of text filters adapted to the Asian alphabets is somewhat smaller
than the number of text formats available for Western documents.
Generally speaking, the formatting options shouldn't bother you: if an unavailable option is currently enabled and you enable an Asian language, Readiris prompts
you to select an available option first.
Also note that support for a special application for Korean documents, the
Hanguel word processor, was added!
Learning is disabled as soon as you activate an Asian language, so you'll
never enter the interactive phase at the end of the recognition. Learning hardly
makes sense for these languages, which use thousands of different symbols, and
you'd have to be able to enter the ideograms, not an easy task when using a
Western keyboard!
Let's see what our Japanese text looks like.
The orientation of your source document is maintained throughout the
recognition.
Also notice that these Asian texts may contain Western symbols - numbers,
untranscribable proper names etc. As with the Greek-English and Cyrillic-English
language settings, Readiris in fact uses a mixed alphabet to encode these
documents.