TEXT REPRESENTATION
Chapter-03
WHAT IS DATA?
• Data is a piece of information, usually formatted in a
special way.
• This information may be in the form of text documents,
images, audio clips, software programs, etc.
EXAMPLES OF STANDARDS
Type of Data             Standards
Alphanumeric             ASCII, EBCDIC, Unicode
Image                    JPEG, GIF, PCX, TIFF
Motion picture           MPEG-2, QuickTime
Sound                    Sound Blaster, WAV, AU
Outline graphics/fonts   PostScript, TrueType, PDF
EXAMPLES OF STANDARDS
• Microsoft Word produces formatted text and creates documents in DOCX
format.
• Apple Pages produces documents in PAGES format.
• Adobe Acrobat produces documents in PDF format.
• The HTML markup language, used for Web pages, produces documents in HTML
format.
WHY STANDARDS?
• They exist because they are
• Convenient
• Efficient
• Flexible
• Appropriate
• Etc.
BINARY CODES
NUMBER OF BITS IN BINARY CODES
• The number of possible bit patterns (symbols), M, that can be
made with N bits is given by:
M = 2^N
• Conversely, the number of bits needed to construct M distinct
symbols is given by:
N = log2(M) ≈ 3.32 × log10(M)
(Note: N must be rounded up to the next integer)
• Ex: for M = 26, what is the minimum number of bits?
N = log2(26) ≈ 3.32 × log10(26) ≈ 4.7, rounded up to 5 bits
(A short code sketch of this calculation follows.)
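• A minimal Python sketch of the calculation above (the function name
bits_needed is just an illustrative choice):

    import math

    def bits_needed(m):
        # Minimum number of bits that gives each of m symbols a unique pattern
        return math.ceil(math.log2(m))

    print(bits_needed(26))   # 5, since 2**4 = 16 < 26 <= 32 = 2**5
    print(2 ** 5)            # 32 possible patterns, so 6 go unused for 26 letters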
BITS AND BYTES
• All of the data stored and transmitted by digital devices is encoded as bits.
• Terminology related to bits and bytes is extensively used to describe storage
capacity and network access speed.
• The word bit, an abbreviation for binary digit, can be further abbreviated as
a lowercase b.
• A group of eight bits is called a byte and is usually abbreviated as an
uppercase B.
BITS AND BYTES
• When reading about digital devices, you’ll
frequently encounter references such as 90
kilobits per second, 1.44 megabytes, 2.8
gigahertz, and 2 terabytes.
• Kilo, mega, giga, tera, and similar prefixes are used
to quantify digital data, as in the sketch below.
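• A small Python sketch of those quantities, assuming the decimal meaning
of the prefixes (kilo = 10^3, mega = 10^6, and so on; some storage contexts
use binary multiples such as 2^10 and 2^20 instead):

    KILO, MEGA, GIGA, TERA = 10**3, 10**6, 10**9, 10**12
    BITS_PER_BYTE = 8

    print(90 * KILO)                     # 90 kilobits per second = 90,000 bits per second
    print(1.44 * MEGA)                   # 1.44 megabytes = 1,440,000 bytes
    print(1.44 * MEGA * BITS_PER_BYTE)   # ... or 11,520,000 bits
    print(2 * TERA)                      # 2 terabytes = 2,000,000,000,000 bytes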
BINARY CODED DECIMAL FORMAT
• It’s a way to represent decimal numbers directly in
binary without actually converting the number as a
whole to binary.
• The 10 decimal digits require a 4-bit code.
This coding is called Binary Coded
Decimal (BCD).
• The BCD code is simply the 4-bit binary
representation of each decimal digit.
• 6 of the 16 possible 4-bit patterns
(1010 through 1111) are not used.
EXAMPLE
• 7093₁₀ = ? (in BCD)
   7    0    9    3
0111 0000 1001 0011
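• A minimal Python sketch of BCD encoding (the helper name to_bcd is just
an illustrative choice):

    def to_bcd(number):
        # Each decimal digit becomes its own 4-bit group
        return " ".join(format(int(digit), "04b") for digit in str(number))

    print(to_bcd(7093))   # 0111 0000 1001 0011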
ALPHANUMERIC DATA
• How do you handle alphanumeric data?
• Alphanumeric – consisting of both letters and
numerals
• Easy answer! Formulate a binary code to represent
each character.
– For the 26 letters of the alphabet, 5 bits would be needed.
– But what about upper case and lower case letters, the
digits, and special characters? (See the sketch below.)
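• A small Python sketch of how the required number of bits grows as the
character set grows (the symbol counts are illustrative; 95 is the number
of printable ASCII characters):

    import math

    symbol_counts = {
        "upper-case letters only": 26,
        "upper and lower case": 52,
        "letters and digits": 62,
        "printable characters": 95,
    }

    for description, count in symbol_counts.items():
        print(description, "->", math.ceil(math.log2(count)), "bits")
    # 26 symbols fit in 5 bits, but 95 symbols already need 7 bits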
CODE SYSTEMS FOR ALPHANUMERIC
DATA
• Various code systems are used to represent Alphanumeric symbols:
1. ASCII (American Standard Code for Information Interchange)
2. Extended ASCII
3. EBCDIC (Extended Binary Coded Decimal Interchange Code)
4. Unicode (Universal Code)
ASCII
• ASCII stands for American
Standard Code for
Information Interchange
• The code uses 7 bits to
encode 128 unique characters;
a few are shown in the sketch below.
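• A quick Python sketch for inspecting 7-bit ASCII codes (ord() returns a
character's numeric code):

    for ch in "Az0 ":
        print(repr(ch), ord(ch), format(ord(ch), "07b"))
    # 'A' 65 1000001
    # 'z' 122 1111010
    # '0' 48 0110000
    # ' ' 32 0100000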
EXTENDED ASCII
• It was introduced to make the bit-pattern length equal to 8 bits (a byte)
by adding a bit to the left of the 7-bit ASCII code representation.
Ex.: if the ASCII code is 1111111, the extended ASCII code is 01111111.
• Using eight bits instead of seven bits allows Extended ASCII to
provide codes for 256 characters.
• “Extended ASCII” codes start with a one-valued bit; these codes are
not standard and vary in meaning among different manufacturers
and equipment (see the sketch below).
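• A short Python sketch, assuming ISO-8859-1 (Latin-1) as one common
extended-ASCII variant:

    # Standard ASCII fits in 7 bits, so the leading bit of the byte is 0
    print(format(ord("A"), "08b"))                  # 01000001

    # Codes 128-255 start with a 1 bit and differ between extended-ASCII
    # variants; in ISO-8859-1 the character 'é' has code 233
    print(format("é".encode("latin-1")[0], "08b"))  # 11101001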
UNICODE
• Unicode can represent most of the world's characters in
modern computer use, including technical symbols and
special characters used in publishing.
• One Universal Code for every character
• no matter what the platform is
• no matter what the program is
• no matter what the language is
• It is a superset of ASCII
UNICODE
• The standard is maintained by the Unicode Consortium
• As of May 2019, the most recent version, Unicode 12.1, contains a
repertoire of 137,994 characters covering 150 modern and
historic scripts, as well as multiple symbol sets and emoji.
• Unicode can be implemented by different character encodings.
• The Unicode standard defines UTF-8, UTF-16, and UTF-32, and
several other encodings are in use (compared in the sketch below).
• UTF - Unicode Transformation Format
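• A small Python sketch comparing the three UTF encodings (the explicit
big-endian codecs are used so no byte-order mark is added):

    for text in ("A", "é", "€"):
        print(text,
              text.encode("utf-8").hex(),
              text.encode("utf-16-be").hex(),
              text.encode("utf-32-be").hex())
    # A (U+0041): utf-8 = 41 (1 byte, same value as ASCII), utf-16 = 0041, utf-32 = 00000041
    # é (U+00E9): utf-8 = c3a9 (2 bytes),                   utf-16 = 00e9, utf-32 = 000000e9
    # € (U+20AC): utf-8 = e282ac (3 bytes),                 utf-16 = 20ac, utf-32 = 000020ac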