XML
Dr. Koyel Datta Gupta
XML
XML is a meta-markup language
for text documents and textual data.
XML lets you define languages
("applications") to represent text
documents and textual data.
Possible Advantages of
Using XML
Truly Portable Data
Easily readable by human users
Very expressive (semantics near data)
Very flexible and customizable (no finite
tag set)
Easy to use from programs (libs available)
Easy to convert into other representations
(XML transformation languages)
Many additional standards and tools
Widely used and supported
Example of an XML Document
<?xml version="1.0"?>
<address>
<name>Alice Lee</name>
<email>[email protected]</email>
<phone>212-346-1234</phone>
<birthday>1985-03-22</birthday>
</address>
Difference Between HTML and
XML
HTML tags have a fixed meaning and
browsers know what it is.
XML tags are different for different
applications, and users know what
they mean.
HTML tags are used for display.
XML tags are used to describe
documents and data.
XML Rules
Tags are enclosed in angle brackets.
Tags come in pairs with start-tags
and end-tags.
Tags must be properly nested.
<name><email>…</name></email> is not
allowed.
<name><email>…</email></name> is.
Tags that do not have end-tags must
be terminated by a ‘/’.
<br /> is an HTML example.
More XML Rules
Tags are case sensitive.
<address> is not the same as <Address>
Names that begin with "xml", in any combination
of cases, are reserved and may not be used for tags.
Tag names may not contain '<' or '&'.
Tag names resemble Java identifiers,
except that colons (reserved for namespaces),
hyphens, and periods are also allowed. They must
begin with a letter or an underscore and may not
contain white space.
Documents must have a single root element
that encloses all other elements.
Encoding
XML (like Java) uses Unicode to encode
characters.
Unicode has several encodings. The most
common one used in the West is UTF-8.
UTF-8 is a variable-length code. Characters are
encoded in 1 to 4 bytes.
The first 128 characters in Unicode are ASCII. UTF-8
encodes each of them in a single byte.
The code points between 128 and 255 cover some of the
more common characters used in western Europe, such
as ã, á, å, or ç. UTF-8 encodes them in two bytes.
Two-byte codes also cover alphabets such as Greek,
Cyrillic, Hebrew, and Arabic.
Three-byte codes cover the rest of the Basic
Multilingual Plane, including most Asian ideographs.
Four-byte codes handle the characters that are left.
Those working mostly with non-Western languages may
want to investigate other Unicode encodings, such as UTF-16.
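The variable length of UTF-8 is easy to observe directly. The following is a minimal Python sketch; the helper name `utf8_length` and the sample characters are chosen here for illustration and are not part of the slides.

```python
def utf8_length(character):
    """Return the number of bytes UTF-8 uses for one character."""
    return len(character.encode("utf-8"))


# Sample characters (illustrative choices), written as escapes so the
# source file itself stays plain ASCII.
samples = {
    "A": "ASCII letter",
    "\u00e7": "c with cedilla, in the 128-255 range",
    "\u4e2d": "CJK ideograph",
    "\U0001F600": "character outside the Basic Multilingual Plane",
}

for character, description in samples.items():
    size = utf8_length(character)
    print(f"U+{ord(character):04X} {description}: {size} byte(s)")
```

The four samples need 1, 2, 3, and 4 bytes respectively.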
Well-Formed Documents
An XML document is said to be well-formed if it
follows all the rules.
An XML parser is used to check that all the rules
have been obeyed.
Recent browsers come with XML parsers.
Java 1.4 also supports an open-source parser.
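Any XML parser can serve as a well-formedness checker. The sketch below uses the parser in Python's standard library (`xml.etree.ElementTree`); the helper name `is_well_formed` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested tags are accepted.
print(is_well_formed("<name><email>x</email></name>"))
# Overlapping tags are rejected.
print(is_well_formed("<name><email>x</name></email>"))
# Tags are case sensitive, so these two do not match.
print(is_well_formed("<address></Address>"))
# A document may have only one root element.
print(is_well_formed("<name/><email/>"))
```

Only the first call returns True; the other three break the nesting, case, and single-root rules.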
XML Example Revisited
<?xml version="1.0"?>
<address>
<name>Alice Lee</name>
<email>[email protected]</email>
<phone>212-346-1234</phone>
<birthday>1985-03-22</birthday>
</address>
Markup for the data aids understanding of its
purpose.
A flat text file is not nearly so clear.
Alice Lee
[email protected]
212-346-1234
1985-03-22
The last line looks like a date, but what is it for?
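Because every value is labeled, a program can ask for a field by name instead of guessing from its position. A minimal Python sketch, assuming the address document is held in a string (the variable names are illustrative; the email text is copied as it appears in the slides):

```python
import xml.etree.ElementTree as ET

# The address document from the slide, held in a string for this sketch.
DOCUMENT = """<?xml version="1.0"?>
<address>
<name>Alice Lee</name>
<email>[email protected]</email>
<phone>212-346-1234</phone>
<birthday>1985-03-22</birthday>
</address>"""

address = ET.fromstring(DOCUMENT)

# Map each tag name to its text, so the purpose of every value is explicit.
fields = {child.tag: child.text for child in address}

print(fields["birthday"])         # the date is now known to be a birthday
print(address.findtext("phone"))  # look up a single field by its tag name
```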
Expanded Example
<?xml version = "1.0" ?>
<address>
  <name>
    <first>Alice</first>
    <last>Lee</last>
  </name>
  <email>[email protected]</email>
  <phone>123-45-6789</phone>
  <birthday>
    <year>1983</year>
    <month>07</month>
    <day>15</day>
  </birthday>
</address>
XML Files are Trees
address
  name
    first
    last
  email
  phone
  birthday
    year
    month
    day
XML Trees
An XML document has a single root
node.
The tree is a general ordered tree.
A parent node may have any number of
children.
Child nodes are ordered, and may have
siblings.
Preorder traversals are usually used
for getting information out of the
tree.
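A preorder traversal visits a node before its children, which is the same as document order. A minimal Python sketch using `Element.iter()` from the standard library, applied to the expanded address example (the email element is left empty here because its text does not matter for the traversal):

```python
import xml.etree.ElementTree as ET

# The expanded address example, with the email text omitted.
DOCUMENT = """<address>
  <name><first>Alice</first><last>Lee</last></name>
  <email/>
  <phone>123-45-6789</phone>
  <birthday><year>1983</year><month>07</month><day>15</day></birthday>
</address>"""

root = ET.fromstring(DOCUMENT)

# Element.iter() walks the tree in document order: each node is
# visited before its children, i.e. a preorder traversal.
preorder = [element.tag for element in root.iter()]
print(preorder)
```

The list starts with address, then name, first, last, and only then moves on to email, phone, and the birthday subtree.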
Validity
A well-formed document has a tree
structure and obeys all the XML rules.
A particular application may add more
rules in either a DTD (document type
definition) or in a schema.
Many specialized DTDs and schemas have
been created to describe particular areas.
These range from disseminating news
bulletins (RSS) to chemical formulas.
DTDs were developed first, so they are not
as comprehensive as schemas.
Document Type Definitions
A DTD describes the tree structure of
a document and something about its
data.
There are two data types, PCDATA
and CDATA.
PCDATA is parsed character data.
CDATA is character data, not usually
parsed.
A DTD determines how many times a
node may appear, and how child
nodes are ordered.
Defining Attributes in the DTD
An ATTLIST declaration defines the
attributes of an element.
The declarations below define the title,
date, and author attributes of
the song element. title is required, date
is optional, and author defaults to "unknown":
<!ELEMENT music (song)>
<!ATTLIST song title CDATA #REQUIRED
date CDATA #IMPLIED
author CDATA "unknown">
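The effect of an attribute default can be seen with a parser that reads the DTD. The sketch below uses the expat parser from Python's standard library with the declarations placed in an internal DTD subset. The song element declaration, the sample song, and the helper name `song_attributes` are assumptions added for illustration. expat does not validate, so a missing #REQUIRED attribute would not be reported.

```python
import xml.parsers.expat

# The ATTLIST from the slide inside an internal DTD subset.
# The song content model and the sample song are illustrative.
DOCUMENT = """<!DOCTYPE music [
<!ELEMENT music (song)>
<!ELEMENT song (#PCDATA)>
<!ATTLIST song title  CDATA #REQUIRED
               date   CDATA #IMPLIED
               author CDATA "unknown">
]>
<music><song title="Example Song">la la</song></music>"""


def song_attributes(document):
    """Collect the attributes the parser reports for each song element."""
    found = []

    def start_element(name, attributes):
        if name == "song":
            found.append(dict(attributes))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(document, True)
    return found


print(song_attributes(DOCUMENT))
```

The parser reports the author attribute with its default value even though the document never sets it. The optional date attribute is simply absent.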
Defining Entities in the DTD
So far, you've seen predefined
entities like &amp; and you've seen
that an attribute can reference an
entity. It's time now for you to learn
how to define entities of your own.
<!ENTITY entity-name "entity-value">
DTD for address Example
<!ELEMENT address (name, email, phone,
birthday)>
<!ELEMENT name (first, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT birthday (year, month, day)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>
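Python's standard library has no validating parser, so the following is only a hand-written sketch of what this DTD requires, not real DTD validation. The names `CONTENT_MODEL`, `TEXT_ONLY`, and `matches_model` are hypothetical. The sketch checks that each element has exactly the declared children in the declared order.

```python
import xml.etree.ElementTree as ET

# Hand-copied from the DTD above: the required children, in order.
CONTENT_MODEL = {
    "address": ["name", "email", "phone", "birthday"],
    "name": ["first", "last"],
    "birthday": ["year", "month", "day"],
}
# Elements declared as (#PCDATA): text only, no child elements.
TEXT_ONLY = {"first", "last", "email", "phone", "year", "month", "day"}


def matches_model(element):
    """Return True if the element and its descendants fit the model."""
    children = [child.tag for child in element]
    if element.tag in TEXT_ONLY:
        return not children
    if element.tag not in CONTENT_MODEL:
        return False  # element not declared at all
    if children != CONTENT_MODEL[element.tag]:
        return False  # wrong children or wrong order
    return all(matches_model(child) for child in element)


GOOD = ("<address><name><first>Alice</first><last>Lee</last></name>"
        "<email>e</email><phone>p</phone>"
        "<birthday><year>1983</year><month>07</month><day>15</day></birthday>"
        "</address>")
# Same data, but last comes before first.
BAD = GOOD.replace("<first>Alice</first><last>Lee</last>",
                   "<last>Lee</last><first>Alice</first>")

print(matches_model(ET.fromstring(GOOD)))
print(matches_model(ET.fromstring(BAD)))
```

Both documents are well-formed, but only the first one follows the order the DTD prescribes.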
ENTITY example
● subject_name.dtd
<!ELEMENT subject_name (#PCDATA)>
<!ENTITY WT "WEB TECHNOLOGY">
● subject.xml
<!DOCTYPE subject_name SYSTEM "subject_name.dtd">
<subject_name>&WT;</subject_name>
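Entity expansion can be tried out with Python's standard library parser. That parser does not load external DTD files, so this sketch moves the declarations into an internal DTD subset. That is a change from the two-file layout above, made only for illustration.

```python
import xml.etree.ElementTree as ET

# Same declarations as subject_name.dtd, but placed in an internal
# subset because ElementTree does not fetch external DTD files.
DOCUMENT = """<!DOCTYPE subject_name [
<!ELEMENT subject_name (#PCDATA)>
<!ENTITY WT "WEB TECHNOLOGY">
]>
<subject_name>&WT;</subject_name>"""

root = ET.fromstring(DOCUMENT)

# The parser has already replaced &WT; with the entity's value.
print(root.text)
```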
XSLT
Extensible Stylesheet Language
Transformations
XSLT is used to transform one XML
document into another, often an HTML
document.
The transformation classes (javax.xml.transform)
have been part of Java since version 1.4.
An XSLT processor takes as input one
XML document and a style sheet and
produces as output another document.
If the resulting document is in HTML, it can
be viewed by a web browser.
This is a good way to display XML data.
A Style Sheet to Transform
address.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="address">
    <html><head><title>Address Book</title></head>
    <body>
      <xsl:value-of select="name"/>
      <br/><xsl:value-of select="email"/>
      <br/><xsl:value-of select="phone"/>
      <br/><xsl:value-of select="birthday"/>
    </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
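Python's standard library has no XSLT processor, so the sketch below is not XSLT. It is a hand-written Python function (`address_to_html`, a hypothetical name) that produces roughly the HTML this style sheet would, to show what the transformation does. Like xsl:value-of, it takes the text content of each selected element.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def string_value(element):
    """Concatenate all text inside an element, as xsl:value-of does."""
    return "".join(element.itertext()) if element is not None else ""


def address_to_html(xml_text):
    """Imitate the style sheet: one line per field, separated by <br/>."""
    address = ET.fromstring(xml_text)
    fields = ["name", "email", "phone", "birthday"]
    lines = [escape(string_value(address.find(field))) for field in fields]
    return ("<html><head><title>Address Book</title></head>\n<body>\n"
            + "\n<br/>".join(lines)
            + "\n</body>\n</html>")


# Simple address document (the email text is a stand-in for this sketch).
DOCUMENT = ("<address><name>Alice Lee</name><email>e</email>"
            "<phone>212-346-1234</phone><birthday>1985-03-22</birthday>"
            "</address>")

html = address_to_html(DOCUMENT)
print(html)
```

In a real setup the style sheet itself would be handed to an XSLT processor, and no such function would need to be written by hand.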