XML Syntax
Rules of XML Language
Sep-2009
© 2008 MindTree Consulting
Agenda
Need for XML
Quiz
Need For XML
Revision of previous session
Quiz
A First Look at XML
The idea behind XML is deceptively simple. It aims at answering the
conflicting demands that arrive at the W3C for the future of HTML.
On one hand, people need more tags. And these new tags are
increasingly specialized. For example, mathematicians want tags
for formulas. Chemists also want tags for formulas but they are not
the same.
On the other hand, authors and developers want fewer tags.
HTML is already so complex! As handheld devices gain in
popularity, the need for a simpler markup language also is apparent
because small devices, like the PalmPilot, are not powerful enough
to process HTML pages.
XML
How can you have both more tags and fewer tags in a single
language?
To resolve this dilemma, XML makes essentially two changes to HTML:
It predefines no tags.
It is stricter.
What is Markup
In an electronic document, markup is the set of codes, embedded
in the document text, that store the information required for
electronic processing, such as font name, boldness or, in the case of
XML, the document structure. This is not specific to XML: every
electronic document standard uses some sort of markup.
Applications of XML
Publishing
XML is being used by an increasing number of publishers as the format
for documents.
For example, an XML document for a monthly newsletter uses
elements for the title, abstract, paragraphs, and other concepts common
in publishing.
Business Document Exchange
For example, an order can be placed in XML rather than on paper. The
advantage is that software can process it: an application could read the
order and automatically fulfill it.
RSS / Atom
e.g. Bloglines
XML Introduction - Quiz
Basic questions on XML Introduction
XML Introduction - Quiz
XML stands for
XML is about the description of data, and not its presentation.
XML allows us to define our own tags, so we can create our own
markup languages.
The XML specification is owned by the W3C.
XML is designed to be both machine readable and human readable.
XML provides a platform-neutral, language-independent means of
describing data.
Obviously, it’s the markup that differentiates the XML document
from plain text.
The XML Syntax
Start & End Tags, Elements, Element Nesting, XML Names, Attributes,
XML Declaration, Entities, CDATA, Comments, Processing Instructions,
Well-formed XML
XML - Example
Listing 2.1: An Address Book in XML
<?xml version="1.0"?>
<address-book>
<entry><name>John Doe</name>
<address><street>34 Fountain Square Plaza</street>
<region>OH</region><postal-code>45202</postal-code>
<locality>Cincinnati</locality><country>US</country>
</address>
<tel preferred="true">513-555-8889</tel>
<tel>513-555-7098</tel>
<email href="mailto:[email protected]"/>
</entry>
<entry><name><fname>Jack</fname><lname>Smith</lname></name>
<tel>513-555-3465</tel>
<email href="mailto:[email protected]"/>
</entry>
</address-book>
Element’s Start and End Tags
The building block of XML is the element: XML documents are made up
of elements. Each element has a name and content.
<tel>513-555-7098</tel>
The content of an element is delimited by markup known as the
start tag and the end tag.
Unlike HTML, both start and end tags are required. The following is
not correct in XML:
<tel>513-555-7098
Names in XML
Element names must follow certain rules. As we will see, there are other
names in XML that follow the same rules.
Names in XML must start with either a letter or the underscore character
(“_”). The rest of the name consists of letters, digits, the underscore
character, the dot (“.”), or a hyphen (“-”). Spaces are not allowed in
names. (The colon is also permitted by the syntax, but it is reserved for
namespaces.)
Finally, names cannot start with the string “xml”, in any combination of
upper and lower case, which is reserved for the XML specification itself.
Unlike HTML, names are case sensitive in XML.
By convention, XML elements are frequently written in lowercase. When a
name consists of several words, the words are usually separated by a
hyphen, as in address-book or written as AddressBook. Choose the
convention that works best for you but try to be consistent.
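As a rough way to try out the naming rules above, the sketch below asks a parser whether it accepts a candidate element name. It assumes Python's standard `xml.etree.ElementTree` module, which these slides do not mention; the helper `parser_accepts_name` and the sample names are hypothetical. The check is only an approximation: a parser may accept names beginning with "xml" even though they are reserved.

```python
import xml.etree.ElementTree as ET


def parser_accepts_name(name: str) -> bool:
    """Return True if the parser accepts `name` as an element name.

    The candidate is wrapped in an empty-element tag and parsed. This is a
    sketch, not a full implementation of the XML name grammar.
    """
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False


# Hypothetical sample names, chosen only to exercise the rules above.
candidates = ["address-book", "_private", "item.2", "2fast", "first name"]
results = {name: parser_accepts_name(name) for name in candidates}
for name, accepted in results.items():
    print(f"{name!r}: {'accepted' if accepted else 'rejected'}")
```

Names that start with a digit or contain a space are rejected; the others pass.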
Names in XML - Quiz
Which of the following element names are valid in XML, and which
are invalid? <copyright-information> <p> <base64> <décompte.client>
<firstname> <123> <first name> <tom&jerry>
Attributes
It is possible to attach additional information to elements in the form of
attributes.
Attributes have a name and a value. The names follow the same rules as
element names.
The syntax is similar to HTML. Elements can have one or more attributes in
the start tag, and the name is separated from the value by the equals
sign.
The value of the attribute is enclosed in double or single quotation marks.
For example, the tel element can have a preferred attribute:
<tel preferred="true">513-555-8889</tel>
Unlike HTML, XML insists on the quotation marks. The XML processor would
reject the following:
<tel preferred=true>513-555-8889</tel>
Attributes - Quiz
Correct / Incorrect
<confidentiality level="I don't know">
This document is not confidential.
</confidentiality>
or
<confidentiality level='approved "for your eyes only"'>
This document is top-secret
</confidentiality>
Attribute names without values are not allowed. (Yes / No)
Attribute values without delimiters are not allowed. (Yes / No)
Only one instance of an attribute name is allowed within a given tag. (Yes /
No)
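The attribute rules can be checked with a parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module (not part of the original slides); the helper name `is_rejected` and the `note` attribute are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_rejected(text: str) -> bool:
    """Return True if the parser refuses `text` as not well formed."""
    try:
        ET.fromstring(text)
        return False
    except ET.ParseError:
        return True


# Attribute values are available as a name -> value mapping.
tel = ET.fromstring('<tel preferred="true">513-555-8889</tel>')
print(tel.attrib)            # mapping with the single key "preferred"
print(tel.get("preferred"))  # value of that attribute

# Single quotes may delimit a value that itself contains double quotes.
quoted = ET.fromstring("<tel note='say \"hello\"'>513-555-7098</tel>")
print(quoted.get("note"))

# The same attribute name twice in one start tag is an error.
duplicate_rejected = is_rejected('<tel preferred="true" preferred="false"/>')
print(duplicate_rejected)
```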
Empty Element
Elements that have no content are known as empty elements.
Usually, they are included in the document for the value of their
attributes.
There is a shorthand notation for empty elements: the start and
end tags merge, and the slash from the end tag is placed at the end
of the merged tag.
For XML, the following two elements are identical:
<email href="mailto:[email protected]"/>
<email href="mailto:[email protected]"></email>
Quiz
An empty element tag can have attributes. ( Yes / no)
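To see that the two notations for an empty element are equivalent, one can parse both and compare the results. This is a sketch assuming Python's standard `xml.etree.ElementTree` module; the address `jdoe@example.com` is a placeholder, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Placeholder address, used only for illustration.
short_form = ET.fromstring('<email href="mailto:jdoe@example.com"/>')
long_form = ET.fromstring('<email href="mailto:jdoe@example.com"></email>')

# Compare name, attributes, text content and number of children.
equivalent = (
    short_form.tag == long_form.tag
    and short_form.attrib == long_form.attrib
    and short_form.text == long_form.text
    and len(short_form) == len(long_form) == 0
)
print(equivalent)
```

The example also shows an empty element carrying an attribute.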
Nesting of Elements
Element content is not limited to text; elements can contain other
elements that in turn can contain text or elements and so on.
An XML document is a tree of elements. There is no limit to the depth of
the tree, and elements can repeat. As you see in Listing 2.1, there are two
entry elements in the address-book element. The entry for John Doe has
two tel elements. Figure 2.1 is the tree of Listing 2.1. [Refer: XML Example
slide]
An element that is enclosed in another element is called a child. The
element it is enclosed in is its parent.
<name>
<fname>Jack</fname>
<lname>Smith</lname>
</name>
Start and end tags must always be balanced and children are always
completely enclosed in their parents. Is the following legal or illegal?
<name><fname>Jack</fname><lname>Smith</name></lname>
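The parent/child relationship can be explored by walking the element tree. The sketch below assumes Python's standard `xml.etree.ElementTree` module; the helper `outline` is hypothetical. It prints the name element with its children indented, then checks how the parser treats the overlapping tags from the question above.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and all its descendants."""
    rows = [(depth, element.tag)]
    for child in element:  # iterating an element yields its children
        rows.extend(outline(child, depth + 1))
    return rows


name = ET.fromstring("<name><fname>Jack</fname><lname>Smith</lname></name>")
tree_rows = outline(name)
for depth, tag in tree_rows:
    print("  " * depth + tag)

# The end tags are in the wrong order, so lname is not enclosed in name.
try:
    ET.fromstring("<name><fname>Jack</fname><lname>Smith</name></lname>")
    overlap_accepted = True
except ET.ParseError:
    overlap_accepted = False
print(overlap_accepted)
```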
Root
At the root of the document there must be one and only one
element. In other words, all the elements in the document must be
the children of a single element.
Quiz: Is the following example legal or illegal?
<?xml version="1.0"?>
<entry>
<name>John Doe</name>
<email href="mailto:[email protected]"/>
</entry>
<entry>
<name>Jack Smith</name>
<email href="mailto:[email protected]"/>
</entry>
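The single-root rule can be tested directly. The following sketch assumes Python's standard `xml.etree.ElementTree` module; the helper `parses` is hypothetical and the documents are shortened versions of the quiz example.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if `text` is accepted by the parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


two_roots = "<entry><name>John Doe</name></entry><entry><name>Jack Smith</name></entry>"

# Wrapping the entries in one enclosing element gives the document a single root.
one_root = "<address-book>" + two_roots + "</address-book>"

two_roots_ok = parses(two_roots)
one_root_ok = parses(one_root)
print(two_roots_ok, one_root_ok)
```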
XML Declaration
The XML declaration is the first line of the document. The
declaration identifies the document as an XML document. The
declaration also lists the version of XML used in the document.
<?xml version="1.0"?>
The declaration can contain other attributes to support other
features such as character set encoding.
The XML declaration is optional.
If the declaration is included, however, it must start on the first
character of the first line of the document. The XML
recommendation suggests you include the declaration in every XML
document.
XML Declaration – Stand-alone document
If an XML document can be read with no reference to external sources, it is said to
be a stand-alone document. Such documents can be annotated with a standalone
attribute with a value of yes in the XML declaration. If an XML document requires
external sources to be resolved to parse correctly and/or to construct the entire
data tree (for example, a document with references to external general entities),
then it is not a stand-alone document. Such documents may be marked
standalone='no', but because this is the default, such an annotation rarely appears in
XML documents.
XML declarations
<?xml version='1.0' ?>
<?xml version='1.0' encoding='US-ASCII' ?>
<?xml version='1.0' encoding='US-ASCII' standalone='yes' ?>
<?xml version='1.0' encoding='UTF-8' ?>
<?xml version='1.0' encoding='UTF-16' ?>
<?xml version='1.0' encoding='ISO-10646-UCS-2' ?>
<?xml version='1.0' encoding='ISO-8859-1' ?>
<?xml version='1.0' encoding='Shift_JIS' ?>
Comments
To insert comments in a document, enclose them between “<!--”
and “-->”.
Comments are used for notes, indication of ownership, and more.
They are intended for the human reader and they are ignored by
the XML processor.
<!-- This is a comment -->
Comments cannot be inserted inside a tag. They must appear
before or after the tag.
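A short sketch of how a parser treats comments, assuming Python's standard `xml.etree.ElementTree` module (with its default settings, comments are dropped from the resulting tree). The `note` element is a made-up example.

```python
import xml.etree.ElementTree as ET

# A comment placed between tags is skipped by the default parser.
note = ET.fromstring("<note><!-- internal reminder -->Call back</note>")
print(note.text)   # only the character data remains
print(len(note))   # number of child elements

# A comment placed inside a start tag is not well formed.
try:
    ET.fromstring("<note <!-- misplaced --> >Call back</note>")
    misplaced_accepted = True
except ET.ParseError:
    misplaced_accepted = False
print(misplaced_accepted)
```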
Unicode
Characters in XML documents follow the Unicode standard.
XML uses the Unicode (ISO/IEC 10646) character set; it is not limited to
16-bit characters.
Every XML processor must recognize the UTF-8 and UTF-16 encodings.
Most processors support other encodings. In particular, for Western
European languages, they support ISO 8859-1 (the official name for Latin-1).
Documents that use an encoding other than UTF-8 or UTF-16 must start with
an XML declaration. The declaration must have an encoding attribute to
announce the encoding used. For example, a document written in Latin-1
(such as with Windows Notepad) could use the following declaration:
<?xml version="1.0" encoding="ISO-8859-1"?>
<entrée>
<nom>José Dupont</nom>
<email href="mailto:[email protected]"/>
</entrée>
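The effect of the encoding attribute can be observed by parsing raw bytes. This sketch assumes Python's standard `xml.etree.ElementTree` module and uses a shortened version of the example above. The same Latin-1 bytes are parsed once with and once without the declaration.

```python
import xml.etree.ElementTree as ET

# Latin-1 bytes with a declaration announcing the encoding.
declared = '<?xml version="1.0" encoding="ISO-8859-1"?><nom>José Dupont</nom>'
latin1_bytes = declared.encode("iso-8859-1")
nom = ET.fromstring(latin1_bytes)
print(nom.text)

# The same content as Latin-1 bytes, but without a declaration.
# The parser then assumes UTF-8, and the byte for "é" is invalid in UTF-8.
undeclared_bytes = "<nom>José Dupont</nom>".encode("iso-8859-1")
try:
    ET.fromstring(undeclared_bytes)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)
```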
XML Declaration - Quiz
How can the XML processor read the encoding parameter? To reach
the encoding parameter, the processor must read the
declaration. However, to read the declaration, the processor needs
to know which encoding is being used.
What about documents that have no declaration (since the
declaration is optional)?
Entities
XML organizes documents physically in entities. In some cases,
entities are equivalent to files; in others, they are not.
Entities are inserted in the document through entity references
(the name of the entity between an ampersand character and a
semicolon).
For the application, the entity reference is replaced by the content
of the entity.
If we assume we have defined an entity “us,” which has the value
“United States,” the following two lines are equivalent:
<country>&us;</country>
<country>United States</country>
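The slide does not show how the entity “us” is defined. One way, assumed here, is an ENTITY declaration in the document's internal DTD subset. The sketch below uses Python's standard `xml.etree.ElementTree` module to show that the reference is replaced by the entity's content, and that an undefined entity is an error.

```python
import xml.etree.ElementTree as ET

# Assumption: the entity is declared in an internal DTD subset.
with_entity = (
    '<!DOCTYPE country [<!ENTITY us "United States">]>'
    "<country>&us;</country>"
)
literal = "<country>United States</country>"

expanded_text = ET.fromstring(with_entity).text
literal_text = ET.fromstring(literal).text
print(expanded_text == literal_text)

# Without the declaration, the reference cannot be resolved.
try:
    ET.fromstring("<country>&us;</country>")
    undefined_accepted = True
except ET.ParseError:
    undefined_accepted = False
print(undefined_accepted)
```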
Predefined Entities in XML
XML predefines entities for the characters used in markup (angle brackets,
quotes, and so on). The entities are used to escape these characters in
element or attribute content. The entities are
&lt; left angle bracket “<” must be escaped with &lt;
&amp; ampersand “&” must be escaped with &amp;
&gt; right angle bracket “>” must be escaped with &gt; when it appears in the
combination ]]> outside a CDATA section (see the following)
&apos; single quote “'” can be escaped with &apos;, essentially in attribute
values
&quot; double quote “"” can be escaped with &quot;, essentially in attribute
values
Quiz – Correct / Incorrect?
<company>Mark & Spencer</company>
<company>Mark &amp; Spencer</company>
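Escaping does not have to be done by hand. As a sketch, Python's standard library (an assumption, not something these slides use) offers `xml.sax.saxutils.escape`, which replaces `&`, `<` and `>` with the predefined entities. Parsing the result gives back the original text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Mark & Spencer"
escaped = escape(raw)  # "&" becomes "&amp;"
print(escaped)

# The escaped form is well formed and round-trips to the original text.
company = ET.fromstring(f"<company>{escaped}</company>")
print(company.text)

# The unescaped form is rejected by the parser.
try:
    ET.fromstring(f"<company>{raw}</company>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False
print(raw_accepted)
```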
Character references
XML also supports character references, where a character is replaced by its
Unicode character code.
&#DecimalUnicodeValue;
Character references that start with &# provide a decimal representation of the character
code.
&#xHexadecimalUnicodeValue;
Character references that start with &#x provide a hexadecimal representation of the
character code.
Example - Character references
<?xml version='1.0' encoding='US-ASCII' ?>
<Personne occupation='&#xe9;tudiant' >
<nom>Martin</nom>
<langue>Fran&#231;ais</langue>
</Personne>
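A small sketch of character references in both directions, assuming Python's standard `xml.etree.ElementTree` module. Parsing resolves the references to characters. Serializing to US-ASCII writes non-ASCII characters back out as decimal character references.

```python
import xml.etree.ElementTree as ET

source = (
    "<Personne occupation='&#xe9;tudiant'>"
    "<nom>Martin</nom><langue>Fran&#231;ais</langue>"
    "</Personne>"
)
personne = ET.fromstring(source)

# The parser hands the application the referenced characters.
occupation = personne.get("occupation")
langue = personne.find("langue").text
print(occupation, langue)

# Writing the tree as US-ASCII turns them into character references again.
ascii_bytes = ET.tostring(personne, encoding="us-ascii")
print(ascii_bytes.decode("ascii"))
```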
Processing Instructions
Processing instructions (abbreviated PI) are a mechanism to insert
non-XML statements, such as scripts, in the document.
The processing instruction is enclosed in <? and ?>.
The first name is the target. It identifies the application or the
device to which the instructions are directed. The rest of the
processing instructions are in a format specific to the target. It
does not have to be XML.
<?xml-stylesheet href="simple-ie5.xsl" type="text/xsl"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
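The split between target and instruction data can be seen by reading a processing instruction through a DOM parser. This sketch assumes Python's standard `xml.dom.minidom` module; the root element `doc` is a placeholder added so the document has a root.

```python
from xml.dom import minidom

source = '<?xml-stylesheet href="simple-ie5.xsl" type="text/xsl"?><doc/>'
document = minidom.parseString(source)

# The processing instruction is the first child of the document node.
pi = document.firstChild
is_pi = pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE
print(is_pi)
print(pi.target)  # the application the instruction is addressed to
print(pi.data)    # the rest, in a format specific to the target
```

The data is handed over as one uninterpreted string, even though it looks like attributes here.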
CDATA Sections
As you have seen, markup characters (left angle bracket and ampersand)
that appear in the content of an element must be escaped with an entity.
For some applications, it is difficult to escape markup characters, if only
because there are too many of them. Also, it is difficult to include an XML
document in an XML document.
CDATA (Character DATA) sections are intended for these cases. CDATA
sections are delimited by “<![CDATA[” and “]]>”. The XML processor ignores
all markup inside the section except for “]]>”.
PCDATA stands for parsed character data and means the element can
contain text. #PCDATA is often (but not always) used for leaf elements.
The difference between CDATA and PCDATA is that PCDATA is parsed, so
markup characters in it must be escaped, whereas the content of a CDATA
section is taken literally.
More on CDATA Sections
Syntax
<![CDATA[…]]>
The ‘…’ section can contain any character string that does not contain
the “]]>” string literal.
May contain most markup characters.
May occur anywhere that character data may occur.
Cannot be nested.
May be empty.
Its content is not parsed as markup.
CDATA Section - Example
The following example uses a CDATA section to insert an XML
example into an XML document:
<?xml version="1.0"?>
<example>
<![CDATA[
<?xml version="1.0"?>
<entry>
<name>John Doe</name>
<email href="mailto:[email protected]"/>
</entry>]]>
</example>
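What the application receives from a CDATA section can be checked with a parser. This sketch assumes Python's standard `xml.etree.ElementTree` module and a shortened version of the example above. The markup inside the section arrives as plain text, exactly as if it had been escaped with entities.

```python
import xml.etree.ElementTree as ET

with_cdata = "<example><![CDATA[<entry><name>John Doe</name></entry>]]></example>"
with_entities = (
    "<example>&lt;entry&gt;&lt;name&gt;John Doe&lt;/name&gt;&lt;/entry&gt;</example>"
)

cdata_example = ET.fromstring(with_cdata)
entity_example = ET.fromstring(with_entities)

# The tags inside the CDATA section are text, not child elements.
print(cdata_example.text)
print(len(cdata_example))
print(cdata_example.text == entity_example.text)
```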
Well Formed XML
An XML document is well formed when its syntax conforms to the XML
specification:
Start-tags all have matching end-tags (or are empty-element tags).
Element tags do not overlap.
Attributes have unique names within an element.
Markup characters are properly escaped.
Elements form a hierarchical tree, with a single root node.
There are no references to external entities, except if a DTD is
provided.
Well formed XML - example
<?xml version="1.0" encoding="UTF-8"?>
<employees>
<employee id="IN9999">
<name>
<firstname>Suraj</firstname>
<middlename>Kumar</middlename>
<surname>Verma</surname>
</name>
<department>IT Services</department>
<project>C2</project>
<details><![CDATA[Some data here >, even these symbols don't bother it.]]></details>
</employee>
<employee id="IN9498">
<name>
<firstname>Abhi</firstname>
<surname>Dhar</surname>
</name>
<department>R&amp;D Services</department>
<project/>
</employee>
</employees>
Four Common Errors in XML Syntax
Forget End Tags
Forget That XML Is Case Sensitive
Introduce Spaces in the Name of Element
<address book>
<entry>
<name>John Doe</name>
<email href="mailto:[email protected]"/>
</entry>
</address book>
Forget the Quotes for Attribute Value
<tel preferred=true>513-555-8889</tel>
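The four errors above are all well-formedness errors, so any XML parser will report them. A minimal checker might look like the following sketch, which assumes Python's standard `xml.etree.ElementTree` module; the helper `is_well_formed` and the sample snippets are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One short snippet per common error, plus a correct one for comparison.
samples = {
    "missing end tag": "<tel>513-555-7098",
    "case mismatch": "<tel>513-555-7098</Tel>",
    "space in name": "<address book></address book>",
    "unquoted attribute": "<tel preferred=true>513-555-8889</tel>",
    "correct": '<tel preferred="true">513-555-8889</tel>',
}
report = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in report.items():
    print(f"{label}: {'well formed' if ok else 'not well formed'}")
```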
Questions
Thank you
XML Technology, Semester 4
SICSR Executive MBA(IT) @ MindTree, Bangalore, India
By Neeraj Singh (toneeraj(AT)gmail(DOT)com)