UNIT-I WebTechnology

The document provides an overview of web essentials, including the structure of the Internet, basic Internet protocols, and markup languages like XHTML. It discusses the history and development of the Internet, key protocols such as IP, TCP, and SMTP, and the differences between email retrieval protocols POP3 and IMAP. Additionally, it covers the communication between web clients and servers, emphasizing the importance of protocols in facilitating data exchange.

Uploaded by

arunasekaran
UNIT I

Web Essentials: Clients, Servers, and Communication. The Internet - Basic Internet
Protocols - The World Wide Web - HTTP request message - response message - Web
Clients - Web Servers - Case Study. Markup Languages: XHTML. An Introduction to
HTML - History - Versions - Basic XHTML Syntax and Semantics - Some Fundamental
HTML Elements - Relative URLs - Lists - Tables - Frames - Forms - XML - Creating HTML
Documents - Case Study.

1.1 Web Essentials: Clients, Servers, and Communication

The Internet
The Internet is a global system of interconnected computer networks that use the standard
Internet Protocol Suite (TCP/IP) to serve billions of users worldwide. It is a network of
networks that consists of millions of private, public, academic, business, and government
networks, of local to global scope, that are linked by a broad array of electronic and
optical networking technologies. The Internet carries a vast range of information
resources and services, such as the inter-linked hypertext documents of the World Wide
Web (WWW) and the infrastructure to support electronic mail.

The origins of the Internet reach back to the 1960s with both private and United States
military research into robust, fault-tolerant, and distributed computer networks. The
funding of a new U.S. backbone by the National Science Foundation, as well as private
funding for other commercial backbones, led to worldwide participation in the
development of new networking technologies, and the merger of many networks. The
commercialization of what was by then an international network in the mid 1990s
resulted in its popularization and incorporation into virtually every aspect of modern
human life. As of 2009, an estimated quarter of Earth's population used the services of
the Internet.

• ARPANET: one of the earliest attempts to network heterogeneous, geographically dispersed computers
• Email first became available on ARPANET in 1972 (and quickly became very popular)
• ARPANET access was limited to select DoD-funded organizations
• Open-access networks:
    o Regional university networks (e.g., SURAnet)
    o CSNET for CS departments not on ARPANET
• NSFNET (1985-1995):
    o Primary purpose: connect supercomputer centers
    o Secondary purpose: provide a backbone to connect regional networks
    o Original NSFNET backbone speed: 56 kbit/s
    o Upgraded to 1.5 Mbit/s (T1) in 1988
    o Upgraded to 45 Mbit/s (T3) in 1991

[Figure: NSFNET backbone. The 6 supercomputer centers connected by the early NSFNET backbone]

• In 1988, networks in Canada and France connected to NSFNET
• In 1990, ARPANET was decommissioned and NSFNET became the center of the Internet
• Internet: the network of networks connected via the public backbone and communicating using the TCP/IP communication protocol
• Backbone initially supplied by NSFNET; privately funded (through ISP fees) beginning in 1995

Uses:

• On-line employment
• Net banking
• On-line education system
• Making friends and participating in a discussion
• Sending and receiving electronic greetings from friends and relatives
• Going through the catalog of a library

1.2 Basic Internet Protocols

1.2.1 Internet Protocol (IP)

The Internet Protocol (IP) is the principal communications protocol used for relaying
datagrams (packets) across an internetwork using the Internet Protocol Suite. Responsible
for routing packets across network boundaries, it is the primary protocol that establishes
the Internet.

IP is the primary protocol in the Internet Layer of the Internet Protocol Suite and has the
task of delivering datagrams from the source host to the destination host solely based on
their addresses. For this purpose, IP defines addressing methods and structures for
datagram encapsulation.

Historically, IP was the connectionless datagram service in the original Transmission
Control Program introduced by Vint Cerf and Bob Kahn in 1974, the other being the
connection-oriented Transmission Control Protocol (TCP). The Internet Protocol Suite is
therefore often referred to as TCP/IP.

The first major version of IP, now referred to as Internet Protocol Version 4 (IPv4) is the
dominant protocol of the Internet, although the successor, Internet Protocol Version 6
(IPv6) is in active, growing deployment worldwide.
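
As a small illustration of the two address families, the sketch below uses Python's standard `ipaddress` module to report the IP version and address width of a textual address. The helper name `describe` and the sample addresses (taken from documentation-reserved ranges) are assumptions made for this example.

```python
import ipaddress
from typing import Tuple


def describe(address: str) -> Tuple[int, int]:
    """Return (IP version, address width in bits) for a textual address."""
    ip = ipaddress.ip_address(address)
    return ip.version, ip.max_prefixlen


print(describe("192.0.2.1"))    # an IPv4 address is 32 bits wide
print(describe("2001:db8::1"))  # an IPv6 address is 128 bits wide
```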

1.2.2 Transmission Control Protocol


When two computers wish to exchange information over a network, there are several
components that must be in place before the data can actually be sent and received. Of
course, the physical hardware must exist, which is typically either a network interface
card (NIC) or a serial communications port for dial-up networking connections.

Apart from this physical connection computers also need to use a protocol which defines
the parameters of the communication between them. In short, a protocol defines the
"rules of the road" that each computer must follow so that all of the systems in the
network can exchange data. One of the most popular protocols in use today is TCP/IP,
which stands for Transmission Control Protocol/Internet Protocol.

By convention, TCP/IP is used to refer to a suite of protocols, all based on the Internet
Protocol (IP). Unlike a single local network, where every system is directly connected to
each other, an internet is a collection of networks, combined into a single, virtual
network. The Internet Protocol provides the means by which any system on any network
can communicate with another as easily as if they were on the same physical network.

When a system sends data over the network using the Internet Protocol, it is sent in
discrete units called datagrams, also commonly referred to as packets. A datagram
consists of a header followed by application-defined data. The header contains the
addressing information which is used to deliver the datagram to its destination, much like
an envelope is used to address and contain postal mail. And like postal mail, there is no
guarantee that a datagram will actually arrive at its destination. In fact, datagrams may be
lost, duplicated or delivered out of order during their travels over the network.
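
To make the delivery problems concrete, here is a minimal sketch of how a receiver could undo duplication and reordering if every datagram carried a sequence number. The `(sequence_number, payload)` tuple layout and the function name `reassemble` are hypothetical; this is not how any real protocol stack is written, and lost datagrams are not handled.

```python
def reassemble(datagrams):
    """Rebuild the original bytes from (sequence_number, payload) pairs.

    Duplicates are dropped and the remaining payloads are joined in
    sequence-number order. Missing datagrams are simply not detected.
    """
    unique = {}
    for seq, payload in datagrams:
        unique.setdefault(seq, payload)  # keep the first copy of each number
    return b"".join(unique[seq] for seq in sorted(unique))


# Datagram 2 arrives first and is later duplicated.
received = [(2, b"lo, "), (1, b"Hel"), (3, b"world"), (2, b"lo, ")]
print(reassemble(received))  # b'Hello, world'
```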

TCP is known as a connection-oriented protocol. In other words, before two programs
can begin to exchange data they must establish a "connection" with each other. This is
done with a three-way handshake in which both sides exchange packets and establish the
initial packet sequence numbers (the sequence number is important because, as
mentioned above, datagrams can arrive out of order; this number is used to ensure that
data is received in the order that it was sent). When establishing a connection, one
program must assume the role of the client, and the other the server. The client is
responsible for initiating the connection, while the server’s responsibility is to wait, listen
and respond to incoming connections. Once the connection has been established, both
sides may send and receive data until the connection is closed.
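
The client and server roles described above can be sketched with Python's standard `socket` module. This is only an illustration on the loopback interface: the function names, the upper-casing "service", and the single small message are assumptions of the example. The three-way handshake itself is carried out by the operating system when the client connects.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Server role: wait for one connection and reply with upper-cased data."""
    conn, _addr = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


def tcp_round_trip(message: bytes) -> bytes:
    # Server side: bind to a free loopback port chosen by the OS and listen.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(listener,))
        worker.start()
        # Client side: connecting is what initiates the TCP connection.
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(message)
            reply = client.recv(1024)
        worker.join()
    return reply


print(tcp_round_trip(b"hello"))
```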

1.2.3 User Datagram Protocol

Unlike TCP, the User Datagram Protocol (UDP) does not present data as a stream of
bytes, nor does it require that you establish a connection with another program in order to
exchange information. Data is exchanged in discrete units called datagrams, which are
similar to IP datagrams. In fact, the only features that UDP offers over raw IP datagrams
are port numbers and an optional checksum.

UDP is sometimes referred to as an unreliable protocol because when a program sends a
UDP datagram over the network, there is no way for it to know that it actually arrived at
its destination. This means that the sender and receiver must typically implement their
own application protocol on top of UDP.
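
For contrast with TCP, the following sketch sends a single UDP datagram over the loopback interface using Python's `socket` module. No connection is established; the sender simply addresses the datagram to the receiver's port. The function name and the two-second timeout are choices made for this example, and on a real network the datagram could be lost without the sender noticing.

```python
import socket


def udp_round_trip(message: bytes) -> bytes:
    """Send one datagram to a local receiver and return what it received."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))   # the OS picks a free port number
        receiver.settimeout(2.0)          # do not wait forever for a lost datagram
        sender.sendto(message, receiver.getsockname())
        data, _addr = receiver.recvfrom(2048)
        return data


print(udp_round_trip(b"ping"))
```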

1.2.4 Hostnames
In order for an application to send and receive data with a remote process, it must have
several pieces of information. The first is the IP address of the system that the remote
program is running on.

Although this address is internally represented by a 32-bit number, it is typically
expressed in either dot-notation or by a logical name called a hostname. Like an address
in dot-notation, hostnames are divided into several pieces separated by periods, called
domains. Domains are hierarchical, with the top-level domains defining the type of
organization that network belongs to, with sub-domains further identifying the specific
network.

Examples of top-level domains are:

"gov" (government agencies),

"com" (commercial organizations),

"edu" (educational institutions) and

"net" (Internet service providers).

The fully qualified domain name is specified by naming the host and each parent sub-
domain above it, separating them with periods. For example, the fully qualified domain
name for the "jupiter" host would be "jupiter.catalyst.com". In other words, the system
"jupiter" is part of the "catalyst" domain (a company’s local network) which in turn is
part of the "com" domain (a domain used by all commercial enterprises).
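
The two naming ideas in this section can be sketched in a few lines of Python: converting a dot-notation address to its 32-bit number, and listing a host's parent domains from its fully qualified domain name. The helper names and the sample address `192.0.2.1` are assumptions for illustration; the host name is the one used in the text.

```python
import ipaddress


def dotted_to_int(address: str) -> int:
    """Convert an IPv4 address in dot-notation to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


def domain_chain(fqdn: str) -> list:
    """List the host name followed by each parent domain, most specific first."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


print(dotted_to_int("192.0.2.1"))            # 3221225985
print(domain_chain("jupiter.catalyst.com"))
# ['jupiter.catalyst.com', 'catalyst.com', 'com']
```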

1.2.5 SMTP (Simple Mail Transfer Protocol)

An Overview

This section contains descriptions of the procedures used in SMTP:
session initiation, the mail transaction, forwarding mail, verifying mailbox names and
expanding mailing lists, and the opening and closing exchanges. Comments on relaying,
a note on mail domains, and a discussion of changing roles are included at the end of this
section.

Session Initiation

An SMTP session is initiated when a client opens a connection to a
server and the server responds with an opening message.

SMTP server implementations MAY include identification of their software and version
information in the connection greeting reply after the 220 code, a practice that permits
more efficient isolation and repair of any problems. Implementations MAY make
provision for SMTP servers to disable the software and version announcement where
it causes security concerns. While some systems also identify their contact point for mail
problems, this is not a substitute for maintaining the required "postmaster" address.

The SMTP protocol allows a server to formally reject a transaction while still
allowing the initial connection as follows: a 554 response MAY be given in the initial
connection opening message instead of the 220. A server taking this approach MUST
still wait for the client to send a QUIT (see section 4.1.1.10) before closing the
connection and SHOULD respond to any intervening commands with

Client Initiation

Once the server has sent the welcoming message and the client has received it, the client
normally sends the EHLO command to the server, indicating the client's identity. In
addition to opening the session, use of EHLO indicates that the client is able to process
service extensions and requests that the server provide a list of the extensions it supports.
Older SMTP systems which are unable to support service extensions and contemporary
clients which do not require service extensions in the mail session being initiated, MAY
use HELO instead of EHLO. Servers MUST NOT return the extended EHLO-style
response to a HELO command. For a particular connection attempt, if the server returns
a "command not recognized" response to EHLO, the client SHOULD be able to fall back
and send HELO.

In the EHLO command the host sending the command identifies itself; the command may
be interpreted as saying "Hello, I am <domain>" (and, in the case of EHLO, "and I
support service extension requests").
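
The EHLO-to-HELO fallback can be sketched as follows. This is an illustration only: `send_command` is a hypothetical callable that sends one command line and returns the server's numeric reply code, and treating reply codes 500 and 502 as "command not recognized" is an assumption of this sketch rather than something stated in the text above.

```python
def greet(send_command, client_domain):
    """Open the session with EHLO, falling back to HELO for older servers.

    Returns the name of the command the server accepted.
    """
    code = send_command(f"EHLO {client_domain}")
    if code == 250:
        return "EHLO"
    if code in (500, 502):  # assumed "command not recognized" replies
        code = send_command(f"HELO {client_domain}")
        if code == 250:
            return "HELO"
    raise RuntimeError(f"greeting rejected with reply code {code}")


# Two stand-in servers used only to exercise the function.
def modern_server(line):
    return 250


def legacy_server(line):
    return 500 if line.startswith("EHLO") else 250


print(greet(modern_server, "client.example.com"))  # EHLO
print(greet(legacy_server, "client.example.com"))  # HELO
```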

Mail Transactions

There are three steps to SMTP mail transactions. The transaction starts with a MAIL
command which gives the sender identification. (In general, the MAIL command may be
sent only when no mail transaction is in progress; see section 4.1.4.) A series of one or
more RCPT commands follows giving the receiver information. Then a DATA command
initiates transfer of the mail data and is terminated by the "end of mail" data indicator,
which also confirms the transaction.

The first step in the procedure is the MAIL command.

MAIL FROM:<reverse-path> [SP <mail-parameters> ] <CRLF>

This command tells the SMTP-receiver that a new mail transaction is starting and to reset
all its state tables and buffers, including any recipients or mail data. The <reverse-path>
portion of the first or only argument contains the source mailbox (between "<" and ">"
brackets), which can be used to report errors (see section 4.2 for a discussion of error
reporting). If accepted, the SMTP server returns a 250 OK reply. If the mailbox
specification is not acceptable for some reason, the server MUST return a reply indicating
whether the

DATA <CRLF>

If accepted, the SMTP server returns a 354 Intermediate reply and considers all
succeeding lines up to but not including the end of mail data indicator to be the message
text. When the end of text is successfully received and stored, the SMTP-receiver sends a
250 OK reply. Since the mail data is sent on the transmission channel, the end of mail data
must be indicated so that the command and reply dialog can be resumed. SMTP indicates
the end of the mail data by sending a line containing only a "." (period or full stop). A
transparency procedure is used to prevent this from interfering with the user's text.
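
The text does not spell out the transparency procedure. In the SMTP specification it works by having the sender add one extra period to any line of the message that already begins with a period, which the receiver then strips. The sketch below shows the sending side under that understanding; the function name `dot_stuff` is made up for this example and the body is assumed to be ASCII text.

```python
def dot_stuff(body: str) -> bytes:
    """Prepare message text for the DATA phase.

    Lines that start with "." get an extra leading "." so that they cannot
    be mistaken for the end of mail data indicator, which is appended last.
    """
    lines = body.splitlines()
    stuffed = ["." + line if line.startswith(".") else line for line in lines]
    return ("\r\n".join(stuffed) + "\r\n.\r\n").encode("ascii")


print(dot_stuff("hi\n.hidden\nbye"))
# b'hi\r\n..hidden\r\nbye\r\n.\r\n'
```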

The end of mail data indicator also confirms the mail transaction and tells the SMTP
server to now process the stored recipients and mail data. If accepted, the SMTP server
returns a 250 OK reply. The DATA command can fail at only two points in the protocol
exchange:

- If there was no MAIL, or no RCPT, command, or all such commands were rejected, the
server MAY return a "command out of sequence" (503) or "no valid recipients" (554)
reply in response to the DATA command. If one of those replies (or any other 5yz reply)
is received, the client MUST NOT send the message data; more generally, message data
MUST NOT be sent unless a 354 reply is received.
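
The ordering rules of a mail transaction (MAIL, then RCPT, then DATA, with message data sent only after a 354 reply) can be sketched as below. `send_command` is a hypothetical callable that transmits a piece of text and returns the numeric reply code that follows it, and `message_data` is assumed to be already prepared and terminated with the end of mail data indicator. This is an illustrative outline, not a complete SMTP client.

```python
def run_transaction(send_command, sender, recipients, message_data):
    """Walk through one SMTP mail transaction and return the final reply code."""
    if send_command(f"MAIL FROM:<{sender}>") != 250:
        raise RuntimeError("MAIL command rejected")
    accepted = [r for r in recipients
                if send_command(f"RCPT TO:<{r}>") == 250]
    if not accepted:
        raise RuntimeError("no valid recipients")
    if send_command("DATA") != 354:
        # Without a 354 reply the message data must not be sent.
        raise RuntimeError("DATA command rejected")
    return send_command(message_data)


# A stand-in server that records what it receives.
log = []


def accepting_server(text):
    log.append(text)
    return 354 if text == "DATA" else 250


final = run_transaction(accepting_server, "alice@example.com",
                        ["bob@example.com"], "Hi Bob\r\n.\r\n")
print(final, log)
```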

1.2.6 POP 3

The Post Office Protocol version 3 (POP3) is an application-layer Internet
standard protocol used by local e-mail clients to retrieve e-mail from a remote server over
a TCP/IP connection. POP3 and IMAP4 (Internet Message Access Protocol) are the two
most prevalent Internet standard protocols for e-mail retrieval. Virtually all modern e-
mail clients and servers support both.

POP3 has made earlier versions of the protocol, informally called POP1 and
POP2, obsolete. In contemporary usage, the less precise term POP almost always means
POP3 in the context of e-mail protocols.

The design of POP3 and its procedures supports end-users with intermittent
connections (such as dial-up connections), allowing these users to retrieve e-mail when
connected and then to view and manipulate the retrieved messages without needing to
stay connected. Although most clients have an option to leave mail on server, e-mail
clients using POP3 generally connect, retrieve all messages, store them on the user's PC
as new messages, delete them from the server, and then disconnect.

1.2.7 Internet Message Access Protocol (IMAP)

In contrast, the newer, more capable Internet Message Access Protocol (IMAP)
supports both connected (online) and disconnected (offline) modes of operation. E-mail
clients using IMAP generally leave messages on the server until the user explicitly
deletes them. This and other aspects of IMAP operation allow multiple clients to access
the same mailbox. Most e-mail clients support either POP3 or IMAP to retrieve
messages; however, fewer Internet Service Providers (ISPs) support IMAP. The
fundamental difference between POP3 and IMAP4 is that POP3 offers access to a mail
drop; the mail exists on the server until it is collected by the client. Even if the client
leaves some or all messages on the server, the client's message store is considered
authoritative. In contrast, IMAP4 offers access to the mail store; the client may store local
copies of the messages, but these are considered to be a temporary cache; the server's
store is authoritative.

Clients with a leave mail on server option generally use the POP3 UIDL (Unique
IDentification Listing) command. Most POP3 commands identify specific messages by
their ordinal number on the mail server. This creates a problem for a client intending to
leave messages on the server, since these message numbers may change from one
connection to the server to another. For example if a mailbox contains five messages at
last connect, and a different client then deletes message #3, the next connecting user will
find the last two messages' numbers decremented by one.

UIDL provides a mechanism to avoid these numbering issues. The server assigns
a string of characters as a permanent and unique ID for the message. When a POP3-
compatible e-mail client connects to the server, it can use the UIDL command to get the
current mapping from these message IDs to the ordinal message numbers. The client can
then use this mapping to determine which messages it has yet to download, which saves
time when downloading. IMAP has a similar mechanism, a 32-bit unique identifier (UID)
that must be assigned to messages in ascending (although not necessarily consecutive)
order as they are received. Because IMAP UIDs are assigned in this manner, to retrieve
new messages an IMAP client need only request the UIDs greater than the highest UID
among all previously-retrieved messages, whereas a POP client must fetch the entire
UIDL map. For large mailboxes, this difference can be significant.
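
The difference between the two approaches can be sketched in Python. The data shapes here are assumptions for illustration: the UIDL listing is modelled as a dictionary from ordinal number to unique ID, and the IMAP side only builds the UID range `n:*` that a client could ask for. The short IDs are invented.

```python
def pop3_new_messages(uidl_map, seen_ids):
    """Return the ordinal numbers of messages whose unique ID is not yet known.

    The whole UIDL map has to be fetched and compared on every connection.
    """
    return sorted(number for number, uid in uidl_map.items()
                  if uid not in seen_ids)


def imap_new_uid_range(highest_seen_uid):
    """Return the UID range covering only messages newer than the last one seen."""
    return f"{highest_seen_uid + 1}:*"


# Five messages were downloaded earlier; another client then deleted the
# third one and a new message arrived, so the ordinal numbers have shifted.
seen = {"a", "b", "c", "d", "e"}
current = {1: "a", 2: "b", 3: "d", 4: "e", 5: "f"}
print(pop3_new_messages(current, seen))  # [5]
print(imap_new_uid_range(42))            # 43:*
```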

Whether using POP3 or IMAP to retrieve messages, e-mail clients typically use
the SMTP_Submit profile of the Simple Mail Transfer Protocol (SMTP) to send
messages. E-mail clients are commonly categorized as either POP or IMAP clients, but in
both cases the clients also use SMTP.

There are extensions to POP3 that allow some clients to transmit outbound mail
via POP3 - these are known as "XTND XMIT" extensions. The Qualcomm qpopper and
CommuniGate Pro servers and Eudora clients are examples of systems that optionally
utilize the XTND XMIT methods of authenticated client-to-server e-mail transmission.

MIME serves as the standard for attachments and non-ASCII text in e-mail.
Although neither POP3 nor SMTP require MIME-formatted e-mail, essentially all
Internet e-mail comes MIME-formatted, so POP clients must also understand and use
MIME. IMAP, by design, assumes MIME-formatted e-mail.

Like many other older Internet protocols, POP3 originally supported only an unencrypted
login mechanism. Although plain text transmission of passwords in POP3 still commonly
occurs, POP3 currently supports several authentication methods to provide varying levels
of protection against illegitimate access to a user's e-mail. One such method, APOP, uses
the MD5 hash function in an attempt to avoid replay attacks and disclosure of the shared
secret. Clients implementing APOP include Mozilla Thunderbird, Opera, Eudora, KMail,
Novell Evolution, Windows Live Mail, PowerMail, and Mutt. POP3 clients can also
support SASL authentication methods via the AUTH extension. MIT Project Athena also
produced a Kerberized version.
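
As a rough sketch of how APOP uses MD5: in the POP3 specification the server's greeting contains a timestamp string, and the client replies with the hexadecimal MD5 digest of that timestamp followed by the shared secret, so the secret itself never crosses the network. The timestamp and secret below are invented values, and the function name is an assumption of this example.

```python
import hashlib


def apop_digest(timestamp_banner: str, shared_secret: str) -> str:
    """Compute the digest a client would send with the APOP command."""
    material = (timestamp_banner + shared_secret).encode("ascii")
    return hashlib.md5(material).hexdigest()


# A fresh timestamp on each connection yields a different digest, which is
# what is meant to defeat replaying a previously captured login.
first = apop_digest("<1000.1@mail.example.com>", "secret")
second = apop_digest("<1001.2@mail.example.com>", "secret")
print(first != second)  # True
```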

POP3 works over a TCP/IP connection using TCP on network port 110. E-mail
clients can encrypt POP3 traffic using TLS or SSL. A TLS or SSL connection is
negotiated using the STLS command. Some clients and servers, like Google Gmail,
instead use the deprecated alternate-port method, which uses TCP port 995.

1.2.8 Multipurpose Internet Mail Extensions (MIME)

Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends
the format of e-mail to support:

• text in character sets other than US-ASCII;
• non-text attachments;
• message bodies with multiple parts;
• header information in non-ASCII character sets.

MIME's use, however, has grown beyond describing the content of e-mail to
describing content type in general.

Virtually all human-written Internet e-mail and a fairly large proportion of automated
e-mail is transmitted via SMTP in MIME format. Internet e-mail is so closely associated
with the SMTP and MIME standards that it is sometimes called SMTP/MIME e-mail.[1]

The content types defined by MIME standards are also of importance outside of e-
mail, such as in communication protocols like HTTP for the World Wide Web. HTTP
requires that data be transmitted in the context of e-mail-like messages, even though the
data may not actually be e-mail.

MIME defines mechanisms for sending other kinds of information in e-mail. These
include text in languages other than English using character encodings other than ASCII,
and 8-bit binary content such as files containing images, sounds, movies, and computer
programs. MIME is also a fundamental component of communication protocols such as
HTTP, which requires that data be transmitted in the context of e-mail-like messages
even though the data might not fit this context. Mapping messages into and out of MIME
format is typically done automatically by an e-mail client or by mail servers when
sending or receiving Internet (SMTP/MIME) e-mail.

The basic format of Internet e-mail is defined in RFC 2822, which is an updated
version of RFC 822. These standards specify the familiar formats for text e-mail headers
and body and rules pertaining to commonly used header fields such as "To:", "Subject:",
"From:", and "Date:". MIME defines a collection of e-mail headers for specifying
additional attributes of a message including content type, and defines a set of transfer
encodings which can be used to represent 8-bit binary data using characters from the 7-bit
ASCII character set. MIME also specifies rules for encoding non-ASCII characters in e-
mail message headers, such as "Subject:", allowing these header fields to contain non-
English characters.

MIME is extensible. Its definition includes a method to register new content types and
other MIME attribute values.

The goals of the MIME definition included requiring no changes to extant e-mail
servers, and allowing plain text e-mail to function in both directions with extant clients.
These goals were achieved by using additional RFC 822-style headers for all MIME
message attributes and by making the MIME headers optional with default values
ensuring a non-MIME message is interpreted correctly by a MIME-capable client. A
simple MIME text message is therefore likely to be interpreted correctly by a non-MIME
client although it has e-mail headers the non-MIME client won't know how to interpret.
Similarly, if the quoted-printable transfer encoding (see below) is used, the ASCII part of
the message will be intelligible to users with non-MIME clients.

MIME headers

MIME-Version

The presence of this header indicates the message is MIME-formatted. The value is
typically "1.0" so this header appears as

MIME-Version: 1.0

Implementers have attempted to change the version number in the
past, and the change had unforeseen results. It was decided at an IETF meeting
to leave the version number as is, even though there have been many updates and versions
of MIME.

Content-Type

This header indicates the Internet media type of the message content, consisting of a type
and subtype, for example

Content-Type: text/plain

Through the use of the multipart type, MIME allows messages to have parts arranged in a
tree structure where the leaf nodes are any non-multipart content type and the non-leaf
nodes are any of a variety of multipart types. This mechanism supports:

• simple text messages using text/plain (the default value for "Content-Type:")
• text plus attachments (multipart/mixed with a text/plain part and other non-text
  parts). A MIME message including an attached file generally indicates the file's
  original name with the "Content-Disposition:" header, so the type of file is
  indicated both by the MIME content type and the (usually OS-specific) filename
  extension
• reply with original attached (multipart/mixed with a text/plain part and the
  original message as a message/rfc822 part)
• alternative content, such as a message sent in both plain text and another format
  such as HTML (multipart/alternative with the same content in text/plain and
  text/html forms)
• image, audio, video and application content (for example, image/jpeg, audio/mpeg,
  video/mp4, and application/msword)
• many other message constructs
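
The tree structure can be made visible with Python's standard `email` package. The sketch below builds a message with a plain-text body, an HTML alternative, and a binary attachment, then prints the content type of every node. The addresses, the subject, the file name `data.bin`, and the helper names are all invented for the example.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    msg["Subject"] = "Report"
    msg.set_content("Plain text version")                       # text/plain
    msg.add_alternative("<p>HTML version</p>", subtype="html")  # wraps both in multipart/alternative
    msg.add_attachment(b"\x00\x01\x02", maintype="application",
                       subtype="octet-stream",
                       filename="data.bin")                     # wraps everything in multipart/mixed
    return msg


def outline(part, depth=0):
    """Return the content types of a message tree, indented by depth."""
    lines = ["  " * depth + part.get_content_type()]
    if part.is_multipart():
        for child in part.iter_parts():
            lines.extend(outline(child, depth + 1))
    return lines


message = build_message()
print(message["MIME-Version"])
print("\n".join(outline(message)))
```

The non-leaf nodes in the printed outline are the multipart types, and the leaves are the text and binary parts.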

Content-Disposition

The original MIME specifications only provided a means to associate filenames with
application/octet-stream parts. This was done through the use of a name= parameter on
the content-type. The theory here was that filenames were mostly used for type
information and therefore did not need to be present in most cases. It was a mistake. The
specification of content-disposition attempted to provide a more general means of
providing file name information by defining a filename parameter as part of the content-
disposition field.[1]

The following example is taken from RFC 2183, where the header is defined:

Content-Disposition: attachment; filename=genome.jpeg;
  modification-date="Wed, 12 Feb 1997 16:29:51 -0500";

The filename may be encoded as defined by RFC 2231. Besides attachment, one can
specify inline, or any other disposition type. Unfortunately, no name is defined for the
nominal "default" disposition that corresponds to no content-disposition being present.
Thus the recommended practice for generating agents is to include filename
information only when it is necessary, which also avoids leaking sensitive information. If
filename information has to be included, an agent should put it either in a filename=
parameter or in both a filename= and a name= parameter. A name= parameter should
never be used on its own, because that invites arbitrary interpretation of the part under an
unintended disposition value.[1]
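
As an illustration, Python's standard `email` package writes the file name into the Content-Disposition header when an attachment is added, and can read it back. The file name is borrowed from the RFC 2183 example; the message text and the image bytes are placeholders invented for this sketch.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg.set_content("See the attached image.")
msg.add_attachment(b"not really image data", maintype="image",
                   subtype="jpeg", filename="genome.jpeg")

# The text/plain body is skipped; only the attached part is returned.
attachment = next(msg.iter_attachments())

print(attachment.get_content_type())         # image/jpeg
print(attachment.get_content_disposition())  # attachment
print(attachment.get_filename())             # genome.jpeg
```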

Content-Transfer-Encoding

In June 1992, MIME (RFC 1341, since obsoleted by RFC 2045) defined a set of methods
for representing binary data in ASCII text format. The Content-Transfer-Encoding MIME
header serves two purposes:

1. It indicates whether or not a binary-to-text encoding scheme has been used on top
of the original encoding as specified within the Content-Type header, and
2. If such a binary-to-text encoding method has been used it states which one.

The RFC and the IANA's list of transfer encodings define the values shown below, which
are not case sensitive. Note that '7bit', '8bit', and 'binary' mean that no binary-to-text
encoding on top of the original encoding was used. In these cases, the header is actually
redundant for the email client to decode the message body, but it may still be useful as an
indicator of what type of object is being sent. Values 'quoted-printable' and 'base64' tell
the email client that a binary-to-text encoding scheme was used and that appropriate
initial decoding is necessary before the message can be read with its original encoding
(e.g. UTF-8).

• Suitable for use with normal SMTP:
    o 7bit: up to 998 octets per line, with octets in the code range 1..127 and with
      CR and LF (codes 13 and 10 respectively) only allowed to appear as part of a
      CRLF line ending. This is the default value.
    o quoted-printable: used to encode arbitrary octet sequences into a form
      that satisfies the rules of 7bit. Designed to be efficient and mostly human
      readable when used for text data consisting primarily of US-ASCII
      characters but also containing a small proportion of bytes with values
      outside that range.
    o base64: used to encode arbitrary octet sequences into a form that
      satisfies the rules of 7bit. Designed to be efficient for non-text 8-bit data.
      Sometimes used for text data that frequently uses non-US-ASCII
      characters.
• Suitable for use with SMTP servers that support the 8BITMIME SMTP
  extension:
    o 8bit: up to 998 octets per line with CR and LF (codes 13 and 10
      respectively) only allowed to appear as part of a CRLF line ending.
• Suitable only for use with SMTP servers that support the BINARYMIME SMTP
  extension (RFC 3030):
    o binary: any sequence of octets.
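
The relationship between these encodings can be tried out with Python's standard `base64` and `quopri` modules. The checker `fits_7bit` is a simplified helper written for this example (it is not a library function), and the sample text is invented. It shows that UTF-8 text with an accented character breaks the 7bit rules until it is passed through quoted-printable or base64.

```python
import base64
import quopri


def fits_7bit(data: bytes) -> bool:
    """Check the 7bit rules: octets 1..127, lines of at most 998 octets,
    and CR or LF appearing only as part of a CRLF line ending."""
    return all(
        len(line) <= 998
        and b"\r" not in line
        and b"\n" not in line
        and all(1 <= octet <= 127 for octet in line)
        for line in data.split(b"\r\n")
    )


raw = "Café au lait".encode("utf-8")   # contains two octets above 127
as_qp = quopri.encodestring(raw)       # mostly readable, suited to text
as_b64 = base64.b64encode(raw)         # compact, suited to binary data

print(fits_7bit(raw), fits_7bit(as_qp), fits_7bit(as_b64))
print(as_qp, as_b64)
```

Decoding with `quopri.decodestring` or `base64.b64decode` recovers the original octets, after which the charset named in Content-Type is applied.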

Multipart subtypes

The MIME standard defines various multipart-message subtypes, which specify the
nature of the message parts and their relationship to one another. The subtype is specified
in the "Content-Type" header of the overall message. For example, a multipart MIME
message using the digest subtype would have its Content-Type set as "multipart/digest".

The RFC initially defined 4 subtypes: mixed, digest, alternative and parallel. A minimally
compliant application must support mixed and digest; other subtypes are optional.
Additional subtypes, such as signed and form-data, have since been separately defined in
other RFCs.

The following is a list of the most commonly used subtypes; it is not intended to be a
comprehensive list.

Mixed

Multipart/mixed is used for sending files with different "Content-Type" headers inline (or
as attachments). If sending pictures or other easily readable files, most mail clients will
display them inline (unless otherwise specified with the "Content-disposition" header).
Otherwise it will offer them as attachments. The default content-type for each part is
"text/plain".

Message

A message/rfc822 part contains an email message, including any headers. The name
rfc822 is a misnomer, since the message may be a full MIME message. This is used for
digests as well as for e-mail forwarding.

Digest

Multipart/digest is a simple way to send multiple text messages. The default content-type
for each part is "message/rfc822".

Alternative

The multipart/alternative subtype indicates that each part is an "alternative" version of the
same (or similar) content, each in a different format denoted by its "Content-Type"
header. The formats are ordered by how faithful they are to the original, with the least
faithful first and the most faithful last. Systems can then choose the "best" representation
they are capable of processing; in general, this will be the last part that the system can
understand, although other factors may affect this.

Since a client is unlikely to want to send a version that is less faithful than the plain text
version this structure places the plain text version (if present) first. This makes life easier
for users of clients that do not understand multipart messages.

Most commonly multipart/alternative is used for email with two parts, one plain text
(text/plain) and one HTML (text/html). The plain text part provides backwards
compatibility while the HTML part allows use of formatting and hyperlinks. Most email
clients offer a user option to prefer plain text over HTML; this is an example of how local
factors may affect how an application chooses which "best" part of the message to
display.
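
The two-part layout described above can be sketched with Python's standard email package. This is only an illustration: the function names, the subject line and the message texts are assumptions made for the example, not part of the original notes.

```python
from email.message import EmailMessage


def build_alternative_message(plain_text, html_text):
    """Build a message whose body is multipart/alternative."""
    msg = EmailMessage()
    msg["Subject"] = "Example"  # hypothetical subject line
    # The least faithful version (plain text) goes first ...
    msg.set_content(plain_text)
    # ... and the most faithful version (HTML) goes last.
    msg.add_alternative(html_text, subtype="html")
    return msg


def part_types(msg):
    """Return the content type of each part, in order."""
    return [part.get_content_type() for part in msg.iter_parts()]
```

Calling part_types() on the result lists text/plain before text/html, matching the "least faithful first" ordering.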

While it is intended that each part of the message represent the same content, the standard
does not require this to be enforced in any way. At one time, anti-spam filters would only
examine the text/plain part of a message, because it is easier to parse than the text/html
part. But spammers eventually took advantage of this, creating messages with an
innocuous-looking text/plain part and advertising in the text/html part. Anti-spam
software eventually caught on to this trick, penalizing messages with very different text
in a multipart/alternative message.

Related

A multipart/related message is used to indicate that message parts should not be considered
individually but rather as parts of an aggregate whole. The message consists of a root part
(by default, the first) which references other parts inline, which may in turn reference
other parts. Message parts are commonly referenced by the "Content-ID" part header.
The syntax of a reference is unspecified and is instead dictated by the encoding or
protocol used in the part.

One common usage of this subtype is to send a web page complete with images in a
single message. The root part would contain the HTML document, and use image tags to
reference images stored in the latter parts.

Report

Multipart/report is a message type that contains data formatted for a mail server to read.
It is split between a text/plain part (or some other easily readable content type) and a
message/delivery-status part, which contains the data formatted for the mail server to read.

Signed

A multipart/signed message is used to attach a digital signature to a message. It has two
parts, a body part and a signature part. The whole of the body part, including MIME
headers, is used to create the signature part. Many signature types are possible, like
application/pgp-signature (RFC 3156) and application/x-pkcs7-signature (S/MIME).

Encrypted

A multipart/encrypted message has two parts. The first part has control information that
is needed to decrypt the application/octet-stream second part. Similar to signed messages,
there are different implementations which are identified by their separate content types
for the control part. The most common types are "application/pgp-encrypted" (RFC 3156)
and "application/pkcs7-mime" (S/MIME).

Form Data

As its name implies, multipart/form-data is used to express values submitted through a
form. Originally defined as part of HTML 4.0, it is most commonly used for submitting
files via HTTP.

Mixed-Replace (Experimental)

The content type multipart/x-mixed-replace was developed as part of a technology to
emulate server push and streaming over HTTP.

All parts of a mixed-replace message have the same semantic meaning. However, each
part invalidates - "replaces" - the previous parts as soon as it is received completely.
Clients should process the individual parts as soon as they arrive and should not wait for
the whole message to finish.

1.3 World Wide Web

.Originally, one of several systems for organizing Internet-based information

.Competitors: WAIS, Gopher, ARCHIE

.Distinctive feature of Web: support for hypertext (text containing links)

.Communication via Hypertext Transfer Protocol (HTTP)

.Document representation using Hypertext Markup Language (HTML)

.The Web is the collection of machines (Web servers) on the Internet that provide
information, particularly HTML documents, via HTTP.

.Machines that access information on the Web are known as Web clients. A Web browser
is software used by an end user to access the Web.

1.4 Hypertext Transfer Protocol (HTTP)

Hypertext Transfer Protocol (HTTP) is a communications protocol for transferring
information on the Internet. Its use for retrieving inter-linked text documents (hypertext)
led to the establishment of the World Wide Web.

HTTP development was coordinated by the World Wide Web Consortium and the
Internet Engineering Task Force (IETF), culminating in the publication of a series of
Request for Comments (RFCs), most notably RFC 2616 (June 1999), which defines
HTTP/1.1, the version of HTTP in common use.

HTTP is a request/response standard between a client and a server. The client is the end
user; the server is the web site. The client making an HTTP request - using a web browser,
spider, or other end-user tool - is referred to as the user agent. The responding server -
which stores or creates resources such as HTML files and images - is called the origin
server. In between the user agent and origin server may be several intermediaries, such as
proxies, gateways, and tunnels. HTTP is not constrained to using TCP/IP and its
supporting layers, although this is its most popular application on the Internet. Indeed
HTTP can be "implemented on top of any other protocol on the Internet, or on other
networks. HTTP only presumes a reliable transport; any protocol that provides such
guarantees can be used."

Typically, an HTTP client initiates a request. It establishes a Transmission Control
Protocol (TCP) connection to a particular port on a host (port 80 by default). An HTTP
server listening on that port waits for the client to send a request message. Upon
receiving the request, the server sends back a status line, such as "HTTP/1.1 200 OK",
and a message of its own, the body of which is perhaps the requested file, an error
message, or some other information.

HTTP uses TCP and not UDP because much data must be sent for a webpage, and TCP
provides transmission control, presents the data in order, and retransmits lost or
corrupted segments.

1.5 HTTP Request message

The request message consists of the following:

 Request line, such as GET /images/logo.gif HTTP/1.1, which requests the file
logo.gif from the /images directory
 Headers, such as Accept-Language: en
 An empty line
 An optional message body

The request line and headers must all end with <CR><LF> (that is, a carriage return
followed by a line feed). The empty line must consist of only <CR><LF> and no other
whitespace. In the HTTP/1.1 protocol, all headers except Host are optional.

A request line containing only the path name is accepted by servers to maintain
compatibility with HTTP clients before the HTTP/1.0 specification.
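
The layout of a request message can be sketched as a small Python helper that assembles the bytes a client would send. The function name and the host www.example.org are assumptions made for illustration. The helper always emits the mandatory Host header and terminates every line with <CR><LF>.

```python
def build_request(method, path, host, headers=None):
    """Assemble a body-less HTTP/1.1 request message as bytes."""
    lines = [
        f"{method} {path} HTTP/1.1",  # request line
        f"Host: {host}",              # the one header HTTP/1.1 requires
    ]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Every line ends with CRLF; one extra CRLF forms the empty line
    # that separates the headers from the (here absent) body.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
```

For example, build_request("GET", "/images/logo.gif", "www.example.org", {"Accept-Language": "en"}) produces the request line, two headers and the terminating empty line.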

Request methods

[Figure: an HTTP request made using telnet, with the request, response headers and
response body highlighted.]

HTTP defines eight methods (sometimes referred to as "verbs") indicating the desired
action to be performed on the identified resource.

HEAD

Asks for the response identical to the one that would correspond to a GET
request, but without the response body. This is useful for retrieving meta-
information written in response headers, without having to transport the entire
content.

GET

Requests a representation of the specified resource. By far the most common
method used on the Web today. Should not be used for operations that cause side
effects (using it for actions in web applications is a common misuse).

POST

Submits data to be processed (e.g. from an HTML form) to the identified
resource. The data is included in the body of the request. This may result in the
creation of a new resource or the update of existing resources or both.

PUT

Uploads a representation of the specified resource.

DELETE

Deletes the specified resource.

TRACE

Echoes back the received request, so that a client can see what intermediate
servers are adding or changing in the request.

OPTIONS

Returns the HTTP methods that the server supports for the specified URL. This can
be used to check the functionality of a web server by requesting '*' instead of a
specific resource.

CONNECT

Converts the request connection to a transparent TCP/IP tunnel, usually to
facilitate SSL-encrypted communication (HTTPS) through an unencrypted HTTP
proxy.

HTTP versions

HTTP has evolved into multiple, mostly backwards-compatible protocol versions. RFC
2145 describes the use of HTTP version numbers. The client states the version it uses at the
beginning of the request, and the server uses the same or an earlier version in the response.

HTTP/0.9 (1991)

Deprecated. Supports only one command, GET, which does not specify the HTTP
version. Does not support headers. Since this version does not support POST, the
information a client can pass to the server is limited by the URI length.

HTTP/1.0 (May 1996)

This is the first protocol revision to specify its version in communications and is
still in wide use, especially by proxy servers.

HTTP/1.1 (1997-1999)

Current version; persistent connections enabled by default and works well with
proxies. Also supports request pipelining, allowing multiple requests to be sent at
the same time, allowing the server to prepare for the workload and potentially
transfer the requested resources more quickly to the client.

HTTP/1.2

The initial 1995 working drafts of the document PEP – an Extension Mechanism
for HTTP (which proposed the Protocol Extension Protocol, abbreviated PEP)
were prepared by the World Wide Web Consortium and submitted to the Internet
Engineering Task Force. PEP was originally intended to become a distinguishing
feature of HTTP/1.2. In later PEP working drafts, however, the reference to
HTTP/1.2 was removed. The experimental RFC 2774, HTTP Extension
Framework, largely subsumed PEP. It was published in February 2000.

1.6 HTTP Response message

The response message consists of the following:

 Status line, such as HTTP/1.1 200 OK, which indicates that the request succeeded
 Headers, such as Content-Type: text/html
 An empty line
 An optional message body

The status line and headers must all end with <CR><LF> (that is, a carriage return
followed by a line feed). The empty line must consist of only <CR><LF> and no other
whitespace.
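
As a rough sketch of how a client might take a response message apart, the Python helper below splits raw bytes into status line, headers and body. The function name and the sample message are assumptions for illustration. A real client would also need to handle chunked bodies, repeated headers and malformed input.

```python
def parse_response(raw):
    """Split a raw HTTP response into (version, status, reason, headers, body)."""
    # The first empty line (CRLF CRLF) separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: protocol version, numeric status code, reason phrase.
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(status), reason, headers, body
```

Header names are lower-cased here because HTTP header names are case-insensitive.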

1.7 Web Clients

.Many possible web clients:

.Text-only “browser” (lynx)

.Mobile phones

.Robots (software-only clients, e.g., search engine “crawlers”), etc.

1.7.1 Web Browsers

.First graphical browser running on general-purpose platforms: Mosaic (1993)


.Primary tasks:

.Convert web addresses (URLs) to HTTP requests

.Communicate with web servers via HTTP

.Render (appropriately display) documents returned by a server

HTTP URLs

.Browser uses authority to connect via TCP

.Request-URI included in start line (/ used for path if none supplied)

.Fragment identifier not sent to server (used to scroll browser client area)

http://www.example.org:56789/a/b/c.txt?t=win&s=chess#para5

host (FQDN): www.example.org
port: 56789
authority: www.example.org:56789
path: /a/b/c.txt
query: t=win&s=chess
fragment: para5
Request-URI: /a/b/c.txt?t=win&s=chess
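
The decomposition above can be reproduced with Python's standard urllib.parse module. The helper name and the dictionary keys are assumptions made for this sketch. It also shows the two rules from the bullets: "/" is used when no path is supplied, and the fragment is kept out of the Request-URI.

```python
from urllib.parse import urlsplit


def describe_http_url(url):
    """Break an http URL into the parts a browser works with."""
    parts = urlsplit(url)
    # "/" is used for the path if none is supplied.
    request_uri = parts.path or "/"
    if parts.query:
        request_uri += "?" + parts.query
    return {
        "host": parts.hostname,
        "port": parts.port or 80,   # default HTTP port
        "authority": parts.netloc,
        "path": parts.path or "/",
        "query": parts.query,
        "fragment": parts.fragment,  # stays in the browser, never sent
        "request_uri": request_uri,
    }
```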

Web Browsers

.Standard features

.Save web page to disk

.Find string in page

.Fill forms automatically (passwords, CC numbers, …)

.Set preferences (language, character set, cache and HTTP parameters)

.Modify display style (e.g., increase font sizes)

.Display raw HTML and HTTP header info (e.g., Last-Modified)

.Choose browser themes (skins)

.View history of web addresses visited

.Bookmark favorite pages for easy return

Web Browsers

.Additional functionality:

.Execution of scripts (e.g., drop-down menus)

.Event handling (e.g., mouse clicks)

.GUI for controls (e.g., buttons)

.Secure communication with servers

.Display of non-HTML documents (e.g., PDF) via plug-ins

1.8 Web Servers

.Basic functionality:

.Receive HTTP request via TCP

.Map Host header to specific virtual host (one of many host names sharing an IP address)

.Map Request-URI to specific resource associated with the virtual host

.File: Return file in HTTP response

.Program: Run program and return output in HTTP response

.Map type of resource to appropriate MIME type and use to set Content-Type header in
HTTP response

.Log information about the request and response
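
The mapping steps listed above might be sketched as follows. The virtual-host table, the directory names and the function name are hypothetical, and the welcome file index.html is assumed. A real server would additionally check that the file exists and apply its own configuration.

```python
import mimetypes
import posixpath

# Hypothetical table: host name -> document root on disk.
VIRTUAL_HOSTS = {
    "www.example.org": "/srv/example",
    "localhost": "/srv/default",
}


def resolve(host_header, request_uri):
    """Map a Host header and Request-URI to (file path, Content-Type)."""
    host = host_header.split(":", 1)[0].lower()  # drop an optional port
    doc_root = VIRTUAL_HOSTS.get(host)
    if doc_root is None:
        return None  # unknown virtual host
    path = request_uri.split("?", 1)[0]  # the query is not part of the file name
    if path.endswith("/"):
        path += "index.html"  # assumed welcome file
    # Collapse "." and ".." segments so the path stays below the root.
    file_path = doc_root + posixpath.normpath(path)
    content_type = mimetypes.guess_type(file_path)[0] or "application/octet-stream"
    return file_path, content_type
```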

Web Servers

.httpd: UIUC, primary Web server c. 1995

.Apache: “A patchy” version of httpd, now the most popular server (esp. on Linux
platforms)

.IIS: Microsoft Internet Information Server

.Tomcat:

.Java-based

.Provides container (Catalina) for running Java servlets (HTML-generating programs) as
back-end to Apache or IIS

.Can run stand-alone using Coyote HTTP front-end

Web Servers

.Some Coyote communication parameters:

.Allowed/blocked IP addresses

.Max. simultaneous active TCP connections

.Max. queued TCP connection requests

.“Keep-alive” time for inactive TCP connections

.Modify parameters to tune server performance

Web Servers

.Some Catalina container parameters:

.Virtual host names and associated ports

.Logging preferences

.Mapping from Request-URIs to server resources

.Password protection of resources

.Use of server-side caching

Tomcat Web Server

.HTML-based server administration

.Browse to http://localhost:8080 and click on Server Administration link

.localhost is a special host name that means “this machine”

.Some Connector fields:

.Port Number: port “owned” by this connector

.Max Threads: max connections processed simultaneously

.Connection Timeout: keep-alive time

.Each Host is a virtual host (can have multiple per Connector)

.Some fields:

.Host: localhost or a fully qualified domain name

.Application Base: directory (may be path relative to JWSDP installation directory)
containing resources associated with this Host

.Context provides mapping from Request-URI path to a web application

.Document Base field is directory (possibly relative to Application Base) that contains
resources for this web application

.For this example, browsing to http://localhost:8080/ returns resource from
c:\jwsdp-1.3\webapps\ROOT

.Returns index.html (standard welcome file)

Tomcat Web Server

.Access log records HTTP requests

.Parameters set using AccessLogValve

.Default location: logs/access_log.* under JWSDP installation directory

.Example “common” log format entry (one line):

www.example.org - admin [20/Jul/2005:08:03:22 -0500] "GET /admin/frameset.jsp HTTP/1.1" 200 920

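A "common" log format entry has a fixed field order (host, identity, user, timestamp, request line, status, size), so it can be taken apart with a regular expression. The Python sketch below is illustrative only: the pattern and the function name are assumptions, and it does not cover every variant a real access log may contain.

```python
import re

# host ident user [timestamp] "request line" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_log_line(line):
    """Return the fields of one common-log-format entry, or None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" size means no body bytes were sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry
```
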
Tomcat Web Server

.Other logs provided by default in JWSDP:

.Message log: messages sent to log service by web applications or Tomcat itself

.logs/jwsdp_log.*: default message log

.logs/localhost_admin_log.*: message log for web apps within /admin context

.System.out and System.err output (exception traces often found here):

.logs/launcher.server.log

Tomcat Web Server

.Access control:

.Password protection (e.g., admin pages)

.Users and roles defined in conf/tomcat-users.xml

.Deny access to machines

.Useful for denying access to certain users by denying access from the machines they use

.List of denied machines maintained in RemoteHostValve (deny by host name) or
RemoteAddrValve (deny by IP address)

Secure Servers

.Since HTTP messages typically travel over a public network, private information (such
as credit card numbers) should be encrypted to prevent eavesdropping

.https URL scheme tells browser to use encryption

.Common encryption standards:

.Secure Sockets Layer (SSL)

.Transport Layer Security (TLS)

1.9 Markup Languages:

1.10 HTML History

.1990: HTML invented by Tim Berners-Lee

.1993: Mosaic browser adds support for images, sound, video to HTML

.1994-~1997: “Browser wars” between Netscape and Microsoft, HTML defined operationally by
browser support

.~1997-present: Increasingly, World-Wide Web Consortium (W3C) recommendations define HTML

1.11 HTML Versions

.HTML 4.01 (Dec 1999) syntax defined using Standard Generalized Markup Language
(SGML)

.XHTML 1.0 (Jan 2000) syntax defined using Extensible Markup Language (XML)

.Primary differences:

HTML allows some tag omissions (e.g., end tags)

XHTML element and attribute names are lower case (HTML names are case-insensitive)

XHTML requires that attribute values be quoted

1.12 Fundamental HTML Elements


In HTML, the document is structured into elements, marked up by tags that are keywords
contained in pairs of angle brackets.
Each document is structured into two parts - <head> and <body>. The head contains
information about the document that is not generally displayed with
the document, such as its <title>. The body contains the actual text that is made up of
paragraphs, lists, and other elements. The contents of the body are displayed in a browser
window.
Every HTML document should contain certain standard elements. The required elements
are:
<html></html> encloses the entire document and defines it as an HTML document.
<head></head> comes after the opening <html> tag and contains the <title>.
<title></title> contains the name of the document and must be enclosed by <head> tags.
<body></body> contains all the rest of the document.
A minimal HTML document could contain just those elements (such a document,
however, will appear empty on screen, since its body is empty).

Example:

<html>
<head>
<title>Internet programming</title>
</head>
<body>
</body>
</html>
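
One way to check a document for the four standard elements is sketched below, using Python's standard html.parser module. The class and function names are assumptions for illustration. The check only looks at which start tags occur, not at nesting or validity.

```python
from html.parser import HTMLParser

REQUIRED_ELEMENTS = ("html", "head", "title", "body")


class TagCollector(HTMLParser):
    """Remember the name of every start tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)  # the parser reports tag names in lower case


def missing_required(document):
    """Return the required elements that do not occur in the document."""
    collector = TagCollector()
    collector.feed(document)
    collector.close()
    return [name for name in REQUIRED_ELEMENTS if name not in collector.seen]
```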

Document head
The head element contains general information, or meta-information, about the
document. Which elements can appear in the head depends on the HTML version. Some
elements:
<title>
The title of the document. All documents must have a title.
<base>
A record of the original URI of the document: this allows you to move the document to a
new location and have relative URIs access the appropriate place with respect to the
original URI.
<link>
Defines the relationship(s) between this document and another or others. A document can
have several <link> elements.
<meta>
A container for document metainformation.
<style>
Stylesheet instructions, written in a stylesheet language. Stylesheet instructions specify
how the document should be formatted for display.
<script>
Client-side script code in the document. Example languages are JavaScript and
VBScript.

Example:

<head>
<title>Internet programming</title>
<base href="http://www.it.lut.fi/index.html">
<link rel="stylesheet" type="text/css" href="courses.css">
<link href="toc.html" rel="contents">
<link href="slide2.html" rel="next">
<style>
BODY,TD,TH,UL,DL,OL,H1,H2,H3,H4 {
font-family: Arial, Helvetica, sans-serif;
}
.smaller {
font-size: 9pt;
}
</style>
<script type="text/javascript" src="foo.js" charset="ISO-8859-1">
<!--
// embedded script, only executed if foo.js is unavailable
document.write("foo is gone");
// -->
</script>
</head>

Document body
The body element contains the actual content of the page, in other words, what we
want to show on the page. There are several basic types of elements:
block-level elements describing the structure of the document, for example <hn> for
headings, <p> for marking paragraphs, ...
text-level elements for marking a style of the text, either logical (e.g. <em>, <strong>, ...)
or physical (<i>, ...)
character-level elements and character references. Character references or entities have
two functions:
escaping special characters (e.g. write &lt; to get "<" on the screen)
displaying other characters not available in the plain ASCII character set (e.g. write
&yen; to get the Japanese currency symbol ¥, or &deg; for the degree symbol °)
hypertext anchors
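
The two functions of character references can be tried out with Python's standard html module. The wrapper names below are assumptions made for this sketch.

```python
import html


def escape_for_page(text):
    """Replace &, < and > with character references so they display literally."""
    return html.escape(text, quote=False)


def decode_references(text):
    """Turn character references such as &yen; back into the characters."""
    return html.unescape(text)
```

For instance, escape_for_page("a < b") yields "a &lt; b", and decode_references("&yen;") yields the yen sign.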

Block-level elements
<hn> 6 levels of headings
<p> Paragraphs
<pre> Generates text in a fixed-width font. This element also preserves spaces, new lines,
and tabs.
<blockquote>, <q> Lengthy quotations in a separate block on the screen. Most browsers
change the margins for the quotation to separate it from surrounding text.
Often misused to indent text!
<br>, <hr> Forced line breaks and horizontal lines.
<sub>, <sup> Subscripts and superscripts.
<ul>, <ol>, <dl>, <li> Unnumbered, numbered, and definition lists. You can nest lists
too.
<table>

Example 1: Lists
<ol>
<li> A few New England states:
<ul>
<li> Vermont
<li> New Hampshire
<li> Maine
</ul>
<li> Two Midwestern states:

<ul>
<li> Michigan
<li> Indiana
</ul>
</ol>
1. A few New England states:
Vermont
New Hampshire
Maine
2. Two Midwestern states:
Michigan
Indiana

Tables
Tables were added to the HTML specification in version 3.2. Until then, authors
had to carefully format their tabular information within <pre> tags, counting spaces and
previewing their output.
Tables are very useful for presenting tabular information and are also a boon to
creative HTML authors who use the table tags to lay out their regular Web pages, although
using tables for layout is not recommended.
A table has header cells that explain what the columns or rows contain, rows of
information, and cells for each item.

Example 1: Simple table


<table>
<caption> caption contents </caption>
<tr>
<th> first header cell contents </th>
<th> last header cell contents </th>
</tr>
<tr>

<td> first row, first cell contents </td>
<td> first row, last cell contents </td>
</tr>
<tr>
<td> last row, first cell contents </td>
<td> last row, last cell contents </td>
</tr>
</table>

Linking to other HTML documents


The most important capability of HTML, which makes HTML a language for
creating hypertext, is its ability to create hyperlinks to text elsewhere in the same page, to
another page on the same server, or to a page on a different server.
The links to another piece of text or object (for example, an image) are marked by the
anchor element <a>. The target or start of a hyperlink is specified by an attribute of
element <a>. Several types of anchors can be used:
<a href="whereto">anchor</a> for linking to an object with URI "whereto"
<a name="thistext">anchor</a> where the identifier "thistext" is used to name the anchored
text as the possible target of a hypertext link. Named anchors can be targeted from within the
same or another document using the href attribute with a # prepended to the name.
From the same document:
<a href="#thistext">Poisonous mushrooms</a>
From another document:
<a href="http://www.snakes.com/poisonous.html#thistext">Poisonous snakes</a>

Including an image
In addition to the src attribute, you should include two other attributes on <img> tags to tell
your browser the size of the images it is downloading. The height and width attributes let
your browser set aside the appropriate space for the images as it downloads the rest of the
file.
Some web browsers -- primarily the text-only browsers such as Lynx -- cannot display
images. HTML provides a mechanism to tell readers what they are missing on your pages if
they can’t load images. The alt attribute lets you specify text to be displayed instead of an image.
For example:
<img src="UpArrow.gif" alt="Up">

HTML Forms

HTML forms are used to pass data to a server.

A form can contain input elements like text fields, checkboxes, radio-buttons, submit
buttons and more. A form can also contain select lists, textarea, fieldset, legend, and label
elements.

The <form> tag is used to create an HTML form:

<form>
.input elements
.
</form>

HTML Forms - The Input Element

The most important form element is the input element.

The input element is used to select user information.

An input element can vary in many ways, depending on the type attribute. An input
element can be of type text field, checkbox, password, radio button, submit button, and
more.

The most used input types are described below.

Text Fields

<input type="text" /> defines a one-line input field that a user can enter text into:

<form>
First name: <input type="text" name="firstname" /><br />
Last name: <input type="text" name="lastname" />
</form>

How the HTML code above looks in a browser:

First name:

Last name:

Note: The form itself is not visible. Also note that the default width of a text field is 20
characters.

Password Field

<input type="password" /> defines a password field:

<form>
Password: <input type="password" name="pwd" />
</form>

How the HTML code above looks in a browser:

Password:

Radio Buttons

<input type="radio" /> defines a radio button. Radio buttons let a user select ONLY ONE
of a limited number of choices:

<form>
<input type="radio" name="sex" value="male" /> Male<br />
<input type="radio" name="sex" value="female" /> Female
</form>

How the HTML code above looks in a browser:

Male

Female

Checkboxes

<input type="checkbox" /> defines a checkbox. Checkboxes let a user select ONE or
MORE options of a limited number of choices.

<form>
<input type="checkbox" name="vehicle" value="Bike" /> I have a bike<br />
<input type="checkbox" name="vehicle" value="Car" /> I have a car
</form>

How the HTML code above looks in a browser:

I have a bike

I have a car

Submit Button

<input type="submit" /> defines a submit button.

A submit button is used to send form data to a server. The data is sent to the page
specified in the form's action attribute. The file defined in the action attribute usually
does something with the received input:

<form name="input" action="html_form_action.asp" method="get">


Username: <input type="text" name="user" />
<input type="submit" value="Submit" />
</form>

How the HTML code above looks in a browser:

Username:
Submit

If you type some characters in the text field above, and click the "Submit" button, the
browser will send your input to a page called "html_form_action.asp". The page will
show you the received input.
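
With method="get", the browser appends the form fields to the action URL as a query string. A minimal sketch of that encoding step in Python follows. The function name and the sample value are assumptions, and only the standard urllib.parse.urlencode call is used.

```python
from urllib.parse import urlencode


def form_get_url(action, fields):
    """Build the URL requested when a method="get" form is submitted."""
    # Each field becomes name=value; pairs are joined with "&" and
    # characters such as spaces are escaped.
    return action + "?" + urlencode(fields)
```

For the form above, typing "Ada Lovelace" into the user field would lead to a request for html_form_action.asp?user=Ada+Lovelace.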

HTML Frames

With frames, you can display more than one HTML document in the same browser
window. Each HTML document is called a frame, and each frame is independent of the
others.

The disadvantages of using frames are:

 The web developer must keep track of more HTML documents


 It is difficult to print the entire page

The HTML frameset Element

The frameset element holds two or more frame elements. Each frame element holds a
separate document.

The frameset element states only HOW MANY columns or rows there will be in the
frameset.

The HTML frame Element

The <frame> tag defines one particular window (frame) within a frameset.

In the example below we have a frameset with two columns.

The first column is set to 25% of the width of the browser window. The second column is
set to 75% of the width of the browser window. The document "frame_a.htm" is put into
the first column, and the document "frame_b.htm" is put into the second column:

<frameset cols="25%,75%">
<frame src="frame_a.htm" />
<frame src="frame_b.htm" />
</frameset>

Note: The frameset column size can also be set in pixels (cols="200,500"), and one of the
columns can be set to use the remaining space, with an asterisk (cols="25%,*").

Basic Notes - Useful Tips

Tip: If a frame has visible borders, the user can resize it by dragging the border. To
prevent a user from doing this, you can add noresize="noresize" to the <frame> tag.

HTML Frame Tags

Tag          Description
<frameset>   Defines a set of frames
<frame />    Defines a sub window (a frame)
<noframes>   Defines a noframe section for browsers that do not handle frames
<iframe>     Defines an inline sub window (frame)

XML

HTML has some drawbacks that have forced developers to seek new avenues. The
following are some points of concern to the internet community and serious programmers
as well.

1. HTML lacks syntactic checking: This means, there is no scientific method for
validating its code.

2. HTML lacks structure. For example, HTML has an ordered set of heading tags (h1 to
h6) and browsers do not care how the heading tags are nested. That is, there is no
hierarchical ordering for these tags. This is somewhat uncomfortable.

3. HTML is not international. There is no tag in HTML to identify the language used.

4. HTML is not object-oriented.

5. HTML lacks a robust linking mechanism. HTML’s links are very much one-to-one,
with the linking hard-coded in the source HTML files. If the location of one target file
changes, a Webmaster may have to update dozens or even hundreds of other pages.

6. HTML is not reusable

7. HTML is not extensible.

8. HTML tags are not content-aware. So complex databases, mathematical equations,
chemical formulae etc., cannot be handled.

Thus we see that there are reasons enough why people should look for some
alternative, and in that respect XML seems to fulfil many of the demands of present-day
web managers. Will XML replace HTML completely? At present it appears that
XML will only supplement HTML and not supplant it.

What is XML?

XML stands for eXtensible Markup Language. It is really a fantastic presentation. It
evolved from a more generalized markup language known as SGML (Standard
Generalized Markup Language), released during 1985-86. SGML is a very complex
language mainly used in big government departments, companies and the army for storing
and transferring large volumes of data in electronic form. Since SGML was too
cumbersome for use in smaller establishments, XML was developed during the years
1996-97 and is supposed to contain all the salient features of SGML without its
complexity.

Evolution of Markup Languages:

The main purpose of Markup languages is to provide a cross-platform tool for safe
transmission of information in the form of texts, pictures etc. through internet. People
receiving the information through the internet, look not merely to its contents but also to
its presentation. For purposes of good presentation, markups or tags were used for
identification. For example any information between two markups (<b>..</b>) will
appear bold.

HTML versus XML:

HTML is good enough when documents could be expressed in the form of texts, lists
or tables. But not all documents could be expressed in such simple forms.

Design Goals of XML:

The design goals for XML were set up as follows and the language was developed on
these lines:

1. XML shall be straightforwardly usable over the Internet.
2. XML shall support a wide variety of applications.
3. XML shall be compatible with SGML.
4. It shall be easy to write programs which process XML documents.
5. The number of optional features in XML is to be kept to the absolute minimum, ideally
zero.
6. XML documents should be human-legible and reasonably clear.
7. The XML design should be prepared quickly.
8. The design of XML shall be formal and concise.
9. XML documents shall be easy to create.
10. Terseness is of minimal importance.

Steps involved in creating XML Programs?

We must first learn about the basic building blocks of an XML document like
elements and entities. Then we must learn how to build an XML document which is
well-formed and also valid.

1. In order to check the validity of an XML document, we must write another document
called the DTD, which gives the grammar rules pertaining to the construction and
organization of elements and entities.

2. As we have said before, XML documents by themselves have no powers for
processing and display, and so we have to write supporting programs in either CSS or
XSL depending on the requirement and check their correctness.

3. The final step is to open the XML document in the browser window, when you will
get the desired output.

A simple XML Document:

Let us consider a simple XML document as shown in the figure 1, which gives
information about two books, their titles, the authors’ names and their codes.

<?xml version="1.0"?>
<booklist>
<books>
<code>isb6734</code>
<title>HTML programming</title>
<author>jackson</author>
</books>
<books>
<code>idr3562</code>
<title>DHTML programming</title>
<author>jeffrey</author>
</books>
</booklist>
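
Well-formedness of a document like this can be checked with any XML parser. The Python sketch below uses the standard xml.etree.ElementTree module. The function names are assumptions, and the embedded text is the book list with straight quotes and a proper <books> start tag. Note that this checks well-formedness only. Validity against a DTD needs a validating parser.

```python
import xml.etree.ElementTree as ET

BOOKLIST = """<?xml version="1.0"?>
<booklist>
  <books>
    <code>isb6734</code>
    <title>HTML programming</title>
    <author>jackson</author>
  </books>
  <books>
    <code>idr3562</code>
    <title>DHTML programming</title>
    <author>jeffrey</author>
  </books>
</booklist>
"""


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def book_titles(text):
    """Return the title of every <books> entry, in document order."""
    root = ET.fromstring(text)
    return [book.findtext("title") for book in root.findall("books")]
```

Dropping the angle brackets from one <books> start tag leaves its end tag unmatched, so is_well_formed() then returns False.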

Attribute Markup:

Suppose we want to add an attribute to the author’s name regarding sex, we can
simply add the attribute by writing
<author sex="m">jeffrey</author>

In the HTML program you have to open another column and completely recast the
program.

Empty Tags:

XML allows empty tags, i.e., tags without any data in between, and they can be written
in a shortened form ‘<pincode/>’ instead of writing ‘<pincode> </pincode>’. In the
shortened form the slash sign is in the trailing position, whereas in the normal form the
slash sign is in the leading position. Usually empty tags have one or more attributes,
each written as a name and a quoted value, for example ‘<pincode value="56789"/>’.

Viewing the XML document in IE4:

<html>
<head>
<title>my personal library</title>
</head>
<body>
<hr>
<p>my personal library books are:</p>
<APPLET code="com.ms.xml.dso.XMLDSO.class"
id=xmldso
width=0
height=0
mayscript='true' datasrc=#xmldso>

<param name="URL" value="booklist.xml">
</APPLET>
<table border=1 DATASRC="#xmldso">
<thead>
<th>code</th>
<th>title</th>
<th>author</th>
</thead>
<tr><td><span DATAFLD="code"></span></td>
<td><span DATAFLD="title"></span></td>
<td><span DATAFLD="author"></span></td></tr></table>
</body>
</html>
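The page above relies on a data-binding applet to fill the table from booklist.xml. The same table can also be produced outside the browser. The following is a minimal sketch of that alternative technique in Python (it is not the applet's own mechanism); the function name booklist_to_table, the FIELDS tuple, and the sample string are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Column order of the table, matching the header row of the page above.
FIELDS = ("code", "title", "author")

def booklist_to_table(xml_text):
    """Render every <books> element of a book list as one HTML table row."""
    root = ET.fromstring(xml_text)
    rows = ['<table border="1">',
            "<tr>" + "".join(f"<th>{name}</th>" for name in FIELDS) + "</tr>"]
    for book in root.findall("books"):
        # findtext returns the child's text, or the default if it is absent.
        cells = "".join(
            f"<td>{escape(book.findtext(name, default=''))}</td>"
            for name in FIELDS
        )
        rows.append(f"<tr>{cells}</tr>")
    rows.append("</table>")
    return "\n".join(rows)

# Hypothetical one-book sample, used only for illustration.
sample = ("<booklist><books><code>isb6734</code>"
          "<title>HTML programming</title>"
          "<author>jackson</author></books></booklist>")
print(booklist_to_table(sample))
```

Escaping each value with html.escape keeps characters such as < and & in the data from being read as markup.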

Creating Your HTML Document

An HTML document contains two distinct parts, the head and the body. The head
contains information about the document that is not displayed on the screen. The body
then contains everything else that is displayed as part of the web page.

The basic structure then of any HTML page is:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">


<html>
<head>
<!-- header info used to contain extra information about
this document, not displayed on the page -->

</head>

<body>

<!-- all the HTML for display -->


: :
: :
: :
</body>
</html>

The very first line:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

is not technically required; it is a declaration that tells the browser which version of
HTML the current page is written for.

Enclose all HTML content within <html>...</html> tags. Inside them, the
<head>...</head> section comes first, followed by the <body>...</body> section.
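The ordering rule can be checked with a few lines of code. The sketch below uses Python's standard html.parser module to record the order in which the three structural tags open; the class name StructureChecker, the function has_basic_structure, and the sample page are assumptions made for this illustration.

```python
from html.parser import HTMLParser

class StructureChecker(HTMLParser):
    """Record the order in which the html, head and body tags open."""

    def __init__(self):
        super().__init__()
        self.opened = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lower case.
        if tag in ("html", "head", "body"):
            self.opened.append(tag)

def has_basic_structure(page):
    """Return True if the page opens html, then head, then body, once each."""
    checker = StructureChecker()
    checker.feed(page)
    checker.close()
    return checker.opened == ["html", "head", "body"]

# Hypothetical sample page, used only for illustration.
page = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head><title>Volcano Web</title></head>
<body>In this lesson you will use the Internet.</body>
</html>"""

print(has_basic_structure(page))  # prints: True
```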

Here are the steps for creating your first HTML file. Are you ready?

1. If it is not open already, launch your text editor program.


2. Go to the text editor window.
3. Enter the following text (you do not have to press RETURN at the end of each
line; the web browser will word wrap all text):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">


<html>
<head>
<title>Volcano Web</title>
</head>

<!-- written for the Writing HTML Tutorial
by Lorrie Lava, February 31, 1999 -->
<body>
In this lesson you will use the Internet to research
information on volcanoes and then write a report on
your results.
</body>
</html>

4. Save the document as a file called "ex.html" and keep it in the "work area"
folder/directory you set up for this tutorial. Also, if you are using a word
processor program to create your HTML, be sure to save in plain text (or ASCII)
format.

By using the .html file name extension, you let a web browser know that it should read
the text file as HTML and display it as a web page.
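The save step can also be done from a short script instead of a text editor. The following is a minimal sketch in Python; the function save_page is a name assumed for this illustration, and a temporary directory stands in for the "work area" folder so that running the example leaves no files behind.

```python
import tempfile
from pathlib import Path

# The page from step 3, held in a string for this illustration.
PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>Volcano Web</title>
</head>
<body>
In this lesson you will use the Internet to research
information on volcanoes and then write a report on
your results.
</body>
</html>
"""

def save_page(folder, filename, text):
    """Write text to folder/filename as plain ASCII and return the path."""
    path = Path(folder) / filename
    path.write_text(text, encoding="ascii")
    return path

# A temporary directory stands in for the "work area" folder.
with tempfile.TemporaryDirectory() as work_area:
    saved = save_page(work_area, "ex.html", PAGE)
    print(saved.name, saved.suffix)                   # prints: ex.html .html
    print(saved.read_text(encoding="ascii") == PAGE)  # prints: True
```

Writing with encoding="ascii" mirrors the advice to save in plain text format: any character outside ASCII would raise an error rather than be saved silently.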

Displaying Your Document in a Web Browser

1. Return to the web browser window you are using for your "work space".
2. Select Open File... from the File menu. (Note: For users of Internet Explorer,
click the Browse button to select your file)
3. Use the dialog box to find and open the file you created, "ex.html"
4. You should now see the text "Volcano Web" in the title bar of the workspace
window; it comes from the <title> element of the page.
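A browser takes its title-bar text from the <title> element in the head of the page. As a minimal sketch of that lookup, the Python code below uses the standard html.parser module to extract the title; the class name TitleReader, the function read_title, and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TitleReader(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Keep text only while we are inside the title element.
        if self.in_title:
            self.title += data

def read_title(page):
    """Return the page title with surrounding whitespace removed."""
    reader = TitleReader()
    reader.feed(page)
    reader.close()
    return reader.title.strip()

# Hypothetical sample page, used only for illustration.
page = "<html><head><title>Volcano Web</title></head><body>...</body></html>"
print(read_title(page))  # prints: Volcano Web
```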
