@@ -5,19 +5,23 @@ \section{Standard Module \sectcode{htmllib}}
55
66\renewcommand {\indexsubitem }{(in module htmllib)}
77
8- This module defines a number of classes which can serve as a basis for
9- parsing text files formatted in HTML (HyperText Mark-up Language).
10- The classes are not directly concerned with I/O --- the have to be fed
11- their input in string form, and will make calls to methods of a
12- `` formatter'' object in order to produce output. The classes are
13- designed to be used as base classes for other classes in order to add
14- functionality, and allow most of their methods to be extended or
15- overridden. In turn, the classes are derived from and extend the
16- class \code {SGMLParser} defined in module \code {sgmllib}.
8+ This module defines a class which can serve as a base for parsing text
9+ files formatted in the HyperText Mark-up Language (HTML). The class
10+ is not directly concerned with I/O --- it must be provided with input
11+ in string form via a method, and makes calls to methods of a
12+ `` formatter'' object in order to produce output. The
13+ \code {HTMLParser} class is designed to be used as a base class for
14+ other classes in order to add functionality, and allows most of its
15+ methods to be extended or overridden. In turn, this class is derived
16+ from and extends the \code {SGMLParser} class defined in module
17+ \code {sgmllib}. Two implementations of formatter objects are
18+ provided in the \code {formatter} module; refer to the documentation
19+ for that module for information on the formatter interface.
1720\index {SGML}
1821\stmodindex {sgmllib}
1922\ttindex {SGMLParser}
2023\index {formatter}
24+ \stmodindex {formatter}
2125
2226The following is a summary of the interface defined by
2327\code {sgmllib.SGMLParser}:
@@ -27,15 +31,17 @@ \section{Standard Module \sectcode{htmllib}}
2731\item
2832The interface to feed data to an instance is through the \code {feed()}
2933method, which takes a string argument. This can be called with as
30- little or as much text at a time as desired;
31- \code {p.feed(a); p.feed(b)} has the same effect as \code {p.feed(a+b)}.
32- When the data contains complete
33- HTML elements, these are processed immediately; incomplete elements
34- are saved in a buffer. To force processing of all unprocessed data,
35- call the \code {close()} method.
36-
37- Example: to parse the entire contents of a file, do\\
38- \code {parser.feed(open(file).read()); parser.close()}.
34+ little or as much text at a time as desired; \code {p.feed(a);
35+ p.feed(b)} has the same effect as \code {p.feed(a+b)}. When the data
36+ contains complete HTML tags, these are processed immediately;
37+ incomplete elements are saved in a buffer. To force processing of all
38+ unprocessed data, call the \code {close()} method.
39+
40+ For example, to parse the entire contents of a file, use:
41+ \begin {verbatim }
42+ parser.feed(open('myfile.html').read())
43+ parser.close()
44+ \end {verbatim }
3945
4046\item
4147The interface to define semantics for HTML tags is very simple: derive
@@ -52,223 +58,60 @@ \section{Standard Module \sectcode{htmllib}}
5258
5359\end {itemize }
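
The chunking behaviour of \code{feed()} described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the \code{htmllib} implementation: \code{htmllib} exists only in Python 2, so the sketch uses \code{html.parser.HTMLParser} from the Python 3 standard library, which exposes the same \code{feed()} and \code{close()} calls. The class \code{EventLogger} and the function \code{parse()} are hypothetical names introduced for this example.

```python
# Python 3 sketch: htmllib is Python 2 only, so html.parser.HTMLParser
# is used here as a stand-in with the same feed()/close() interface.
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records tags and text as they are parsed."""

    def __init__(self):
        super().__init__()
        self.tags = []   # ("start" | "end", tag name) in document order
        self.text = []   # pieces of character data, possibly split up

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))

    def handle_data(self, data):
        self.text.append(data)


def parse(chunks):
    """Feed each chunk in turn, then close() to flush buffered input."""
    parser = EventLogger()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    # Character data may arrive in several pieces, so join it first.
    return parser.tags, "".join(parser.text)


# One call with the whole document, then the same text in four pieces,
# with some of the breaks falling in the middle of a tag.
whole = parse(["<p>Hello <b>world</b></p>"])
split = parse(["<p>He", "llo <b>wor", "ld</b></", "p>"])
```

Both calls should report the same tags and the same text, since an incomplete tag is held in the buffer until the rest of it arrives.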
5460
55- The module defines the following classes:
56-
57- \begin {funcdesc }{HTMLParser}{}
58- This is the most basic HTML parser class. It defines one additional
59- entity name over the names defined by the \code {SGMLParser} base
60- class, \code {\& bullet;}. It also defines handlers for the following
61- tags: \code {<LISTING>...</LISTING>}, \code {<XMP>...</XMP>}, and
62- \code {<PLAINTEXT>} (the latter is terminated only by end of file).
63- \end {funcdesc }
64-
65- \begin {funcdesc }{CollectingParser}{}
66- This class, derived from \code {HTMLParser}, collects various useful
67- bits of information from the HTML text. To this end it defines
68- additional handlers for the following tags: \code {<A>...</A>},
69- \code {<HEAD>...</HEAD>}, \code {<BODY>...</BODY>},
70- \code {<TITLE>...</TITLE>}, \code {<NEXTID>}, and \code {<ISINDEX>}.
71- \end {funcdesc }
72-
73- \begin {funcdesc }{FormattingParser}{formatter\, stylesheet}
74- This class, derived from \code {CollectingParser}, interprets a wide
75- selection of HTML tags so it can produce formatted output from the
76- parsed data. It is initialized with two objects, a \var {formatter}
77- which should define a number of methods to format text into
78- paragraphs, and a \var {stylesheet} which defines a number of static
79- parameters for the formatting process. Formatters and style sheets
80- are documented later in this section.
81- \index {formatter}
82- \index {style sheet}
83- \end {funcdesc }
61+ The module defines a single class:
8462
85- \begin {funcdesc }{AnchoringParser}{formatter\, stylesheet}
86- This class, derived from \code {FormattingParser}, extends the handling
87- of the \code {<A>...</A>} tag pair to call the formatter's
88- \code {bgn_anchor()} and \code {end_anchor()} methods. This allows the
89- formatter to display the anchor in a different font or color, etc.
63+ \begin {funcdesc }{HTMLParser}{formatter}
64+ This is the basic HTML parser class. It supports all entity names
65+ required by the HTML 2.0 specification (RFC 1866). It also defines
66+ handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
9067\end {funcdesc }
9168
92- Instances of \code {CollectingParser} (and thus also instances of
93- \code {FormattingParser} and \code {AnchoringParser}) have the following
94- instance variables:
95-
96- \begin {datadesc }{anchornames}
97- A list of the values of the \code {NAME} attributes of the \code {<A>}
98- tags encountered.
99- \end {datadesc }
100-
101- \begin {datadesc }{anchors}
102- A list of the values of \code {HREF} attributes of the \code {<A>} tags
103- encountered.
104- \end {datadesc }
105-
106- \begin {datadesc }{anchortypes}
107- A list of the values of the \code {TYPE} attributes of the \code {<A>}
108- tags encountered.
109- \end {datadesc }
110-
111- \begin {datadesc }{inanchor}
112- Outside an \code {<A>...</A>} tag pair, this is zero. Inside such a
113- pair, it is a unique integer, which is positive if the anchor has a
114- \code {HREF} attribute, negative if it hasn't. Its absolute value is
115- one more than the index of the anchor in the \code {anchors},
116- \code {anchornames} and \code {anchortypes} lists.
117- \end {datadesc }
118-
119- \begin {datadesc }{isindex}
120- True if the \code {<ISINDEX>} tag has been encountered.
121- \end {datadesc }
122-
123- \begin {datadesc }{nextid}
124- The attribute list of the last \code {<NEXTID>} tag encountered, or
125- an empty list if none.
126- \end {datadesc }
127-
128- \begin {datadesc }{title}
129- The text inside the last \code {<TITLE>...</TITLE>} tag pair, or
130- \code {''} if no title has been encountered yet.
131- \end {datadesc }
132-
133- The \code {anchors}, \code {anchornames} and \code {anchortypes} lists
134- are `` parallel arrays'' : items in these lists with the same index
135- pertain to the same anchor. Missing attributes default to the empty
136- string. Anchors with neither a \code {HREF} nor a \code {NAME}
137- attribute are not entered in these lists at all.
138-
139- The module also defines a number of style sheet classes. These should
140- never be instantiated --- their class variables are the only behavior
141- required. Note that style sheets are specifically designed for a
142- particular formatter implementation. The currently defined style
143- sheets are:
144- \index {style sheet}
145-
146- \begin {datadesc }{NullStylesheet}
147- A style sheet for use on a dumb output device such as an \ASCII {}
148- terminal.
149- \end {datadesc }
150-
151- \begin {datadesc }{X11Stylesheet}
152- A style sheet for use with an X11 server.
153- \end {datadesc }
154-
155- \begin {datadesc }{MacStylesheet}
156- A style sheet for use on Apple Macintosh computers.
157- \end {datadesc }
158-
159- \begin {datadesc }{StdwinStylesheet}
160- A style sheet for use with the \code {stdwin} module; it is an alias
161- for either \code {X11Stylesheet} or \code {MacStylesheet}.
162- \bimodindex {stdwin}
163- \end {datadesc }
164-
165- \begin {datadesc }{GLStylesheet}
166- A style sheet for use with the SGI Graphics Library and its font
167- manager (the SGI-specific built-in modules \code {gl} and \code {fm}).
168- \bimodindex {gl}
169- \bimodindex {fm}
170- \end {datadesc }
171-
172- Style sheets have the following class variables:
173-
174- \begin {datadesc }{stdfontset}
175- A list of up to four font definititions, respectively for the roman,
176- italic, bold and constant-width variant of a font for normal text. If
177- the list contains less than four font definitions, the last item is
178- used as the default for missing items. The type of a font definition
179- depends on the formatter in use; its only use is as a parameter to the
180- formatter's \code {setfont()} method.
181- \end {datadesc }
69+ In addition to tag methods, the \code {HTMLParser} class provides some
70+ additional methods and instance variables for use within tag methods.
18271
183- \begin {datadesc }{h1fontset}
184- \dataline {h2fontset}
185- \dataline {h3fontset}
186- The font set used for various headers (text inside \code {<H1>...</H1>}
187- tag pairs etc.).
72+ \begin {datadesc }{formatter}
73+ This is the formatter instance associated with the parser.
18874\end {datadesc }
18975
190- \begin {datadesc }{stdindent}
191- The indentation of normal text. This is measured in the `` native''
192- units of the formatter in use; for some formatters these are
193- characters, for others (especially those that actually support
194- variable-spacing fonts) in pixels or printer points.
76+ \begin {datadesc }{nofill}
77+ Boolean flag which should be true when whitespace should not be
78+ collapsed, or false when it should be. In general, this should only
79+ be true when character data is to be treated as `` preformatted'' text,
80+ as within a \code {<PRE>} element. The default value is false. This
81+ affects the operation of \code {handle_data()} and \code {save_end()}.
19582\end {datadesc }
19683
197- \begin {datadesc }{ddindent}
198- The indentation used for the first level of \code {<DD>} tags.
199- \end {datadesc }
200-
201- \begin {datadesc }{ulindent}
202- The indentation used for the first level of \code {<UL>} tags.
203- \end {datadesc }
204-
205- \begin {datadesc }{h1indent}
206- The indentation used for level 1 headers.
207- \end {datadesc }
208-
209- \begin {datadesc }{h2indent}
210- The indentation used for level 2 headers.
211- \end {datadesc }
212-
213- \begin {datadesc }{literalindent}
214- The indentation used for literal text (text inside
215- \code {<PRE>...</PRE>} and similar tag pairs).
216- \end {datadesc }
217-
218- Although no documented implementation of a formatter exists, the
219- \code {FormattingParser} class assumes that formatters have a
220- certain interface. This interface requires the following methods:
221- \index {formatter}
222-
223- \begin {funcdesc }{setfont}{fontspec}
224- Set the font to be used subsequently. The \var {fontspec} argument is
225- an item in a style sheet's font set.
226- \end {funcdesc }
227-
228- \begin {funcdesc }{flush}{}
229- Finish the current line, if not empty, and begin a new one.
84+ \begin {funcdesc }{anchor_bgn}{href\, name\, type}
85+ This method is called at the start of an anchor region. The arguments
86+ correspond to the attributes of the \code {<A>} tag with the same
87+ names. The default implementation maintains a list of hyperlinks
88+ (defined by the \code {href} argument) within the document. The list
89+ of hyperlinks is available as the data attribute \code {anchorlist}.
23090\end {funcdesc }
23191
232- \begin {funcdesc }{setleftindent}{n}
233- Set the left indentation of the following lines to \var {n} units.
92+ \begin {funcdesc }{anchor_end}{}
93+ This method is called at the end of an anchor region. The default
94+ implementation adds a textual footnote marker using an index into the
95+ list of hyperlinks created by \code {anchor_bgn()}.
23496\end {funcdesc }
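
The bookkeeping described for \code{anchor_bgn()} and \code{anchor_end()} can be sketched with a small stand-in class. This is an illustration only and is not taken from \code{htmllib}: the class name \code{AnchorNotes}, the \code{output} list that stands in for the formatter, and the \code{[n]} marker format are all assumptions made for this example.

```python
class AnchorNotes:
    """Stand-in written for this example; not the htmllib source."""

    def __init__(self):
        self.anchorlist = []   # hyperlinks seen so far, in document order
        self.anchor = None     # href of the anchor currently open, if any
        self.output = []       # stands in for data sent to the formatter

    def handle_data(self, data):
        self.output.append(data)

    def anchor_bgn(self, href, name, type):
        # Only anchors that carry an href are recorded as hyperlinks.
        self.anchor = href
        if href:
            self.anchorlist.append(href)

    def anchor_end(self):
        # Assumed marker format: "[n]", where n is the 1-based position
        # of the hyperlink in anchorlist.
        if self.anchor:
            self.handle_data("[%d]" % len(self.anchorlist))
        self.anchor = None


notes = AnchorNotes()
notes.anchor_bgn("http://www.python.org/", "", "")
notes.handle_data("Python")
notes.anchor_end()
notes.anchor_bgn("", "top", "")      # a named anchor without an href
notes.handle_data("Top")
notes.anchor_end()
```

A subclass that wants hyperlinks rendered differently would override these two methods rather than the tag handlers themselves.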
23597
236- \begin {funcdesc }{needvspace}{n}
237- Require at least \var {n} blank lines before the next line. Implies
238- \code {flush()}.
98+ \begin {funcdesc }{handle_image}{source\, alt\optional {\, ismap\optional {\, align\optional {\, width\optional {\, height}}}}}
99+ This method is called to handle images. The default implementation
100+ simply passes the \code {alt} value to the \code {handle_data()}
101+ method.
239102\end {funcdesc }
240103
241- \begin {funcdesc }{addword}{word\, space}
242- Add a \var {word} to the current paragraph, followed by \var {space}
243- spaces.
104+ \begin {funcdesc }{save_bgn}{}
105+ Begins saving character data in a buffer instead of sending it to the
106+ formatter object. Retrieve the stored data via \code {save_end()}.
107+ Use of the \code {save_bgn()} / \code {save_end()} pair may not be
108+ nested.
244109\end {funcdesc }
245110
246- \begin {datadesc }{nospace}
247- If this instance variable is true, empty words should be ignored by
248- \code {addword}. It should be set to false after a non-empty word has
249- been added.
250- \end {datadesc }
251-
252- \begin {funcdesc }{setjust}{justification}
253- Set the justification of the current paragraph. The
254- \var {justification} can be \code {'c'} (center), \code {'l'} (left
255- justified), \code {'r'} (right justified) or \code {'lr'} (left and
256- right justified).
257- \end {funcdesc }
258-
259- \begin {funcdesc }{bgn_anchor}{id}
260- Begin an anchor. The \var {id} parameter is the value of the parser's
261- \code {inanchor} attribute.
111+ \begin {funcdesc }{save_end}{}
112+ Ends buffering character data and returns all data saved since the
113+ preceding call to \code {save_bgn()}. If the \code {nofill} flag is false,
114+ whitespace is collapsed to single spaces. A call to this method
115+ without a preceding call to \code {save_bgn()} will raise a
116+ \code {TypeError} exception.
262117\end {funcdesc }
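
The interaction between \code{save_bgn()}, \code{save_end()}, \code{handle_data()} and \code{nofill} can be sketched with a stand-in class. This is a simplified model of the behaviour described above, not the \code{htmllib} source: the class name \code{SaveBuffer}, the \code{savedata} attribute and the \code{sent} list that stands in for the formatter are assumptions made for this example.

```python
class SaveBuffer:
    """Stand-in written for this example; not the htmllib source."""

    def __init__(self):
        self.nofill = False    # false: save_end() collapses whitespace
        self.savedata = None   # None means "not currently saving"
        self.sent = []         # stands in for data passed to the formatter

    def handle_data(self, data):
        # While saving, character data goes to the buffer instead of
        # the formatter.
        if self.savedata is not None:
            self.savedata += data
        else:
            self.sent.append(data)

    def save_bgn(self):
        self.savedata = ""

    def save_end(self):
        if self.savedata is None:
            raise TypeError("save_end() called without save_bgn()")
        data, self.savedata = self.savedata, None
        if not self.nofill:
            # Collapse every run of whitespace to a single space.
            data = " ".join(data.split())
        return data


buf = SaveBuffer()
buf.save_bgn()
buf.handle_data("  A   title\n")
buf.handle_data("split over chunks ")
title = buf.save_end()

buf.nofill = True                    # preformatted: keep whitespace as is
buf.save_bgn()
buf.handle_data("a  b")
literal = buf.save_end()
```

A typical use is a pair of tag methods for an element such as \code{<TITLE>}: the start method calls \code{save_bgn()} and the end method stores the result of \code{save_end()}.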
263-
264- \begin {funcdesc }{end_anchor}{id}
265- End an anchor. The \var {id} parameter is the value of the parser's
266- \code {inanchor} attribute.
267- \end {funcdesc }
268-
269- A sample formatter implementation can be found in the module
270- \code {fmt}, which in turn uses the module \code {Para}. These modules are
271- not intended as standard library modules; they are available as an
272- example of how to write a formatter.
273- \ttindex {fmt}
274- \ttindex {Para}