1 | 1 | \section{Built-in module \sectcode{htmllib}} |
2 | 2 | \stmodindex{htmllib} |
3 | | -To be provided. |
| 3 | +\index{HTML} |
| 4 | +\index{hypertext} |
| 5 | + |
| 6 | +\renewcommand{\indexsubitem}{(in module htmllib)} |
| 7 | + |
| 8 | +This module defines a number of classes which can serve as a basis for |
| 9 | +parsing text files formatted in HTML (HyperText Mark-up Language). |
| 10 | +The classes are not directly concerned with I/O --- they have to be fed |
| 11 | +their input in string form, and will make calls to methods of a |
| 12 | +``formatter'' object in order to produce output. The classes are |
| 13 | +designed to be used as base classes for other classes in order to add |
| 14 | +functionality, and allow most of their methods to be extended or |
| 15 | +overridden. In turn, the classes are derived from and extend the |
| 16 | +class \code{SGMLParser} defined in module \code{sgmllib}. |
| 17 | +\index{SGML} |
| 18 | +\stmodindex{sgmllib} |
| 19 | +\ttindex{SGMLParser} |
| 20 | +\index{formatter} |
| 21 | + |
| 22 | +The following is a summary of the interface defined by |
| 23 | +\code{sgmllib.SGMLParser}: |
| 24 | + |
| 25 | +\begin{itemize} |
| 26 | + |
| 27 | +\item |
| 28 | +The interface to feed data to an instance is through the \code{feed()} |
| 29 | +method, which takes a string argument. This can be called with as |
| 30 | +little or as much text at a time as desired. When the data contains complete |
| 31 | +HTML elements, these are processed immediately; incomplete elements |
| 32 | +are saved in a buffer. To force processing of all unprocessed data, |
| 33 | +call the \code{close()} method. |
| 34 | + |
| 35 | +Example: to parse the entire contents of a file, do |
| 36 | +\code{parser.feed(open(file).read()); parser.close()}. |
| 37 | + |
| 38 | +\item |
| 39 | +The interface to define semantics for HTML tags is very simple: derive |
| 40 | +a class and define methods called \code{start_\var{tag}()}, |
| 41 | +\code{end_\var{tag}()}, or \code{do_\var{tag}()}. The parser will |
| 42 | +call these at appropriate moments: \code{start_\var{tag}} or |
| 43 | +\code{do_\var{tag}} is called when an opening tag of the form |
| 44 | +\code{<\var{tag} ...>} is encountered; \code{end_\var{tag}} is called |
| 45 | +when a closing tag of the form \code{</\var{tag}>} is encountered. If |
| 46 | +an opening tag requires a corresponding closing tag, like \code{<H1>} |
| 47 | +... \code{</H1>}, the class should define the \code{start_\var{tag}} |
| 48 | +method; if a tag requires no closing tag, like \code{<P>}, the class |
| 49 | +should define the \code{do_\var{tag}} method. |
| 50 | + |
| 51 | +\end{itemize} |
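
The naming convention described above can be illustrated with a small
stand-alone sketch.  It does not use \code{sgmllib}: the class
\code{ToyDispatcher}, its regular expression and the \code{events} list
are assumptions made purely for illustration.  The sketch ignores tag
attributes and does not buffer incomplete input, both of which the
real parser takes care of.

```python
import re

# Hypothetical stand-in for the dispatch convention; this is NOT sgmllib.
# It recognizes only simple tags and discards any attributes.
TAG_RE = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>')


class ToyDispatcher:
    def __init__(self):
        self.events = []  # record of handler calls, for inspection

    def feed(self, data):
        pos = 0
        for m in TAG_RE.finditer(data):
            if m.start() > pos:
                self.handle_data(data[pos:m.start()])
            closing, tag = m.group(1), m.group(2).lower()
            if closing:
                method = getattr(self, 'end_' + tag, None)
            else:
                # Look for start_<tag> first, then fall back to do_<tag>.
                method = (getattr(self, 'start_' + tag, None)
                          or getattr(self, 'do_' + tag, None))
            if method is not None:
                method()
            pos = m.end()
        if pos < len(data):
            self.handle_data(data[pos:])

    def handle_data(self, text):
        self.events.append(('data', text))


class MyParser(ToyDispatcher):
    # <H1> needs a closing tag, so it gets start_/end_ methods.
    def start_h1(self):
        self.events.append(('start', 'h1'))

    def end_h1(self):
        self.events.append(('end', 'h1'))

    # <P> needs no closing tag, so it gets a do_ method.
    def do_p(self):
        self.events.append(('do', 'p'))


parser = MyParser()
parser.feed('<H1>Title</H1><P>Body')
```

After the call to \code{feed()}, \code{parser.events} lists the handler
calls in document order, with the text between tags reported as data.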
| 52 | + |
| 53 | +The module defines the following classes: |
| 54 | + |
| 55 | +\begin{funcdesc}{HTMLParser}{} |
| 56 | +This is the most basic HTML parser class. It defines one additional |
| 57 | +entity name over the names defined by the \code{SGMLParser} base |
| 58 | +class, \code{\&bullet;}. It also defines handlers for the following |
| 59 | +tags: \code{<LISTING>...</LISTING>}, \code{<XMP>...</XMP>}, and |
| 60 | +\code{<PLAINTEXT>} (the latter is terminated only by end of file). |
| 61 | +\end{funcdesc} |
| 62 | + |
| 63 | +\begin{funcdesc}{CollectingParser}{} |
| 64 | +This class, derived from \code{HTMLParser}, collects various useful |
| 65 | +bits of information from the HTML text. To this end it defines |
| 66 | +additional handlers for the following tags: \code{<A>...</A>}, |
| 67 | +\code{<HEAD>...</HEAD>}, \code{<BODY>...</BODY>}, |
| 68 | +\code{<TITLE>...</TITLE>}, \code{<NEXTID>}, and \code{<ISINDEX>}. |
| 69 | +\end{funcdesc} |
| 70 | + |
| 71 | +\begin{funcdesc}{FormattingParser}{formatter\, stylesheet} |
| 72 | +This class, derived from \code{CollectingParser}, interprets a wide |
| 73 | +selection of HTML tags so it can produce formatted output from the |
| 74 | +parsed data. It is initialized with two objects, a \var{formatter} |
| 75 | +which should define a number of methods to format text into |
| 76 | +paragraphs, and a \var{stylesheet} which defines a number of static |
| 77 | +parameters for the formatting process. Formatters and style sheets |
| 78 | +are documented later in this section. |
| 79 | +\index{formatter} |
| 80 | +\index{style sheet} |
| 81 | +\end{funcdesc} |
| 82 | + |
| 83 | +\begin{funcdesc}{AnchoringParser}{formatter\, stylesheet} |
| 84 | +This class, derived from \code{FormattingParser}, extends the handling |
| 85 | +of the \code{<A>...</A>} tag pair to call the formatter's |
| 86 | +\code{bgn_anchor()} and \code{end_anchor()} methods. This allows the |
| 87 | +formatter to display the anchor in a different font or color, etc. |
| 88 | +\end{funcdesc} |
| 89 | + |
| 90 | +Instances of \code{CollectingParser} (and thus also instances of |
| 91 | +\code{FormattingParser} and \code{AnchoringParser}) have the following |
| 92 | +instance variables: |
| 93 | + |
| 94 | +\begin{datadesc}{anchornames} |
| 95 | +A list of the values of the \code{NAME} attributes of the \code{<A>} |
| 96 | +tags encountered. |
| 97 | +\end{datadesc} |
| 98 | + |
| 99 | +\begin{datadesc}{anchors} |
| 100 | +A list of the values of \code{HREF} attributes of the \code{<A>} tags |
| 101 | +encountered. |
| 102 | +\end{datadesc} |
| 103 | + |
| 104 | +\begin{datadesc}{anchortypes} |
| 105 | +A list of the values of the \code{TYPE} attributes of the \code{<A>} |
| 106 | +tags encountered. |
| 107 | +\end{datadesc} |
| 108 | + |
| 109 | +\begin{datadesc}{inanchor} |
| 110 | +Outside an \code{<A>...</A>} tag pair, this is zero. Inside such a |
| 111 | +pair, it is a unique integer, which is positive if the anchor has a |
| 112 | +\code{HREF} attribute, negative if it hasn't. Its absolute value is |
| 113 | +one more than the index of the anchor in the \code{anchors}, |
| 114 | +\code{anchornames} and \code{anchortypes} lists. |
| 115 | +\end{datadesc} |
| 116 | + |
| 117 | +\begin{datadesc}{isindex} |
| 118 | +True if the \code{<ISINDEX>} tag has been encountered. |
| 119 | +\end{datadesc} |
| 120 | + |
| 121 | +\begin{datadesc}{nextid} |
| 122 | +The attribute list of the last \code{<NEXTID>} tag encountered, or |
| 123 | +an empty list if none. |
| 124 | +\end{datadesc} |
| 125 | + |
| 126 | +\begin{datadesc}{title} |
| 127 | +The text inside the last \code{<TITLE>...</TITLE>} tag pair, or |
| 128 | +\code{''} if no title has been encountered yet. |
| 129 | +\end{datadesc} |
| 130 | + |
| 131 | +The \code{anchors}, \code{anchornames} and \code{anchortypes} lists |
| 132 | +are ``parallel arrays'': items in these lists with the same index |
| 133 | +pertain to the same anchor. Missing attributes default to the empty |
| 134 | +string. Anchors with neither a \code{HREF} nor a \code{NAME} |
| 135 | +attribute are not entered in these lists at all. |
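
The bookkeeping for the parallel lists and for \code{inanchor} can be
sketched as follows.  The class \code{AnchorLog} and the way attributes
are passed to it (a list of name/value pairs) are hypothetical; this is
not the code of \code{CollectingParser}.  The text above does not say
what \code{inanchor} is inside an anchor that has neither attribute, so
the sketch simply leaves it at zero, which is an assumption.

```python
class AnchorLog:
    """Hypothetical helper mirroring the anchor bookkeeping described above."""

    def __init__(self):
        self.anchors = []      # HREF values
        self.anchornames = []  # NAME values
        self.anchortypes = []  # TYPE values
        self.inanchor = 0      # zero outside an <A>...</A> pair

    def start_a(self, attrs):
        attrs = dict(attrs)
        # Missing attributes default to the empty string.
        href = attrs.get('href', '')
        name = attrs.get('name', '')
        type_ = attrs.get('type', '')
        if not href and not name:
            # Not entered in the lists at all.  Leaving inanchor at zero
            # here is an assumption of this sketch.
            return
        self.anchors.append(href)
        self.anchornames.append(name)
        self.anchortypes.append(type_)
        n = len(self.anchors)  # one more than the index of this anchor
        self.inanchor = n if href else -n

    def end_a(self):
        self.inanchor = 0


log = AnchorLog()
log.start_a([('href', 'a.html')])
first = log.inanchor    # positive: the anchor has an HREF
log.end_a()
log.start_a([('name', 'sec2'), ('type', 'x')])
second = log.inanchor   # negative: the anchor has no HREF
log.end_a()
```

The index of the current anchor in all three lists can be recovered as
\code{abs(inanchor) - 1}.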
| 136 | + |
| 137 | +The module also defines a number of style sheet classes. These should |
| 138 | +never be instantiated --- their class variables are the only behaviour |
| 139 | +required. Note that style sheets are specifically designed for a |
| 140 | +particular formatter implementation. The currently defined style |
| 141 | +sheets are: |
| 142 | +\index{style sheet} |
| 143 | + |
| 144 | +\begin{datadesc}{NullStylesheet} |
| 145 | +A style sheet for use on a dumb output device such as an ASCII |
| 146 | +terminal. |
| 147 | +\end{datadesc} |
| 148 | + |
| 149 | +\begin{datadesc}{X11Stylesheet} |
| 150 | +A style sheet for use with an X11 server. |
| 151 | +\end{datadesc} |
| 152 | + |
| 153 | +\begin{datadesc}{MacStylesheet} |
| 154 | +A style sheet for use on Apple Macintosh computers. |
| 155 | +\end{datadesc} |
| 156 | + |
| 157 | +\begin{datadesc}{StdwinStylesheet} |
| 158 | +A style sheet for use with the \code{stdwin} module; it is an alias |
| 159 | +for either \code{X11Stylesheet} or \code{MacStylesheet}. |
| 160 | +\bimodindex{stdwin} |
| 161 | +\end{datadesc} |
| 162 | + |
| 163 | +\begin{datadesc}{GLStylesheet} |
| 164 | +A style sheet for use with the SGI Graphics Library and its font |
| 165 | +manager (the SGI-specific built-in modules \code{gl} and \code{fm}). |
| 166 | +\bimodindex{gl} |
| 167 | +\bimodindex{fm} |
| 168 | +\end{datadesc} |
| 169 | + |
| 170 | +Style sheets have the following class variables: |
| 171 | + |
| 172 | +\begin{datadesc}{stdfontset} |
| 173 | +A list of up to four font definitions, respectively for the roman, |
| 174 | +italic, bold and constant-width variant of a font for normal text. If |
| 175 | +the list contains fewer than four font definitions, the last item is |
| 176 | +used as the default for missing items. The type of a font definition |
| 177 | +depends on the formatter in use; its only use is as a parameter to the |
| 178 | +formatter's \code{setfont()} method. |
| 179 | +\end{datadesc} |
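
The defaulting rule for short font sets can be sketched as below.  The
function \code{expand_fontset} is a hypothetical helper, not part of the
module, and the strings stand in for whatever font definitions a given
formatter expects.  Rejecting an empty list is an assumption, since the
text does not cover that case.

```python
def expand_fontset(fontset):
    """Return (roman, italic, bold, constant-width) for a font set.

    Hypothetical helper: pads a short list with its last item.
    """
    if not fontset:
        # Assumption: the description does not cover an empty font set.
        raise ValueError('a font set needs at least one font definition')
    fonts = list(fontset[:4])
    while len(fonts) < 4:
        fonts.append(fonts[-1])  # last item is the default for missing ones
    return tuple(fonts)


roman, italic, bold, fixed = expand_fontset(['times-roman', 'times-italic'])
```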
| 180 | + |
| 181 | +\begin{datadesc}{h1fontset} |
| 182 | +\dataline{h2fontset} |
| 183 | +\dataline{h3fontset} |
| 184 | +The font set used for various headers (text inside \code{<H1>...</H1>} |
| 185 | +tag pairs etc.). |
| 186 | +\end{datadesc} |
| 187 | + |
| 188 | +\begin{datadesc}{stdindent} |
| 189 | +The indentation of normal text. This is measured in the ``native'' |
| 190 | +units of the formatter in use; for some formatters these are |
| 191 | +characters, for others (especially those that actually support |
| 192 | +variable-spacing fonts) they are pixels or printer points. |
| 193 | +\end{datadesc} |
| 194 | + |
| 195 | +\begin{datadesc}{ddindent} |
| 196 | +The indentation used for the first level of \code{<DD>} tags. |
| 197 | +\end{datadesc} |
| 198 | + |
| 199 | +\begin{datadesc}{ulindent} |
| 200 | +The indentation used for the first level of \code{<UL>} tags. |
| 201 | +\end{datadesc} |
| 202 | + |
| 203 | +\begin{datadesc}{h1indent} |
| 204 | +The indentation used for level 1 headers. |
| 205 | +\end{datadesc} |
| 206 | + |
| 207 | +\begin{datadesc}{h2indent} |
| 208 | +The indentation used for level 2 headers. |
| 209 | +\end{datadesc} |
| 210 | + |
| 211 | +\begin{datadesc}{literalindent} |
| 212 | +The indentation used for literal text (text inside |
| 213 | +\code{<PRE>...</PRE>} and similar tag pairs). |
| 214 | +\end{datadesc} |
| 215 | + |
| 216 | +Although no documented implementation of a formatter exists, the |
| 217 | +\code{FormattingParser} class assumes that formatters have a |
| 218 | +certain interface. This interface requires the following methods: |
| 219 | +\index{formatter} |
| 220 | + |
| 221 | +\begin{funcdesc}{setfont}{fontspec} |
| 222 | +Set the font to be used subsequently. The \var{fontspec} argument is |
| 223 | +an item in a style sheet's font set. |
| 224 | +\end{funcdesc} |
| 225 | + |
| 226 | +\begin{funcdesc}{flush}{} |
| 227 | +Finish the current line, if not empty, and begin a new one. |
| 228 | +\end{funcdesc} |
| 229 | + |
| 230 | +\begin{funcdesc}{setleftindent}{n} |
| 231 | +Set the left indentation of the following lines to \var{n} units. |
| 232 | +\end{funcdesc} |
| 233 | + |
| 234 | +\begin{funcdesc}{needvspace}{n} |
| 235 | +Require at least \var{n} blank lines before the next line. Implies |
| 236 | +\code{flush()}. |
| 237 | +\end{funcdesc} |
| 238 | + |
| 239 | +\begin{funcdesc}{addword}{word\, space} |
| 240 | +Add a var{word} to the current paragraph, followed by \var{space} |
| 241 | +spaces. |
| 242 | +\end{funcdesc} |
| 243 | + |
| 244 | +\begin{datadesc}{nospace} |
| 245 | +If this instance variable is true, empty words are ignored by |
| 246 | +\code{addword}. It is set to false after a non-empty word has been |
| 247 | +added. |
| 248 | +\end{datadesc} |
| 249 | + |
| 250 | +\begin{funcdesc}{setjust}{justification} |
| 251 | +Set the justification of the current paragraph. The |
| 252 | +\var{justification} can be \code{'c'} (center), \code{'l'} (left |
| 253 | +justified), \code{'r'} (right justified) or \code{'lr'} (left and |
| 254 | +right justified). |
| 255 | +\end{funcdesc} |
| 256 | + |
| 257 | +\begin{funcdesc}{bgn_anchor}{id} |
| 258 | +Begin an anchor. The \var{id} parameter is the value of the parser's |
| 259 | +\code{inanchor} attribute. |
| 260 | +\end{funcdesc} |
| 261 | + |
| 262 | +\begin{funcdesc}{end_anchor}{id} |
| 263 | +End an anchor. The \var{id} parameter is the value of the parser's |
| 264 | +\code{inanchor} attribute. |
| 265 | +\end{funcdesc} |
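
To make the interface concrete, here is a minimal sketch of a formatter
for plain text output.  The class \code{PlainFormatter} and its
attributes \code{lines}, \code{current} and \code{anchor_events} are
hypothetical; this is not the \code{fmt} module.  The sketch measures
indentation in characters, only records the font and justification,
and resets \code{nospace} to true after each \code{flush()}, which is an
assumption (the interface only says when it becomes false).

```python
class PlainFormatter:
    """Hypothetical plain-text formatter; indentation units are characters."""

    def __init__(self):
        self.lines = []          # finished output lines
        self.current = ''        # text of the line in progress
        self.indent = 0
        self.font = None
        self.just = 'l'
        self.nospace = 1         # true: empty words are ignored
        self.anchor_events = []  # record of bgn_anchor/end_anchor calls

    def setfont(self, fontspec):
        self.font = fontspec     # recorded only; plain text has one font

    def setleftindent(self, n):
        self.indent = n

    def setjust(self, justification):
        self.just = justification  # recorded only in this sketch

    def flush(self):
        # Finish the current line, if not empty, and begin a new one.
        if self.current:
            self.lines.append(' ' * self.indent + self.current.rstrip())
            self.current = ''
        self.nospace = 1         # assumption: a new line starts "clean"

    def needvspace(self, n):
        self.flush()
        blanks = 0
        for line in reversed(self.lines):
            if line:
                break
            blanks += 1
        self.lines.extend([''] * max(0, n - blanks))

    def addword(self, word, space):
        if not word and self.nospace:
            return
        self.current += word + ' ' * space
        if word:
            self.nospace = 0

    def bgn_anchor(self, id):
        self.anchor_events.append(('bgn', id))

    def end_anchor(self, id):
        self.anchor_events.append(('end', id))


f = PlainFormatter()
f.setleftindent(2)
f.addword('', 1)       # ignored, because nospace is still true
f.addword('Hello', 1)
f.addword('world', 0)
f.needvspace(1)        # flushes, then ensures one blank line
f.addword('Bye', 0)
f.flush()
```

A parser would drive such an object through exactly these calls, so a
recording formatter of this kind can also serve as a test double.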
| 266 | + |
| 267 | +A sample formatter implementation can be found in the module |
| 268 | +\code{fmt}, which in turn uses the module \code{Para}. These are |
| 269 | +currently not intended as a |
| 270 | +\ttindex{fmt} |
| 271 | +\ttindex{Para} |