@@ -19,14 +19,15 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
1919.. class:: HTMLParser(strict=True)
2020
2121 Create a parser instance. If *strict* is ``True`` (the default), invalid
22- html results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
22+ HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
2323 *strict* is ``False``, the parser uses heuristics to make a best guess at
24- the intention of any invalid html it encounters, similar to the way most
25- browsers do.
24+ the intention of any invalid HTML it encounters, similar to the way most
25+ browsers do. Using ``strict=False`` is advised.
2626
27- An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags
28- begin and end. The :class:`HTMLParser` class is meant to be overridden by the
29- user to provide a desired behavior.
27+ An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
28+ when start tags, end tags, text, comments, and other markup elements are
29+ encountered. The user should subclass :class:`.HTMLParser` and override its
30+ methods to implement the desired behavior.
3031
3132 This parser does not check that end tags match start tags or call the end-tag
3233 handler for elements which are closed implicitly by closing an outer element.
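
Because the parser does not check nesting itself, a subclass can keep its own stack of open tags. The following is a minimal sketch of that idea, not part of the module: the class name ``TagBalanceChecker`` and its ``open_tags`` and ``mismatches`` attributes are hypothetical, and the parser is constructed without arguments so that the sketch does not depend on *strict*.

```python
from html.parser import HTMLParser


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records end tags that do not close the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of currently open tag names
        self.mismatches = []   # (expected, found) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop the innermost open tag and compare it with the end tag seen.
        expected = self.open_tags.pop() if self.open_tags else None
        if expected != tag:
            self.mismatches.append((expected, tag))


checker = TagBalanceChecker()
checker.feed('<p><a>tag soup</p></a>')
checker.close()
print(checker.mismatches)
```

Each mismatch is reported as the tag that was expected to close next followed by the tag that was actually found. The sketch treats every start tag as needing an end tag, so it would need a list of void elements (such as ``br``) before being used on real pages.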
@@ -39,25 +40,61 @@ An exception is defined as well:
3940.. exception:: HTMLParseError
4041
4142 Exception raised by the :class:`HTMLParser` class when it encounters an error
42- while parsing. This exception provides three attributes: :attr:`msg` is a brief
43- message explaining the error, :attr:`lineno` is the number of the line on which
44- the broken construct was detected, and :attr:`offset` is the number of
45- characters into the line at which the construct starts.
43+ while parsing and *strict* is ``True``. This exception provides three
44+ attributes: :attr:`msg` is a brief message explaining the error,
45+ :attr:`lineno` is the number of the line on which the broken construct was
46+ detected, and :attr:`offset` is the number of characters into the line at
47+ which the construct starts.
4648
47- :class:`HTMLParser` instances have the following methods:
4849
50+ Example HTML Parser Application
51+ -------------------------------
4952
50- .. method:: HTMLParser.reset()
53+ As a basic example, below is a simple HTML parser that uses the
54+ :class:`HTMLParser` class to print out start tags, end tags, and data
55+ as they are encountered::
5156
52- Reset the instance. Loses all unprocessed data. This is called implicitly at
53- instantiation time.
57+ from html.parser import HTMLParser
58+
59+ class MyHTMLParser(HTMLParser):
60+     def handle_starttag(self, tag, attrs):
61+         print("Encountered a start tag:", tag)
62+     def handle_endtag(self, tag):
63+         print("Encountered an end tag :", tag)
64+     def handle_data(self, data):
65+         print("Encountered some data :", data)
66+
67+ parser = MyHTMLParser(strict=False)
68+ parser.feed('<html><head><title>Test</title></head>'
69+             '<body><h1>Parse me!</h1></body></html>')
70+
71+ The output will then be::
72+
73+ Encountered a start tag: html
74+ Encountered a start tag: head
75+ Encountered a start tag: title
76+ Encountered some data : Test
77+ Encountered an end tag : title
78+ Encountered an end tag : head
79+ Encountered a start tag: body
80+ Encountered a start tag: h1
81+ Encountered some data : Parse me!
82+ Encountered an end tag : h1
83+ Encountered an end tag : body
84+ Encountered an end tag : html
85+
86+
87+ :class:`.HTMLParser` Methods
88+ ----------------------------
89+
90+ :class:`HTMLParser` instances have the following methods:
5491
5592
5693.. method:: HTMLParser.feed(data)
5794
5895 Feed some text to the parser. It is processed insofar as it consists of
5996 complete elements; incomplete data is buffered until more data is fed or
60- :meth:`close` is called.
97+ :meth:`close` is called. *data* must be :class:`str`.
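
Since *data* must be :class:`str`, bytes read from a file or socket have to be decoded before they are fed. A minimal sketch follows; the ``TextCollector`` class is a hypothetical helper, and UTF-8 is an assumed encoding that real input would have to declare or be detected for.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every piece of text passed to handle_data()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Bytes as they might arrive from a file or the network (UTF-8 assumed).
raw = '<p>caf\u00e9</p>'.encode('utf-8')

collector = TextCollector()
collector.feed(raw.decode('utf-8'))   # decode to str before feeding
collector.close()

text = ''.join(collector.chunks)
print(text)
```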
6198
6299
63100.. method:: HTMLParser.close()
@@ -68,6 +105,12 @@ An exception is defined as well:
68105 the :class:`HTMLParser` base class method :meth:`close`.
69106
70107
108+ .. method:: HTMLParser.reset()
109+
110+ Reset the instance. Loses all unprocessed data. This is called implicitly at
111+ instantiation time.
112+
113+
71114.. method:: HTMLParser.getpos()
72115
73116 Return current line number and offset.
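
As an illustration, a handler can call :meth:`getpos` to record where each construct was found. This is a sketch under stated assumptions: ``PositionLogger`` and its ``positions`` attribute are hypothetical names, and the parser is constructed without arguments.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical helper: stores (tag, (lineno, offset)) for each start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (lineno, offset) tuple for the current position.
        self.positions.append((tag, self.getpos()))


logger = PositionLogger()
logger.feed('<html>\n  <body>\n</body></html>')
logger.close()

for tag, (lineno, offset) in logger.positions:
    print(tag, lineno, offset)
```

Line numbers start at 1 and offsets at 0, so a tag at the very beginning of the input is reported at ``(1, 0)``.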
@@ -81,23 +124,35 @@ An exception is defined as well:
81124 attributes can be preserved, etc.).
82125
83126
127+ The following methods are called when data or markup elements are encountered
128+ and they are meant to be overridden in a subclass. The base class
129+ implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
130+
131+
84132.. method:: HTMLParser.handle_starttag(tag, attrs)
85133
86- This method is called to handle the start of a tag. It is intended to be
87- overridden by a derived class; the base class implementation does nothing.
134+ This method is called to handle the start of a tag (e.g. ``<div id="main">``).
88135
89136 The *tag* argument is the name of the tag converted to lower case. The *attrs*
90137 argument is a list of ``(name, value)`` pairs containing the attributes found
91138 inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
92139 and quotes in the *value* have been removed, and character and entity references
93- have been replaced. For instance, for the tag ``<A
94- HREF="http://www.cwi.nl/">``, this method would be called as
95- ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
140+ have been replaced.
141+
142+ For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
143+ would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
96144
97145 All entity references from :mod:`html.entities` are replaced in the attribute
98146 values.
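
Because *attrs* is a list of pairs, it is often convenient to turn it into a dictionary before looking up a single attribute. The sketch below collects link targets this way; ``LinkCollector`` and its ``links`` attribute are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive already converted to lower case.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A><a name="top">no link</a>')
collector.close()
print(collector.links)
```

Note that ``dict(attrs)`` keeps only the last value when an attribute name is repeated; iterate over the list directly if duplicates matter.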
99147
100148
149+ .. method:: HTMLParser.handle_endtag(tag)
150+
151+ This method is called to handle the end tag of an element (e.g. ``</div>``).
152+
153+ The *tag* argument is the name of the tag converted to lower case.
154+
155+
101156.. method:: HTMLParser.handle_startendtag(tag, attrs)
102157
103158 Similar to :meth:`handle_starttag`, but called when the parser encounters an
@@ -106,57 +161,46 @@ An exception is defined as well:
106161 implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
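
The default behavior can be observed with a subclass that overrides only the start-tag and end-tag handlers. In this sketch, ``EventRecorder`` and its ``events`` attribute are hypothetical names; :meth:`handle_startendtag` is deliberately left untouched.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records start and end tag events in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


recorder = EventRecorder()
recorder.feed('<p>line one<br/>line two</p>')
recorder.close()
print(recorder.events)
```

The ``<br/>`` tag shows up as a start event immediately followed by an end event, both produced by the inherited :meth:`handle_startendtag`.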
107162
108163
109- .. method:: HTMLParser.handle_endtag(tag)
110-
111- This method is called to handle the end tag of an element. It is intended to be
112- overridden by a derived class; the base class implementation does nothing. The
113- *tag* argument is the name of the tag converted to lower case.
114-
115-
116164.. method:: HTMLParser.handle_data(data)
117165
118- This method is called to process arbitrary data (e.g. the content of
119- ``<script>...</script>`` and ``<style>...</style>``). It is intended to be
120- overridden by a derived class; the base class implementation does nothing.
166+ This method is called to process arbitrary data (e.g. text nodes and the
167+ content of ``<script>...</script>`` and ``<style>...</style>``).
121168
122169
123- .. method:: HTMLParser.handle_charref(name)
170+ .. method:: HTMLParser.handle_entityref(name)
124171
125- This method is called to process a character reference of the form ``&#ref;``.
126- It is intended to be overridden by a derived class; the base class
127- implementation does nothing.
172+ This method is called to process a named character reference of the form
173+ ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
174+ (e.g. ``'gt'``).
128175
129176
130- .. method:: HTMLParser.handle_entityref(name)
177+ .. method:: HTMLParser.handle_charref(name)
131178
132- This method is called to process a general entity reference of the form
133- ``&name;`` where *name* is an general entity reference. It is intended to be
134- overridden by a derived class; the base class implementation does nothing.
179+ This method is called to process decimal and hexadecimal numeric character
180+ references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
181+ equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
182+ in this case the method will receive ``'62'`` or ``'x3E'``.
135183
136184
137185.. method:: HTMLParser.handle_comment(data)
138186
139- This method is called when a comment is encountered. The *comment* argument is
140- a string containing the text between the ``--`` and ``--`` delimiters, but not
141- the delimiters themselves. For example, the comment ``<!--text-->`` will cause
142- this method to be called with the argument ``'text'``. It is intended to be
143- overridden by a derived class; the base class implementation does nothing.
187+ This method is called when a comment is encountered (e.g. ``<!--comment-->``).
144188
189+ For example, the comment ``<!-- comment -->`` will cause this method to be
190+ called with the argument ``' comment '``.
145191
146- .. method:: HTMLParser.handle_decl(decl)
192+ The content of Internet Explorer conditional comments (condcoms) will also be
193+ sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
194+ this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
147195
148- Method called when an SGML ``doctype`` declaration is read by the parser.
149- The *decl* parameter will be the entire contents of the declaration inside
150- the ``<!...>`` markup. It is intended to be overridden by a derived class;
151- the base class implementation does nothing.
152196
197+ .. method:: HTMLParser.handle_decl(decl)
153198
154- .. method:: HTMLParser.unknown_decl(data)
199+ This method is called to handle an HTML doctype declaration (e.g.
200+ ``<!DOCTYPE html>``).
155201
156- Method called when an unrecognized SGML declaration is read by the parser.
157- The *data* parameter will be the entire contents of the declaration inside
158- the ``<!...>`` markup. It is sometimes useful to be overridden by a
159- derived class; the base class implementation raises an :exc:`HTMLParseError`.
202+ The *decl* parameter will be the entire contents of the declaration inside
203+ the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
160204
161205
162206.. method:: HTMLParser.handle_pi(data)
@@ -174,29 +218,123 @@ An exception is defined as well:
174218 cause the ``'?'`` to be included in *data*.
175219
176220
177- .. _htmlparser-example:
221+ .. method:: HTMLParser.unknown_decl(data)
178222
179- Example HTML Parser Application
180- -------------------------------
223+ This method is called when an unrecognized declaration is read by the parser.
224+
225+ The *data* parameter will be the entire contents of the declaration inside
226+ the ``<![...]>`` markup. It is sometimes useful to be overridden by a
227+ derived class. The base class implementation raises an :exc:`HTMLParseError`
228+ when *strict* is ``True``.
181229
182- As a basic example, below is a simple HTML parser that uses the
183- :class:`HTMLParser` class to print out start tags, end tags, and data
184- as they are encountered::
230+
231+ .. _htmlparser-examples:
232+
233+ Examples
234+ --------
235+
236+ The following class implements a parser that will be used to illustrate more
237+ examples::
185238
186239 from html.parser import HTMLParser
240+ from html.entities import name2codepoint
187241
188242 class MyHTMLParser(HTMLParser):
189243     def handle_starttag(self, tag, attrs):
190-         print("Encountered a start tag:", tag)
244+         print("Start tag:", tag)
245+         for attr in attrs:
246+             print(" attr:", attr)
191247     def handle_endtag(self, tag):
192-         print("Encountered an end tag :", tag)
248+         print("End tag :", tag)
193249     def handle_data(self, data):
194-         print("Encountered some data:", data)
195-
196- parser = MyHTMLParser()
197- parser.feed('<html><head><title>Test</title></head>'
198-             '<body><h1>Parse me!</h1></body></html>')
199-
250+         print("Data :", data)
251+     def handle_comment(self, data):
252+         print("Comment :", data)
253+     def handle_entityref(self, name):
254+         c = chr(name2codepoint[name])
255+         print("Named ent:", c)
256+     def handle_charref(self, name):
257+         if name.startswith('x'):
258+             c = chr(int(name[1:], 16))
259+         else:
260+             c = chr(int(name))
261+         print("Num ent :", c)
262+     def handle_decl(self, data):
263+         print("Decl :", data)
264+
265+ parser = MyHTMLParser(strict=False)
266+
267+ Parsing a doctype::
268+
269+ >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
270+ ... '"http://www.w3.org/TR/html4/strict.dtd">')
271+ Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
272+
273+ Parsing an element with a few attributes and a title::
274+
275+ >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
276+ Start tag: img
277+ attr: ('src', 'python-logo.png')
278+ attr: ('alt', 'The Python logo')
279+ >>>
280+ >>> parser.feed('<h1>Python</h1>')
281+ Start tag: h1
282+ Data : Python
283+ End tag : h1
284+
285+ The content of ``script`` and ``style`` elements is returned as is, without
286+ further parsing::
287+
288+ >>> parser.feed('<style type="text/css">#python { color: green }</style>')
289+ Start tag: style
290+ attr: ('type', 'text/css')
291+ Data : #python { color: green }
292+ End tag : style
293+ >>>
294+ >>> parser.feed('<script type="text/javascript">'
295+ ... 'alert("<strong>hello!</strong>");</script>')
296+ Start tag: script
297+ attr: ('type', 'text/javascript')
298+ Data : alert("<strong>hello!</strong>");
299+ End tag : script
300+
301+ Parsing comments::
302+
303+ >>> parser.feed('<!-- a comment -->'
304+ ... '<!--[if IE 9]>IE-specific content<![endif]-->')
305+ Comment : a comment
306+ Comment : [if IE 9]>IE-specific content<![endif]
307+
308+ Parsing named and numeric character references and converting them to the
309+ correct char (note: these 3 references are all equivalent to ``'>' ``)::
310+
311+ >>> parser.feed('&gt;&#62;&#x3E;')
312+ Named ent: >
313+ Num ent : >
314+ Num ent : >
315+
316+ Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
317+ :meth:`~HTMLParser.handle_data` might be called more than once::
318+
319+ >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
320+ ... parser.feed(chunk)
321+ ...
322+ Start tag: span
323+ Data : buff
324+ Data : ered
325+ Data : text
326+ End tag : span
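
When the text of an element is needed as a single string, the pieces passed to :meth:`~HTMLParser.handle_data` can be accumulated and joined once the end tag arrives. The following is one possible sketch; ``SpanText`` and its attributes are hypothetical names, and nested ``span`` elements are not handled.

```python
from html.parser import HTMLParser


class SpanText(HTMLParser):
    """Hypothetical helper: joins the data chunks seen inside each <span>."""

    def __init__(self):
        super().__init__()
        self._parts = []   # chunks collected since the last <span> start tag
        self.texts = []    # one joined string per completed <span>

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self._parts = []

    def handle_data(self, data):
        self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'span':
            self.texts.append(''.join(self._parts))


span_parser = SpanText()
for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    span_parser.feed(chunk)
span_parser.close()
print(span_parser.texts)
```

The joined result does not depend on how many times :meth:`~HTMLParser.handle_data` was called for the element.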
327+
328+ Parsing invalid HTML (e.g. unquoted attributes) also works::
329+
330+ >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
331+ Start tag: p
332+ Start tag: a
333+ attr: ('class', 'link')
334+ attr: ('href', '#main')
335+ Data : tag soup
336+ End tag : p
337+ End tag : a
200338
201339.. rubric:: Footnotes
202340