Commit c48cfe3

committed
#14020: merge with 3.2.
2 parents aa2c670 + 4279bc7 commit c48cfe3

1 file changed

Lines changed: 205 additions & 67 deletions

Doc/library/html.parser.rst

@@ -19,14 +19,15 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
1919
.. class:: HTMLParser(strict=True)
2020

2121
Create a parser instance. If *strict* is ``True`` (the default), invalid
22-
html results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
22+
HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
2323
*strict* is ``False``, the parser uses heuristics to make a best guess at
24-
the intention of any invalid html it encounters, similar to the way most
25-
browsers do.
24+
the intention of any invalid HTML it encounters, similar to the way most
25+
browsers do. Using ``strict=False`` is advised.
2626

27-
An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags
28-
begin and end. The :class:`HTMLParser` class is meant to be overridden by the
29-
user to provide a desired behavior.
27+
An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
28+
when start tags, end tags, text, comments, and other markup elements are
29+
encountered. The user should subclass :class:`.HTMLParser` and override its
30+
methods to implement the desired behavior.
3031

3132
This parser does not check that end tags match start tags or call the end-tag
3233
handler for elements which are closed implicitly by closing an outer element.
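
Because the parser itself does not pair end tags with start tags, a subclass has to do so if it needs that check. The following is a minimal sketch, not part of the library: the class name ``TagBalanceChecker``, its attributes, and the small ``VOID_ELEMENTS`` set are hypothetical, and it assumes a recent Python 3 in which the *strict* argument is no longer accepted by the constructor.

```python
from html.parser import HTMLParser


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records end tags that do not close the
    innermost open element."""

    # Assumption: a small, incomplete set of elements that never take an end tag.
    VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of currently open element names
        self.mismatches = []   # end tags that did not match the top of the stack

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.mismatches.append(tag)


checker = TagBalanceChecker()
# The <li> element is closed implicitly by </ul>; no end-tag call is made for it.
checker.feed("<ul><li>item</ul>")
checker.close()
print(checker.mismatches)  # end tags that did not match
print(checker.open_tags)   # elements never closed explicitly
```

Here ``</ul>`` is reported as a mismatch because ``li`` is still on top of the stack, which shows that the end-tag handler was never called for the implicitly closed element.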
@@ -39,25 +40,61 @@ An exception is defined as well:
3940
.. exception:: HTMLParseError
4041

4142
Exception raised by the :class:`HTMLParser` class when it encounters an error
42-
while parsing. This exception provides three attributes: :attr:`msg` is a brief
43-
message explaining the error, :attr:`lineno` is the number of the line on which
44-
the broken construct was detected, and :attr:`offset` is the number of
45-
characters into the line at which the construct starts.
43+
while parsing and *strict* is ``True``. This exception provides three
44+
attributes: :attr:`msg` is a brief message explaining the error,
45+
:attr:`lineno` is the number of the line on which the broken construct was
46+
detected, and :attr:`offset` is the number of characters into the line at
47+
which the construct starts.
4648

47-
:class:`HTMLParser` instances have the following methods:
4849

50+
Example HTML Parser Application
51+
-------------------------------
4952

50-
.. method:: HTMLParser.reset()
53+
As a basic example, below is a simple HTML parser that uses the
54+
:class:`HTMLParser` class to print out start tags, end tags, and data
55+
as they are encountered::
5156

52-
Reset the instance. Loses all unprocessed data. This is called implicitly at
53-
instantiation time.
57+
from html.parser import HTMLParser
58+
59+
class MyHTMLParser(HTMLParser):
60+
    def handle_starttag(self, tag, attrs):
61+
        print("Encountered a start tag:", tag)
62+
    def handle_endtag(self, tag):
63+
        print("Encountered an end tag :", tag)
64+
    def handle_data(self, data):
65+
        print("Encountered some data :", data)
66+
67+
parser = MyHTMLParser(strict=False)
68+
parser.feed('<html><head><title>Test</title></head>'
69+
            '<body><h1>Parse me!</h1></body></html>')
70+
71+
The output will then be::
72+
73+
Encountered a start tag: html
74+
Encountered a start tag: head
75+
Encountered a start tag: title
76+
Encountered some data : Test
77+
Encountered an end tag : title
78+
Encountered an end tag : head
79+
Encountered a start tag: body
80+
Encountered a start tag: h1
81+
Encountered some data : Parse me!
82+
Encountered an end tag : h1
83+
Encountered an end tag : body
84+
Encountered an end tag : html
85+
86+
87+
:class:`.HTMLParser` Methods
88+
----------------------------
89+
90+
:class:`HTMLParser` instances have the following methods:
5491

5592

5693
.. method:: HTMLParser.feed(data)
5794

5895
Feed some text to the parser. It is processed insofar as it consists of
5996
complete elements; incomplete data is buffered until more data is fed or
60-
:meth:`close` is called.
97+
:meth:`close` is called. *data* must be :class:`str`.
6198

6299

63100
.. method:: HTMLParser.close()
@@ -68,6 +105,12 @@ An exception is defined as well:
68105
the :class:`HTMLParser` base class method :meth:`close`.
69106

70107

108+
.. method:: HTMLParser.reset()
109+
110+
Reset the instance. Loses all unprocessed data. This is called implicitly at
111+
instantiation time.
112+
113+
71114
.. method:: HTMLParser.getpos()
72115

73116
Return current line number and offset.
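
As an illustration of :meth:`getpos`, the sketch below records where each start tag begins. The class name ``PositionLogger`` and its ``positions`` attribute are hypothetical, and the constructor is called without *strict* on the assumption of a recent Python 3.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical helper: remembers the position reported by getpos()
    at the moment each start tag is handled."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) tuple.
        self.positions.append((tag, self.getpos()))


logger = PositionLogger()
logger.feed("<html>\n  <body>\n<p>hi</p>")
logger.close()
for tag, (lineno, offset) in logger.positions:
    print(tag, "starts on line", lineno, "at offset", offset)
```

Line numbers start at 1 and offsets at 0, so the ``body`` tag, which follows two spaces on the second line, is reported at ``(2, 2)``.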
@@ -81,23 +124,35 @@ An exception is defined as well:
81124
attributes can be preserved, etc.).
82125

83126

127+
The following methods are called when data or markup elements are encountered
128+
and they are meant to be overridden in a subclass. The base class
129+
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
130+
131+
84132
.. method:: HTMLParser.handle_starttag(tag, attrs)
85133

86-
This method is called to handle the start of a tag. It is intended to be
87-
overridden by a derived class; the base class implementation does nothing.
134+
This method is called to handle the start of a tag (e.g. ``<div id="main">``).
88135

89136
The *tag* argument is the name of the tag converted to lower case. The *attrs*
90137
argument is a list of ``(name, value)`` pairs containing the attributes found
91138
inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
92139
and quotes in the *value* have been removed, and character and entity references
93-
have been replaced. For instance, for the tag ``<A
94-
HREF="http://www.cwi.nl/">``, this method would be called as
95-
``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
140+
have been replaced.
141+
142+
For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
143+
would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
96144

97145
All entity references from :mod:`html.entities` are replaced in the attribute
98146
values.
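
Since *attrs* is a list of ``(name, value)`` pairs, it can be passed to ``dict()`` when duplicate attribute names are not a concern. The sketch below collects link targets this way. ``LinkCollector`` and its ``links`` attribute are hypothetical names, the second URL is made up for illustration, and a recent Python 3 (constructor without *strict*) is assumed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased; values arrive unquoted
        # with character and entity references already replaced.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A>'
               '<a href="/search?q=html&amp;page=2">next</a>')
collector.close()
print(collector.links)
```

The ``&amp;`` in the second attribute value reaches the handler as a plain ``&``.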
99147

100148

149+
.. method:: HTMLParser.handle_endtag(tag)
150+
151+
This method is called to handle the end tag of an element (e.g. ``</div>``).
152+
153+
The *tag* argument is the name of the tag converted to lower case.
154+
155+
101156
.. method:: HTMLParser.handle_startendtag(tag, attrs)
102157

103158
Similar to :meth:`handle_starttag`, but called when the parser encounters an
@@ -106,57 +161,46 @@ An exception is defined as well:
106161
implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
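
The default behaviour can be observed with a subclass that overrides only the start-tag and end-tag handlers. This is a minimal sketch; ``EventRecorder`` and its ``events`` list are hypothetical names, and a recent Python 3 (constructor without *strict*) is assumed.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: logs start-tag and end-tag calls in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


recorder = EventRecorder()
# handle_startendtag() is not overridden, so the base class implementation
# forwards each XHTML-style empty tag to both handlers above.
recorder.feed("<br/><img src='x.png' />")
recorder.close()
print(recorder.events)
```

A subclass that needs to tell ``<br/>`` apart from ``<br></br>`` would override :meth:`handle_startendtag` instead of relying on this forwarding.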
107162

108163

109-
.. method:: HTMLParser.handle_endtag(tag)
110-
111-
This method is called to handle the end tag of an element. It is intended to be
112-
overridden by a derived class; the base class implementation does nothing. The
113-
*tag* argument is the name of the tag converted to lower case.
114-
115-
116164
.. method:: HTMLParser.handle_data(data)
117165

118-
This method is called to process arbitrary data (e.g. the content of
119-
``<script>...</script>`` and ``<style>...</style>``). It is intended to be
120-
overridden by a derived class; the base class implementation does nothing.
166+
This method is called to process arbitrary data (e.g. text nodes and the
167+
content of ``<script>...</script>`` and ``<style>...</style>``).
121168

122169

123-
.. method:: HTMLParser.handle_charref(name)
170+
.. method:: HTMLParser.handle_entityref(name)
124171

125-
This method is called to process a character reference of the form ``&#ref;``.
126-
It is intended to be overridden by a derived class; the base class
127-
implementation does nothing.
172+
This method is called to process a named character reference of the form
173+
``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
174+
(e.g. ``'gt'``).
128175

129176

130-
.. method:: HTMLParser.handle_entityref(name)
177+
.. method:: HTMLParser.handle_charref(name)
131178

132-
This method is called to process a general entity reference of the form
133-
``&name;`` where *name* is an general entity reference. It is intended to be
134-
overridden by a derived class; the base class implementation does nothing.
179+
This method is called to process decimal and hexadecimal numeric character
180+
references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
181+
equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
182+
in this case the method will receive ``'62'`` or ``'x3E'``.
135183

136184

137185
.. method:: HTMLParser.handle_comment(data)
138186

139-
This method is called when a comment is encountered. The *comment* argument is
140-
a string containing the text between the ``--`` and ``--`` delimiters, but not
141-
the delimiters themselves. For example, the comment ``<!--text-->`` will cause
142-
this method to be called with the argument ``'text'``. It is intended to be
143-
overridden by a derived class; the base class implementation does nothing.
187+
This method is called when a comment is encountered (e.g. ``<!--comment-->``).
144188

189+
For example, the comment ``<!-- comment -->`` will cause this method to be
190+
called with the argument ``' comment '``.
145191

146-
.. method:: HTMLParser.handle_decl(decl)
192+
The content of Internet Explorer conditional comments (condcoms) will also be
193+
sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
194+
this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
147195

148-
Method called when an SGML ``doctype`` declaration is read by the parser.
149-
The *decl* parameter will be the entire contents of the declaration inside
150-
the ``<!...>`` markup. It is intended to be overridden by a derived class;
151-
the base class implementation does nothing.
152196

197+
.. method:: HTMLParser.handle_decl(decl)
153198

154-
.. method:: HTMLParser.unknown_decl(data)
199+
This method is called to handle an HTML doctype declaration (e.g.
200+
``<!DOCTYPE html>``).
155201

156-
Method called when an unrecognized SGML declaration is read by the parser.
157-
The *data* parameter will be the entire contents of the declaration inside
158-
the ``<!...>`` markup. It is sometimes useful to be overridden by a
159-
derived class; the base class implementation raises an :exc:`HTMLParseError`.
202+
The *decl* parameter will be the entire contents of the declaration inside
203+
the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
160204

161205

162206
.. method:: HTMLParser.handle_pi(data)
@@ -174,29 +218,123 @@ An exception is defined as well:
174218
cause the ``'?'`` to be included in *data*.
175219

176220

177-
.. _htmlparser-example:
221+
.. method:: HTMLParser.unknown_decl(data)
178222

179-
Example HTML Parser Application
180-
-------------------------------
223+
This method is called when an unrecognized declaration is read by the parser.
224+
225+
The *data* parameter will be the entire contents of the declaration inside
226+
the ``<![...]>`` markup. Overriding this method in a derived class is
227+
sometimes useful. The base class implementation raises an
228+
:exc:`HTMLParseError` when *strict* is ``True``.
181229

182-
As a basic example, below is a simple HTML parser that uses the
183-
:class:`HTMLParser` class to print out start tags, end tags, and data
184-
as they are encountered::
230+
231+
.. _htmlparser-examples:
232+
233+
Examples
234+
--------
235+
236+
The following class implements a parser that will be used to illustrate more
237+
examples::
185238

186239
from html.parser import HTMLParser
240+
from html.entities import name2codepoint
187241

188242
class MyHTMLParser(HTMLParser):
189243
    def handle_starttag(self, tag, attrs):
190-
        print("Encountered a start tag:", tag)
244+
        print("Start tag:", tag)
245+
        for attr in attrs:
246+
            print(" attr:", attr)
191247
    def handle_endtag(self, tag):
192-
        print("Encountered an end tag:", tag)
248+
        print("End tag :", tag)
193249
    def handle_data(self, data):
194-
        print("Encountered some data:", data)
195-
196-
parser = MyHTMLParser()
197-
parser.feed('<html><head><title>Test</title></head>'
198-
            '<body><h1>Parse me!</h1></body></html>')
199-
250+
        print("Data :", data)
251+
    def handle_comment(self, data):
252+
        print("Comment :", data)
253+
    def handle_entityref(self, name):
254+
        c = chr(name2codepoint[name])
255+
        print("Named ent:", c)
256+
    def handle_charref(self, name):
257+
        if name.startswith('x'):
258+
            c = chr(int(name[1:], 16))
259+
        else:
260+
            c = chr(int(name))
261+
        print("Num ent :", c)
262+
    def handle_decl(self, data):
263+
        print("Decl :", data)
264+
265+
parser = MyHTMLParser(strict=False)
266+
267+
Parsing a doctype::
268+
269+
>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
270+
... '"http://www.w3.org/TR/html4/strict.dtd">')
271+
Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
272+
273+
Parsing an element with a few attributes and a title::
274+
275+
>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
276+
Start tag: img
277+
attr: ('src', 'python-logo.png')
278+
attr: ('alt', 'The Python logo')
279+
>>>
280+
>>> parser.feed('<h1>Python</h1>')
281+
Start tag: h1
282+
Data : Python
283+
End tag : h1
284+
285+
The content of ``script`` and ``style`` elements is returned as is, without
286+
further parsing::
287+
288+
>>> parser.feed('<style type="text/css">#python { color: green }</style>')
289+
Start tag: style
290+
attr: ('type', 'text/css')
291+
Data : #python { color: green }
292+
End tag : style
293+
>>>
294+
>>> parser.feed('<script type="text/javascript">'
295+
... 'alert("<strong>hello!</strong>");</script>')
296+
Start tag: script
297+
attr: ('type', 'text/javascript')
298+
Data : alert("<strong>hello!</strong>");
299+
End tag : script
300+
301+
Parsing comments::
302+
303+
>>> parser.feed('<!-- a comment -->'
304+
... '<!--[if IE 9]>IE-specific content<![endif]-->')
305+
Comment : a comment
306+
Comment : [if IE 9]>IE-specific content<![endif]
307+
308+
Parsing named and numeric character references and converting them to the
309+
correct character (note: these three references are all equivalent to ``'>'``)::
310+
311+
>>> parser.feed('&gt;&#62;&#x3E;')
312+
Named ent: >
313+
Num ent : >
314+
Num ent : >
315+
316+
Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
317+
:meth:`~HTMLParser.handle_data` might be called more than once::
318+
319+
>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
320+
...     parser.feed(chunk)
321+
...
322+
Start tag: span
323+
Data : buff
324+
Data : ered
325+
Data : text
326+
End tag : span
327+
328+
Parsing invalid HTML (e.g. unquoted attributes) also works::
329+
330+
>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
331+
Start tag: p
332+
Start tag: a
333+
attr: ('class', 'link')
334+
attr: ('href', '#main')
335+
Data : tag soup
336+
End tag : p
337+
End tag : a
200338

201339
.. rubric:: Footnotes
202340
