Commit 3861d8b

#15114: the strict mode of HTMLParser and the HTMLParseError exception are deprecated now that the parser is able to parse invalid markup.
1 parent a4db02c commit 3861d8b

4 files changed

Lines changed: 35 additions & 18 deletions

Doc/library/html.parser.rst

Lines changed: 15 additions & 6 deletions
@@ -16,13 +16,14 @@
 This module defines a class :class:`HTMLParser` which serves as the basis for
 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
 
-.. class:: HTMLParser(strict=True)
+.. class:: HTMLParser(strict=False)
 
-   Create a parser instance. If *strict* is ``True`` (the default), invalid
-   HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
-   *strict* is ``False``, the parser uses heuristics to make a best guess at
-   the intention of any invalid HTML it encounters, similar to the way most
-   browsers do. Using ``strict=False`` is advised.
+   Create a parser instance. If *strict* is ``False`` (the default), the parser
+   will accept and parse invalid markup. If *strict* is ``True`` the parser
+   will raise an :exc:`~html.parser.HTMLParseError` exception instead [#]_ when
+   it's not able to parse the markup.
+   The use of ``strict=True`` is discouraged and the *strict* argument is
+   deprecated.
 
    An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
    when start tags, end tags, text, comments, and other markup elements are
@@ -34,6 +35,10 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
 
    .. versionchanged:: 3.2 *strict* keyword added
 
+   .. deprecated-removed:: 3.3 3.5
+      The *strict* argument and the strict mode have been deprecated.
+      The parser is now able to accept and parse invalid markup too.
+
 An exception is defined as well:
 
 
@@ -46,6 +51,10 @@ An exception is defined as well:
    detected, and :attr:`offset` is the number of characters into the line at
    which the construct starts.
 
+   .. deprecated-removed:: 3.3 3.5
+      This exception has been deprecated because it's never raised by the parser
+      (when the default non-strict mode is used).
+
 
 Example HTML Parser Application
 -------------------------------
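
The documentation change above says the default (non-strict) parser accepts invalid markup instead of raising. A minimal sketch of that behaviour follows, assuming a Python version where non-strict parsing is the default. `TagLogger` and `collect_events` are hypothetical names introduced for illustration and are not part of this commit.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: record parser callbacks as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def collect_events(markup):
    """Parse markup in the default (non-strict) mode and return the events."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.events


# <p> and <b> are never closed: invalid markup, yet no exception is raised.
events = collect_events("<p>unclosed <b>bold")
```

The parser may deliver text in more than one `handle_data` call, so callers that need the full text should join the pieces rather than rely on a single callback.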

Lib/html/parser.py

Lines changed: 12 additions & 9 deletions
@@ -10,6 +10,7 @@
 
 import _markupbase
 import re
+import warnings
 
 # Regular expressions used for parsing
 
@@ -113,14 +114,16 @@ class HTMLParser(_markupbase.ParserBase):
 
     CDATA_CONTENT_ELEMENTS = ("script", "style")
 
-    def __init__(self, strict=True):
+    def __init__(self, strict=False):
         """Initialize and reset this instance.
 
-        If strict is set to True (the default), errors are raised when invalid
-        HTML is encountered. If set to False, an attempt is instead made to
-        continue parsing, making "best guesses" about the intended meaning, in
-        a fashion similar to what browsers typically do.
+        If strict is set to False (the default) the parser will parse invalid
+        markup, otherwise it will raise an error. Note that the strict mode
+        is deprecated.
         """
+        if strict:
+            warnings.warn("The strict mode is deprecated.",
+                          DeprecationWarning, stacklevel=2)
         self.strict = strict
         self.reset()
 
@@ -271,8 +274,8 @@ def goahead(self, end):
     # See also parse_declaration in _markupbase
     def parse_html_declaration(self, i):
         rawdata = self.rawdata
-        if rawdata[i:i+2] != '<!':
-            self.error('unexpected call to parse_html_declaration()')
+        assert rawdata[i:i+2] == '<!', ('unexpected call to '
+                                        'parse_html_declaration()')
         if rawdata[i:i+4] == '<!--':
             # this case is actually already handled in goahead()
             return self.parse_comment(i)
@@ -292,8 +295,8 @@ def parse_html_declaration(self, i):
     # see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
     def parse_bogus_comment(self, i, report=1):
         rawdata = self.rawdata
-        if rawdata[i:i+2] not in ('<!', '</'):
-            self.error('unexpected call to parse_comment()')
+        assert rawdata[i:i+2] in ('<!', '</'), ('unexpected call to '
+                                                'parse_comment()')
         pos = rawdata.find('>', i+2)
         if pos == -1:
             return -1
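
The `__init__` change in this file follows a common deprecation pattern: keep accepting the old argument, but emit a `DeprecationWarning` attributed to the caller. A minimal sketch of that pattern in isolation, where `StrictFlagDemo` and `count_deprecations` are hypothetical names used only for illustration and not code from this commit:

```python
import warnings


class StrictFlagDemo:
    """Hypothetical stand-in for a class with a deprecated keyword argument."""

    def __init__(self, strict=False):
        if strict:
            # stacklevel=2 attributes the warning to the caller's line
            # rather than to this __init__ body.
            warnings.warn("The strict mode is deprecated.",
                          DeprecationWarning, stacklevel=2)
        self.strict = strict


def count_deprecations(**kwargs):
    """Instantiate the demo class and return the DeprecationWarnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        # DeprecationWarning is often filtered out by default; show everything.
        warnings.simplefilter("always")
        StrictFlagDemo(**kwargs)
    return [w for w in caught if issubclass(w.category, DeprecationWarning)]
```

Only the truthy `strict` path warns, so existing callers that never passed the argument see no change in behaviour.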

Lib/test/test_htmlparser.py

Lines changed: 4 additions & 2 deletions
@@ -102,7 +102,8 @@ def parse(source=source):
 class HTMLParserStrictTestCase(TestCaseBase):
 
     def get_collector(self):
-        return EventCollector(strict=True)
+        with support.check_warnings(("", DeprecationWarning), quiet=False):
+            return EventCollector(strict=True)
 
     def test_processing_instruction_only(self):
         self._run_check("<?processing instruction>", [
@@ -594,7 +595,8 @@ def test_broken_condcoms(self):
 class AttributesStrictTestCase(TestCaseBase):
 
     def get_collector(self):
-        return EventCollector(strict=True)
+        with support.check_warnings(("", DeprecationWarning), quiet=False):
+            return EventCollector(strict=True)
 
     def test_attr_syntax(self):
         output = [
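
The test changes wrap the strict-mode constructor call so that the expected `DeprecationWarning` is checked rather than leaked. Outside CPython's own test suite, a similar check can be sketched with `unittest.TestCase.assertWarns` from the standard library. `DeprecatedFlag`, `DeprecatedFlagTest`, and `run_tests` below are hypothetical stand-ins, not code from this commit.

```python
import unittest
import warnings


class DeprecatedFlag:
    """Hypothetical stand-in for a constructor with a deprecated argument."""

    def __init__(self, strict=False):
        if strict:
            warnings.warn("The strict mode is deprecated.",
                          DeprecationWarning, stacklevel=2)
        self.strict = strict


class DeprecatedFlagTest(unittest.TestCase):

    def get_collector(self):
        # assertWarns fails the test if no DeprecationWarning is raised
        # inside the block.
        with self.assertWarns(DeprecationWarning):
            return DeprecatedFlag(strict=True)

    def test_strict_warns(self):
        self.assertTrue(self.get_collector().strict)

    def test_default_is_silent(self):
        with warnings.catch_warnings():
            warnings.simplefilter("error")  # any warning becomes an exception
            self.assertFalse(DeprecatedFlag().strict)


def run_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DeprecatedFlagTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the warning check inside `get_collector` mirrors the structure of the diff: every test that builds a strict collector verifies the deprecation as a side effect.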

Misc/NEWS

Lines changed: 4 additions & 1 deletion
@@ -43,6 +43,9 @@ Core and Builtins
 Library
 -------
 
+- Issue #15114: the strict mode of HTMLParser and the HTMLParseError exception
+  are deprecated now that the parser is able to parse invalid markup.
+
 - Issue #3665: \u and \U escapes are now supported in unicode regular
   expressions. Patch by Serhiy Storchaka.
 
@@ -78,7 +81,7 @@ Library
 - Issue #9527: datetime.astimezone() method will now supply a class
   timezone instance corresponding to the system local timezone when
   called with no arguments.
-
+
 - Issue #14653: email.utils.mktime_tz() no longer relies on system
   mktime() when timezone offset is supplied.
 