Description
Line 208 in eefff68
I think it's problematic that the method email.policy.EmailPolicy._fold() relies on the generic str / bytes method .splitlines(), especially in an email-processing context where the "official" line ending is \r\n.
I'm one of many devs who also leniently recognise (regex) [\r\n]+ as a line break in emails. But I have no idea why all the other line-ending characters from other contexts are also honoured in a mail-specific context.
On the surface, .splitlines() seems a simple way to cover the case of a header value itself containing line endings.
However, in cases where a header value may contain multi-byte sequences, this causes breakage, because characters such as \x0C (which may potentially be part of a sequence) instead get treated as the legacy ASCII 'form feed' and deemed to be a line ending. This then breaks the sequence, which in turn causes problems in the subsequent processing of the email message.
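To make the difference concrete, here is a small sketch that asks Python itself which single characters str.splitlines() treats as a line boundary. The variable name boundaries is purely illustrative, and the scan is limited to the Basic Multilingual Plane on the assumption that no separators live outside it.

```python
# Collect every BMP character that str.splitlines() treats as a line boundary.
boundaries = [
    c for c in map(chr, range(0x10000))
    if len(f'a{c}b'.splitlines()) == 2
]
print([f'U+{ord(c):04X}' for c in boundaries])

# Form feed is on that list for str ...
print('a\x0cb'.splitlines())

# ... whereas bytes.splitlines() only breaks on \n, \r and \r\n.
print(b'a\x0cb'.splitlines())
```

Note that the bytes variant is already restricted to CR/LF, so the surprise is specific to header values that are handled as str.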
A specimen header (from real-world production traffic) which triggers this behaviour is:
b'Subject: P/L SEND : CARA-23PH00021,, 0xf2\x0C\xd8/FTEP'
Here, the \x0C is treated as a line ending, so the trailing portion b'\xd8/FTEP' gets wrapped and indented on the next line.
To work around this in my networks, I've had to subclass email.policy.EmailPolicy and override the method ._fold() to instead split only on CR/LFs, via:
    import re
    import sys
    import email.policy
    from email.utils import _has_surrogates  # private helper, also used by email.policy itself

    RE_EOL_STR = re.compile(r'[\r\n]+')
    RE_EOL_BYTES = re.compile(rb'[\r\n]+')
    ...
    class MyPolicy(email.policy.EmailPolicy):
        ...
        def _fold(self, name, value, refold_binary=False):
            """
            Need to override this from email.policy.EmailPolicy to stop it treating chars other than
            CR and LF as newlines
            :param name:
            :param value:
            :param refold_binary:
            :return:
            """
            if hasattr(value, 'name'):
                return value.fold(policy=self)
            maxlen = self.max_line_length if self.max_line_length else sys.maxsize
            # this is from the library version, and it improperly breaks on chars like 0x0c, treating
            # them as 'form feed' etc.
            # we need to ensure that only CR/LF is used as end of line
            #lines = value.splitlines()
            # this is a workaround which splits only on CR/LF characters;
            # pick the pattern by the value's type, since a bytes pattern cannot split a str
            if isinstance(value, bytes):
                lines = RE_EOL_BYTES.split(value)
            else:
                lines = RE_EOL_STR.split(value)
            refold = (self.refold_source == 'all' or
                      self.refold_source == 'long' and
                      (lines and len(lines[0])+len(name)+2 > maxlen or
                       any(len(x) > maxlen for x in lines[1:])))
            if refold or refold_binary and _has_surrogates(value):
                return self.header_factory(name, ''.join(lines)).fold(policy=self)
            return name + ': ' + self.linesep.join(lines) + self.linesep
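As a rough, self-contained sketch of how such an override behaves, the specimen header value can be pushed through fold_binary(). The class name CRLFOnlyPolicy is made up for illustration, _has_surrogates is a private helper in email.utils (so relying on it is an assumption that may not hold across Python versions), and decoding with 'surrogateescape' is assumed to mirror how the value arrives from a bytes parser.

```python
import re
import sys
import email.policy
from email.utils import _has_surrogates  # private helper; assumed to stay importable

RE_EOL = re.compile(r'[\r\n]+')


class CRLFOnlyPolicy(email.policy.EmailPolicy):  # hypothetical name
    def _fold(self, name, value, refold_binary=False):
        if hasattr(value, 'name'):
            return value.fold(policy=self)
        maxlen = self.max_line_length if self.max_line_length else sys.maxsize
        # Split on CR/LF only, instead of value.splitlines().
        lines = RE_EOL.split(value)
        refold = (self.refold_source == 'all' or
                  self.refold_source == 'long' and
                  (lines and len(lines[0]) + len(name) + 2 > maxlen or
                   any(len(x) > maxlen for x in lines[1:])))
        if refold or refold_binary and _has_surrogates(value):
            return self.header_factory(name, ''.join(lines)).fold(policy=self)
        return name + ': ' + self.linesep.join(lines) + self.linesep


# Header value from the specimen, decoded the way undecodable bytes are
# usually smuggled through str in the email package.
raw_value = b'P/L SEND : CARA-23PH00021,, 0xf2\x0c\xd8/FTEP'
value = raw_value.decode('ascii', 'surrogateescape')

folded = CRLFOnlyPolicy().fold_binary('Subject', value)
print(folded)
```

With the policy defaults (cte_type of '8bit', max_line_length of 78), the \x0c\xd8 pair should come out on a single line, untouched.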
Could the maintainers of this class please share their thoughts?
Given that RFC 822 and related standards specify that the "official" line ending is \r\n, is there any reason to also catch every other character that other string contexts consider a line ending?
Linked PRs
- gh-117313: Fix re-folding email messages containing non-standard line separators #117369
- [3.12] gh-117313: Fix re-folding email messages containing non-standard line separators (GH-117369) #117971
- [3.11] gh-117313: Fix re-folding email messages containing non-standard line separators (GH-117369) #117972