email.policy.EmailPolicy._fold() breaking multi-byte Unicode sequences #117313

Closed
@davidmcnabnz


lines = value.splitlines()

I think it's problematic that email.policy.EmailPolicy._fold() relies on the generic str method .splitlines(), especially in an email-processing context where the "official" line ending is \r\n.

Like many devs, I leniently accept anything matching the regex [\r\n]+ as a line break in emails. But I don't see why the other line-boundary characters that .splitlines() recognises in general-purpose text (form feed, vertical tab and so on) should also count as line endings in a mail-manipulation context.
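For reference, the set of characters involved can be listed directly. This is only a sketch: it probes each code point in the Basic Multilingual Plane (an assumption made to keep the scan short) and compares str.splitlines() with the lenient [\r\n]+ split mentioned above. The variable names are illustrative.

```python
import re

EOL = re.compile(r'[\r\n]+')

# Code points that str.splitlines() treats as a line boundary.
# Only U+0000..U+FFFF is scanned here (assumption: enough for this sketch).
boundaries = [ch for ch in map(chr, range(0x10000))
              if len(f"a{ch}b".splitlines()) == 2]

# Code points that the CR/LF-only regex treats as a line boundary.
crlf_only = [ch for ch in map(chr, range(0x10000))
             if len(EOL.split(f"a{ch}b")) == 2]

print([ascii(ch) for ch in boundaries])
print([ascii(ch) for ch in crlf_only])
```

Everything in the first list that is missing from the second (\x0b, \x0c, \x1c, \x1d, \x1e, \x85, \u2028, \u2029) is a character that ends a line for .splitlines() but not under a CR/LF-only rule.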

On the surface, .splitlines() seems a simple way to cover the case of a header value itself containing line endings.

However, when a header value contains multi-byte Unicode sequences, this causes breakage: a character such as \x0C, which may be part of such a sequence, is instead treated as the legacy ASCII form feed and deemed to be a line ending. That splits the sequence in two, which in turn causes problems in the subsequent processing of the email message.

A specimen header (from real-world production traffic) which triggers this behaviour is:

b'Subject: P/L SEND : CARA-23PH00021,,   0xf2\x0C\xd8/FTEP'

Here, the \x0C is treated as a line-ending, so the trailing portion b'\xd8/FTEP' gets wrapped and indented on the next line.
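The split itself can be seen without involving the folding code at all. The sketch below takes the specimen bytes, decodes them the way the bytes parser does (ASCII with surrogateescape, so \xd8 is carried as the lone surrogate '\udcd8'), and compares .splitlines() with a CR/LF-only split. The variable names are illustrative.

```python
import re

# Specimen header from this report, copied byte for byte.
raw = b'Subject: P/L SEND : CARA-23PH00021,,   0xf2\x0C\xd8/FTEP'

# Header values are handled as str; undecodable bytes become lone surrogates.
name, _, value = raw.decode('ascii', 'surrogateescape').partition(': ')

generic = value.splitlines()            # also breaks at the form feed
strict = re.split(r'[\r\n]+', value)    # only CR and LF end a line

# ascii() is used because lone surrogates cannot be written to most terminals.
print(len(generic), ascii(generic[-1]))
print(len(strict))
```

Note that the bytes method behaves differently: raw.splitlines() leaves the specimen in one piece, because bytes.splitlines() only breaks at \n, \r and \r\n. The extra boundaries come from the str version.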

To work around this in my networks, I've had to subclass email.policy.EmailPolicy and override ._fold() so that it splits only on CR/LF, via

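import re
import sys
import email.policy
from email.utils import _has_surrogates  # private helper, as imported by email/policy.py

RE_EOL_STR = re.compile(r'[\r\n]+')
RE_EOL_BYTES = re.compile(rb'[\r\n]+')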

...

class MyPolicy(email.policy.EmailPolicy):

    ...

    def _fold(self, name, value, refold_binary=False):
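        """
        Override of email.policy.EmailPolicy._fold() that stops it treating
        chars other than CR and LF as newlines.

        :param name: header field name
        :param value: header value (a str, or a header object with a .fold() method)
        :param refold_binary: if true, values containing surrogate-escaped
            bytes are refolded through the header factory
        :return: the folded header as a str, terminated by self.linesep
        """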
        if hasattr(value, 'name'):
            return value.fold(policy=self)
        maxlen = self.max_line_length if self.max_line_length else sys.maxsize

        # this is from the library version, and it improperly breaks on chars like 0x0c, treating
        # them as 'form feed' etc.
        # we need to ensure that only CR/LF is used as end of line
        #lines = value.splitlines()

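        # this is a workaround which splits only on CR/LF characters;
        # the pattern is chosen from the type of value rather than from
        # refold_binary, because fold() passes refold_binary=True with a str
        # value and a bytes pattern cannot be applied to a str
        if isinstance(value, bytes):
            lines = RE_EOL_BYTES.split(value)
        else:
            lines = RE_EOL_STR.split(value)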

        refold = (self.refold_source == 'all' or
                  self.refold_source == 'long' and
                    (lines and len(lines[0])+len(name)+2 > maxlen or
                     any(len(x) > maxlen for x in lines[1:])))
        if refold or refold_binary and _has_surrogates(value):
            return self.header_factory(name, ''.join(lines)).fold(policy=self)
        return name + ': ' + self.linesep.join(lines) + self.linesep
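For anyone who wants to try the workaround in isolation, here is a condensed, self-contained sketch of the same override together with a usage check. The class name CRLFOnlyPolicy and the sample value are made up for illustration, and the import of the private helper email.utils._has_surrogates mirrors what email/policy.py does itself, so it may change between Python versions.

```python
import re
import sys
import email.policy
from email.utils import _has_surrogates  # private stdlib helper (assumption: still present)

EOL = re.compile(r'[\r\n]+')


class CRLFOnlyPolicy(email.policy.EmailPolicy):  # hypothetical name
    def _fold(self, name, value, refold_binary=False):
        # Header objects know how to fold themselves.
        if hasattr(value, 'name'):
            return value.fold(policy=self)
        maxlen = self.max_line_length or sys.maxsize
        # Split on CR/LF only, instead of value.splitlines().
        lines = EOL.split(value)
        refold = (self.refold_source == 'all' or
                  self.refold_source == 'long' and
                  (lines and len(lines[0]) + len(name) + 2 > maxlen or
                   any(len(x) > maxlen for x in lines[1:])))
        if refold or refold_binary and _has_surrogates(value):
            return self.header_factory(name, ''.join(lines)).fold(policy=self)
        return name + ': ' + self.linesep.join(lines) + self.linesep


policy = CRLFOnlyPolicy()
# A short value with an embedded form feed; it stays on one line.
folded = policy.fold_binary('Subject', 'report\x0cpage 2')
print(folded)
```

With the default cte_type of '8bit', fold_binary() calls _fold() with refold_binary=False, so the short value is returned as a single header line with the form feed left in place.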

Can the maintainers of this class please advise with their thoughts?

Given that RFC 822 and related standards specify \r\n as the "official" line ending, is there any reason to also catch every other character that happens to count as a line ending in other string contexts?
