email.policy.EmailPolicy._fold() breaking multi-byte Unicode sequences

https://github.com/python/cpython/blob/eefff682f09394fe4f18b7d7c6ac4c635caadd02/Lib/email/policy.py#L208

I think it's problematic that the method `email.policy.EmailPolicy._fold()` relies on the generic `str` / `bytes` method `.splitlines()`, especially in an email-processing context where the "official" line ending is `\r\n`.

Like many developers, I leniently accept anything matching the regex `[\r\n]+` as a line break in emails. But I don't see why all the other line-ending characters from other contexts should also be honoured in code that is specific to mail handling.

On the surface, `.splitlines()` seems a simple way to cover the case of a header value that itself contains line endings.

However, where a header value contains multi-byte Unicode sequences, this causes breakage: a character such as `\x0C` (which may be part of such a sequence) is instead treated as the legacy ASCII form feed and deemed to be a line ending. This breaks the sequence, which in turn causes problems in the subsequent processing of the email message.
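
To make the difference concrete, here is a minimal sketch comparing `str.splitlines()` with a CR/LF-only regex split. The sample string and the names `RE_EOL` and `sample` are made up for illustration; they are not taken from the library.

```python
import re

# CR/LF-only splitter (illustrative name, not part of the email package).
RE_EOL = re.compile(r'[\r\n]+')

# Made-up value containing a form feed (\x0c), a file separator (\x1c),
# a Unicode line separator (\u2028) and one genuine CRLF.
sample = 'one\x0ctwo\x1cthree\u2028four\r\nfive'

# str.splitlines() treats every one of those characters as a line boundary.
by_splitlines = sample.splitlines()

# The regex only breaks at the CRLF.
by_crlf = RE_EOL.split(sample)

print(by_splitlines)
print(by_crlf)
```

Only the second result keeps the non-CR/LF characters inside the value.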

A specimen header (from real-world production traffic) which triggers this behaviour is:
```
b'Subject: P/L SEND : CARA-23PH00021,,   0xf2\x0C\xd8/FTEP'
```
Here, the `\x0C` is treated as a line-ending, so the trailing portion `b'\xd8/FTEP'` gets wrapped and indented on the next line.
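
The split can be reproduced outside the email package. This sketch assumes the header value reaches `_fold()` as a `str` decoded from the raw bytes with the `surrogateescape` error handler (my understanding of how the bytes parser carries non-ASCII bytes); the variable names are illustrative.

```python
# Value part of the specimen header, as raw bytes.
raw_value = b'P/L SEND : CARA-23PH00021,,   0xf2\x0c\xd8/FTEP'

# Assumption: non-ASCII bytes are carried as surrogate escapes in a str.
value = raw_value.decode('ascii', 'surrogateescape')

# splitlines() breaks the value at the form feed.
parts = value.splitlines()
print(parts)

# The second fragment round-trips back to the trailing bytes.
tail = parts[1].encode('ascii', 'surrogateescape')
print(tail)
```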

To work around this in my networks, I've had to subclass `email.policy.EmailPolicy` and override the method `._fold()` so that it splits only on CR/LF:
```
import email.policy
import re
import sys
from email.utils import _has_surrogates  # private helper, also used by email.policy

# Split only on CR and/or LF. In the library's own call paths, `value` is a
# str by the time _fold() runs (undecodable bytes are carried as surrogate
# escapes), and refold_binary is only a flag. A bytes pattern is therefore
# not needed, and applying one to a str would raise TypeError.
RE_EOL_STR = re.compile(r'[\r\n]+')

...

class MyPolicy(email.policy.EmailPolicy):

    ...

    def _fold(self, name, value, refold_binary=False):
        """
        Override EmailPolicy._fold() so that only CR and LF are treated as
        line endings when splitting a header value.
        """
        if hasattr(value, 'name'):
            return value.fold(policy=self)
        maxlen = self.max_line_length if self.max_line_length else sys.maxsize

        # The library version calls value.splitlines() here, which also
        # breaks on characters such as 0x0c (form feed).
        # Split on CR/LF only instead.
        lines = RE_EOL_STR.split(value)

        refold = (self.refold_source == 'all' or
                  self.refold_source == 'long' and
                    (lines and len(lines[0])+len(name)+2 > maxlen or
                     any(len(x) > maxlen for x in lines[1:])))
        if refold or refold_binary and _has_surrogates(value):
            return self.header_factory(name, ''.join(lines)).fold(policy=self)
        return name + ': ' + self.linesep.join(lines) + self.linesep
```
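
For reference, a self-contained sketch of how such a policy behaves on the specimen value. `CRLFOnlyPolicy` and `CRLF_ONLY` are hypothetical names for this example, and the decode step assumes the value arrives as a `str` with surrogate escapes, as the bytes parser would produce.

```python
import re
import sys
import email.policy
from email.utils import _has_surrogates  # private helper used by email.policy

# Hypothetical names for this sketch: CRLF_ONLY and CRLFOnlyPolicy.
CRLF_ONLY = re.compile(r'[\r\n]+')


class CRLFOnlyPolicy(email.policy.EmailPolicy):
    def _fold(self, name, value, refold_binary=False):
        if hasattr(value, 'name'):
            return value.fold(policy=self)
        maxlen = self.max_line_length if self.max_line_length else sys.maxsize
        lines = CRLF_ONLY.split(value)  # CR/LF only, unlike splitlines()
        refold = (self.refold_source == 'all' or
                  self.refold_source == 'long' and
                    (lines and len(lines[0]) + len(name) + 2 > maxlen or
                     any(len(x) > maxlen for x in lines[1:])))
        if refold or refold_binary and _has_surrogates(value):
            return self.header_factory(name, ''.join(lines)).fold(policy=self)
        return name + ': ' + self.linesep.join(lines) + self.linesep


# The specimen value, decoded with surrogate escapes (assumed input form).
value = b'P/L SEND : CARA-23PH00021,,   0xf2\x0c\xd8/FTEP'.decode(
    'ascii', 'surrogateescape')

# fold_binary() is the path used when serialising to bytes; with the default
# cte_type of '8bit' it calls _fold() with refold_binary=False.
folded = CRLFOnlyPolicy().fold_binary('Subject', value)
print(folded)
```

With this override the header stays on a single line, and the form feed and the `\xd8` byte come out unchanged.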

Could the maintainers of this class please share their thoughts?

Given that RFC 822 and related standards specify `\r\n` as the "official" line ending, is there any reason to also catch every other character that counts as a line ending in other string contexts?



### Linked PRs
* gh-117369
* gh-117971
* gh-117972
