Incorrect decoding of preamble in email parser

# Bug report

### Bug description:

Method: `email.message_from_binary_file()`

We seem to have found an issue in the `email.parser` module when parsing a raw binary MIME message whose preamble contains UTF-8 encoded data.

Calling `.as_string()` on the returned message yields a `str` that contains lone surrogate code points, so encoding it as UTF-8 fails (_"surrogates not allowed"_).
The problem does not occur when using `email.message_from_string()`.

Below is a reasonably minimal example (adapted from the MIME RFC) that exposes the issue.
Here `"préamble"` is decoded as `"pr\udcc3\udca9amble"`.

------

```python
import io
import email
import email.policy

CONTENTS = """
From: Nathaniel Borenstein <nsb@bellcore.com>
To: Ned Freed <ned@innosoft.com>
Subject: Sample message
MIME-Version: 1.0
Content-type: multipart/mixed; boundary="i-am-boundary"

This is the préamble.  It is to be ignored, though it
is a handy place for mail composers to include an
explanatory note to non-MIME compliant readers.

--i-am-boundary
Content-type: text/plain; charset=us-ascii

This is explicitly typed plain ASCII text.
It DOES end with a linebreak.

--i-am-boundary
Content-type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

This should be correctly encapsulated: Un petit café ?

--i-am-boundary--
This is the epilogue.  It is also to be ignored.

""".lstrip()
CONTENTS_BYTES = io.BytesIO(CONTENTS.encode())

# Setting utf8=True has no impact on the result
POLICY = email.policy.default.clone(utf8=True)


def show_message(msg):
    # Parts are correctly decoded in all cases
    for i, part in enumerate(msg.iter_parts(), 1):
        print(f'MIME PART {i}:')
        as_string = part.as_string()
        as_bytes = as_string.encode()
        print(as_string)

    as_string = msg.as_string()

    # When source was bytes, the unicode result of as_string is incorrect
    as_bytes = as_string.encode()
    print(as_string)


msg_from_binary = email.message_from_binary_file(CONTENTS_BYTES, policy=POLICY)
show_message(msg_from_binary)
# UnicodeEncodeError: 'utf-8' codec can't encode characters in position 192-193: surrogates not allowed

# Parsing the same content from a str works as expected
#msg_from_string = email.message_from_string(CONTENTS, policy=POLICY)
#show_message(msg_from_string)

```

I also tried adding a charset and `Content-Transfer-Encoding: 8bit` to the headers, with the same result (I do not know whether this is actually valid).


Looking at the [current code](https://github.com/python/cpython/blob/26bab423fb6e08f9df23c5c8f55e3d019150c454/Lib/email/parser.py#L95-L103), it seems that `BytesParser` always decodes the input as ASCII with `errors='surrogateescape'`.
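
To illustrate why that decoding produces the surrogates seen above, here is a minimal sketch that uses only plain `bytes`/`str` operations and does not involve the email package. The variable names are illustrative, and the last step is only a demonstration that the original bytes are still recoverable, not a proposed fix.

```python
# UTF-8 bytes for the word used in the preamble: "é" becomes b"\xc3\xa9".
raw = "préamble".encode("utf-8")

# Decoding as ASCII with surrogateescape maps each non-ASCII byte to a
# lone surrogate in the range U+DC80..U+DCFF instead of raising.
decoded = raw.decode("ascii", errors="surrogateescape")
print(ascii(decoded))

# A strict UTF-8 encode of that string fails, as in the traceback above.
try:
    decoded.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Reversing the surrogateescape step restores the original bytes,
# which can then be decoded as UTF-8.
repaired = decoded.encode("ascii", errors="surrogateescape").decode("utf-8")
print(repaired)
```

The two bytes of the UTF-8 encoded `é` each become one lone surrogate, which matches the `\udcc3\udca9` pair reported in the preamble.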

### CPython versions tested on:

3.11, CPython main branch

### Operating systems tested on:

Linux


### Linked PRs
* gh-134384
