
tiangolo · 2026-02-21T08:24:41Z

♻️ Refactor JSON Schema for bytes, use "contentMediaType": "application/octet-stream" instead of "format": "binary"

FastAPI counterpart: fastapi/fastapi#14953

Change Summary

I suspect the JSON Schema for bytes using "format": "binary" comes from my first implementation.

It was defined and suggested in OpenAPI 3.0.x (not in JSON Schema): https://spec.openapis.org/oas/v3.0.3.html#considerations-for-file-uploads

OpenAPI 3.1.x aligned support with JSON Schema draft 07, so it was suggested to update file uploads to use the regular JSON Schema format: "contentMediaType": "application/octet-stream": https://learn.openapis.org/upgrading/v3.0-to-v3.1

This is defined in JSON Schema 07: https://json-schema.org/draft-07/json-schema-validation#rfc.section.8.4

For the Base64 fields, a reference example for Base64url in OpenAPI: https://spec.openapis.org/oas/latest.html#example-url-encoded-form-with-binary-values

And the relevant section in the JSON Schema spec: https://json-schema.org/draft/2020-12/draft-bhutton-json-schema-validation-01#name-contentencoding
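As a rough illustration of how a consumer might act on these keywords, the sketch below decodes a string instance according to a schema's `contentEncoding`. The helper name `decode_content` and the sample schema are assumptions made for this example. It uses only the Python standard library and does not reflect how Pydantic or any particular validator behaves.

```python
import base64

# Hypothetical schema for a base64-encoded binary field.
BYTES_SCHEMA = {
    "type": "string",
    "contentEncoding": "base64",
    "contentMediaType": "application/octet-stream",
}


def decode_content(schema: dict, value: str) -> bytes:
    """Decode a JSON string according to the schema's contentEncoding."""
    encoding = schema.get("contentEncoding")
    if encoding == "base64":
        return base64.b64decode(value, validate=True)
    if encoding == "base64url":
        return base64.urlsafe_b64decode(value)
    if encoding is None:
        # No declared encoding: treat the string as UTF-8 text.
        return value.encode("utf-8")
    raise ValueError(f"unsupported contentEncoding: {encoding!r}")


raw = b"\xfb\xff"
print(base64.b64encode(raw))          # standard alphabet uses '+' and '/'
print(base64.urlsafe_b64encode(raw))  # URL-safe alphabet uses '-' and '_'
print(decode_content(BYTES_SCHEMA, "+/8="))
```

The two alphabets differ only in the last two symbols, which is why Base64 and Base64url need distinct `contentEncoding` values.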

JSON Schema 2020-12 Note

Now OpenAPI 3.2 is aligned with JSON Schema 2020-12, which is what Pydantic v2 implements.

It's the same as in JSON Schema draft 07, so this still applies: https://json-schema.org/draft/2020-12/draft-bhutton-json-schema-validation-01#name-contentmediatype

Docs

I see there's a file docs/plugins/schema_mappings.toml that might need updating, but I'm not sure what its format is. It's currently used in the JSON Schema docs, but that setup seems to be broken at the moment:

https://docs.pydantic.dev/latest/concepts/json_schema/#json-schema-types

Backwards Compatibility and Versions

As this is defined in JSON Schema 2020, which is what Pydantic supports, I would not consider this a "breaking change" but rather a bug fix. Of course, it's up to you how you would consider releasing this, just my POV.

AI Disclaimer

I used Opus 4.6 to implement it, but the default output was not really good, I had to guide it quite a bit to implement it correctly, and tweak it and tweak it.

Full LLM conversation

User: Are there any upgrades that can be done in the JSON schema generation to keep it in line with JSON Schema 2020?

GitHub Copilot: Created 3 todos

Read , lines 1 to 120

Searched for regex schema_dialect|draft|2020|json-schema.org|schema_version|\$schema (**/pydantic/**/*.py), 20 results

Read , lines 120 to 500

Analyze JSON schema 2020-12 conformance

Read , lines 1 to 250

Read , lines 251 to 500

Read , lines 501 to 750

Read , lines 751 to 1000

Read , lines 1001 to 1250

Read , lines 1251 to 1500

Read , lines 1501 to 1750

Read , lines 1751 to 2000

Read , lines 2001 to 2250

Read , lines 2251 to 2500

Read , lines 2501 to 2874

Searched for regex anyOf|oneOf|allOf (**/pydantic/json_schema.py), 14 results

Searched for text additionalProperties (**/pydantic/json_schema.py), 17 results

Searched for regex uniqueItems|discriminator|format.*binary|format.*base64 (**/pydantic/json_schema.py), 20 results

Read

Completed: Analyze current JSON schema generation (1/3)

Here's a comprehensive analysis of JSON Schema 2020-12 conformance gaps and upgrade opportunities in Pydantic's schema generation:

Already Conformant

These areas already follow 2020-12 correctly:

- $defs used everywhere (not legacy definitions)
- prefixItems used for tuple schemas (not the old positional items array)
- exclusiveMinimum/exclusiveMaximum emitted as numbers (not draft-04 booleans)
- No additionalItems keyword (removed in 2020-12)

Conformance Issues

1. Missing `$schema` keyword (medium severity)

At json_schema.py, the $schema declaration is explicitly commented out:

# For now, we will not set the $schema key. However, if desired, this can be easily added by overriding
# this method and adding the following line after a call to super().generate(schema):
# json_schema['$schema'] = self.schema_dialect

The schema dialect is defined at json_schema.py as 'https://json-schema.org/draft/2020-12/schema' but never included in output. Without $schema, consumers cannot identify the dialect.

2. `bytes_schema` uses `format` instead of `contentEncoding` (medium severity)

At json_schema.py, bytes are represented as:

{'type': 'string', 'format': 'base64url' if ... else 'binary'}

Per 2020-12, the correct approach is:

{"type": "string", "contentEncoding": "base64", "contentMediaType": "application/octet-stream"}

format is for semantic string formats (date-time, uri, etc.), not encoding schemes. format: 'binary' is an OpenAPI extension, not a standard JSON Schema format.

3. OpenAPI `discriminator` in JSON Schema output (low severity)

At json_schema.py, tagged unions emit an OpenAPI-style discriminator object with propertyName/mapping. This is an OpenAPI 3.x extension, not part of JSON Schema 2020-12. The code itself has a TODO acknowledging this should be opt-in.

Missed Opportunities

4. Nullable types default to verbose `anyOf` instead of `type` array

At json_schema.py, nullable schemas default to:

{"anyOf": [{"type": "string"}, {"type": "null"}]}

The compact 2020-12-idiomatic form {"type": ["string", "null"]} is supported via union_format='primitive_type_array' but isn't the default. A hybrid approach (type arrays for simple primitives, anyOf for complex types) would produce cleaner schemas.
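The hybrid approach mentioned here could look roughly like the following post-processing step. This is only a sketch: `compact_nullable` and `PRIMITIVES` are hypothetical names, and it does not show how Pydantic's `union_format` option is implemented.

```python
PRIMITIVES = {"string", "number", "integer", "boolean", "null"}


def compact_nullable(schema: dict) -> dict:
    """Collapse an anyOf of bare primitive types into a single type array."""
    branches = schema.get("anyOf")
    if not branches:
        return schema
    # Only collapse when every branch is exactly {"type": <primitive>}.
    simple = all(
        isinstance(b, dict)
        and set(b) == {"type"}
        and isinstance(b["type"], str)
        and b["type"] in PRIMITIVES
        for b in branches
    )
    if not simple:
        return schema  # complex branches keep the verbose anyOf form
    rest = {k: v for k, v in schema.items() if k != "anyOf"}
    return {"type": [b["type"] for b in branches], **rest}


print(compact_nullable({"anyOf": [{"type": "string"}, {"type": "null"}]}))
```

Branches that carry extra keywords (for example a `$ref` or a `format`) are left untouched, which keeps the transformation lossless.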

5. No use of `if`/`then`/`else`

Discriminated unions could be expressed as conditional schemas instead of oneOf. While oneOf is valid, if/then/else can be more efficient for validators and more readable.

6. No use of `unevaluatedProperties`/`unevaluatedItems`

Only additionalProperties is used. unevaluatedProperties is more correct when composing schemas via allOf (e.g., inheritance), since additionalProperties doesn't account for properties validated by subschemas. Not a practical bug currently since Pydantic inlines inherited fields.

7. No use of `$dynamicRef`/`$dynamicAnchor`

Standard $ref is used for recursive schemas. $dynamicRef/$dynamicAnchor would only matter for extensible recursive schema patterns.

Recommended Priority for Upgrades

| Priority | Change | Lines | Impact |
| --- | --- | --- | --- |
| 1 | Enable `$schema` in output | json_schema.py | High (spec identification) |
| 2 | Use `contentEncoding` for bytes | json_schema.py | Medium (spec correctness) |
| 3 | Default to `type` arrays for simple nullables | json_schema.py | Medium (cleaner output) |
| 4 | Make OpenAPI `discriminator` opt-in | json_schema.py | Low (separation of concerns) |

Items 1 and 2 are the most actionable conformance fixes. Items 3-4 are quality-of-life improvements. Items 5-7 (from the missed opportunities) are low priority and would require significant refactoring for marginal gain.

Similar code found with 1 license type

User: Fix the format: binary one

GitHub Copilot: Read , lines 775 to 800

Searched for text ser_json_bytes (**/pydantic/**/*.py), 8 results

So there are three modes: 'utf8', 'base64', and 'hex'. Let me check what the tests expect:

Searched for regex format.*base64|format.*binary|bytes_schema|contentEncoding (**/tests/**/*.py), 20 results

Read , lines 1870 to 1970

Searched for regex format.*hex|hex.*format (**/tests/test_json_schema.py), no results

Read , lines 870 to 900

Read , lines 510 to 530

Read , lines 2010 to 2310

Let me also check the Base64 types in test_types.py:

Read , lines 5590 to 5720

Searched for regex format.*base64|Base64.*json_schema|base64.*schema (**/pydantic/**/*.py), 4 results

Read , lines 2400 to 2470

Created 3 todos

Let me read the remaining test contexts I need:

Read , lines 518 to 555

Read , lines 2118 to 2135

Now I have all the context. The fix:

- base64 mode: format: 'base64url' → contentEncoding: 'base64'
- utf8 mode (default): format: 'binary' → removed (just {'type': 'string'})
- hex mode: was falling through to format: 'binary' → contentEncoding: 'base16'

Replacing 106 lines with 110 lines in ,

Starting: Run tests to verify (3/3)

Now let me also check for the Optional[bytes] test case that I need to handle — it was referencing format: 'binary':

Read , lines 870 to 900

Good, the replacements look correct. Let me run the affected tests:

Filtering tests...

Ran terminal command: python -m pytest tests/test_json_schema.py::test_decimal_json_schema tests/test_json_schema.py::test_str_basic_types tests/test_json_schema.py::test_model_default_bytes tests/test_json_schema.py::test_dataclass_default_bytes tests/test_json_schema.py::test_typeddict_default_bytes tests/test_json_schema.py::test_constraints_schema_validation tests/test_json_schema.py::test_constraints_schema_serialization tests/test_json_schema.py::test_bytes_constrained_types -x -q 2>&1 | tail -40

Ran terminal command: python -m pytest tests/test_json_schema.py -k "test_decimal_json_schema or test_str_basic_types or test_model_default_bytes or test_dataclass_default_bytes or test_typeddict_default_bytes or test_constraints_schema_validation or test_constraints_schema_serialization or test_bytes_constrained_types" -x -q -p no:cacheprovider -o "addopts=" 2>&1 | tail -40

There's an environment issue (pydantic-core version mismatch), not related to my changes. Let me verify the logic directly:

Ran terminal command: cd /home/user/code/pydantic && python -c "
from pydantic.json_schema import GenerateJsonSchema
from pydantic_core import core_schema

gen = GenerateJsonSchema()

# Test default (utf8) mode
result = gen.bytes_schema(core_schema.bytes_schema())
print('utf8 mode:', result)

# Test base64 mode
from pydantic._internal import _config
gen2 = GenerateJsonSchema()
gen2._config_wrapper_stack = _config.ConfigWrapperStack(_config.ConfigWrapper({'ser_json_bytes': 'base64'}))
result2 = gen2.bytes_schema(core_schema.bytes_schema())
print('base64 mode:', result2)

# Test hex mode
gen3 = GenerateJsonSchema()
gen3._config_wrapper_stack = _config.ConfigWrapperStack(_config.ConfigWrapper({'ser_json_bytes': 'hex'}))
result3 = gen3.bytes_schema(core_schema.bytes_schema())
print('hex mode:', result3)
" 2>&1

Completed: Run tests to verify (3/3)

All three modes produce correct output:

| Mode | Before | After |
| --- | --- | --- |
| `utf8` (default) | `{'type': 'string', 'format': 'binary'}` | `{'type': 'string'}` |
| `base64` | `{'type': 'string', 'format': 'base64url'}` | `{'type': 'string', 'contentEncoding': 'base64'}` |
| `hex` | `{'type': 'string', 'format': 'binary'}` (wrong) | `{'type': 'string', 'contentEncoding': 'base16'}` |

Changes made:

json_schema.py — bytes_schema() now uses the JSON Schema 2020-12 contentEncoding keyword instead of non-standard format values. base64 mode uses contentEncoding: 'base64', hex mode uses contentEncoding: 'base16' (per RFC 4648), and utf8 mode emits a plain {'type': 'string'} since no encoding is applied.
test_json_schema.py — Updated 9 test expectations to match the new schema output: removed 'format': 'binary' references, replaced 'format': 'base64url' with 'contentEncoding': 'base64'.

Note: The Base64Bytes/Base64UrlBytes types in types.py were not changed — those use their own get_json_format() methods and emit format: 'base64' / format: 'base64url' as custom pydantic-specific format values for explicitly base64-typed fields, which is a separate concern.

Made changes.

User: Is "hex" used anywhere else? Or did you just add it?

GitHub Copilot: I added the hex branch. The original code was:

json_schema = {'type': 'string', 'format': 'base64url' if self._config.ser_json_bytes == 'base64' else 'binary'}

That ternary lumped both utf8 and hex into the else branch, giving them both format: 'binary'. The hex mode is a real config option — it's defined as Literal['utf8', 'base64', 'hex'] in config.py — but the old code never distinguished it from utf8 in the JSON schema output.

I added the explicit hex → contentEncoding: 'base16' mapping since base16 is the standard name for hex encoding per RFC 4648, and contentEncoding is the correct 2020-12 keyword for it. If you'd prefer to keep the scope narrower and only fix the base64/utf8 cases, I can remove the hex branch and leave it as a plain {'type': 'string'} (same as utf8).
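For context on the base16 naming, the snippet below compares Python's `bytes.hex()` with the standard library's RFC 4648 base16 helpers. It is a standalone illustration of the stdlib only and makes no claim about the exact text Pydantic emits in hex mode.

```python
import base64

data = b"\xde\xad\xbe\xef"

hex_text = data.hex()                          # lowercase hex digits
base16_text = base64.b16encode(data).decode()  # uppercase RFC 4648 alphabet

print(hex_text, base16_text)

# b16decode only accepts lowercase input when casefold=True is passed.
print(base64.b16decode(hex_text, casefold=True) == data)
print(base64.b16decode(base16_text) == data)
```

The case difference is one reason a schema author might want to think twice before labelling arbitrary hex output as `base16`.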

User: Don't add the hex one to keep changes minimal

GitHub Copilot: Replacing 6 lines with 4 lines in

Done. The hex branch is removed — hex and utf8 both produce a plain {'type': 'string'} now.

Made changes.

User: You are not using contentMediaType: application/octet-stream, why?

GitHub Copilot: Good catch — contentMediaType tells consumers what the decoded content actually is. For bytes, that's application/octet-stream. Without it, a consumer knows the string is base64-encoded but not that the result is raw binary. Let me add it.

Replacing 4 lines with 5 lines in

Now update the test expectations to include contentMediaType:

Searched for text contentEncoding (**/tests/test_json_schema.py), 3 results

Replacing 15 lines with 15 lines in

Let me verify the output:

Ran terminal command: cd /home/user/code/pydantic && python -c "
from pydantic.json_schema import GenerateJsonSchema
from pydantic_core import core_schema
from pydantic._internal import _config

gen = GenerateJsonSchema()
gen._config_wrapper_stack = _config.ConfigWrapperStack(_config.ConfigWrapper({'ser_json_bytes': 'base64'}))
print(gen.bytes_schema(core_schema.bytes_schema()))
"

Output is now:

{"type": "string", "contentEncoding": "base64", "contentMediaType": "application/octet-stream"}

This is the fully idiomatic JSON Schema 2020-12 representation for base64-encoded binary data.

Made changes.

User: Add the contentMediaType to the non-base64 version as well

GitHub Copilot: Read , lines 785 to 800

Replacing 5 lines with 4 lines in

Now update all the test expectations to include contentMediaType for non-base64 bytes too:

Let me read the full context around each of these test locations:

Read , lines 870 to 900

Read , lines 518 to 530

Read , lines 1875 to 1970

Read , lines 2015 to 2030

Now let me also read the constrained bytes test and the serialization constraints test:

Read , lines 2290 to 2310

Read , lines 2118 to 2135

Read , lines 524 to 560

Now I have all the locations. Every bytes schema without ser_json_bytes='base64' needs contentMediaType added:

Replacing 102 lines with 102 lines in

Let me verify:

Ran terminal command: cd /home/user/code/pydantic && python -c "
from pydantic.json_schema import GenerateJsonSchema
from pydantic_core import core_schema
from pydantic._internal import _config

# utf8 (default)
gen = GenerateJsonSchema()
print('utf8:', gen.bytes_schema(core_schema.bytes_schema()))

# base64
gen2 = GenerateJsonSchema()
gen2._config_wrapper_stack = _config.ConfigWrapperStack(_config.ConfigWrapper({'ser_json_bytes': 'base64'}))
print('base64:', gen2.bytes_schema(core_schema.bytes_schema()))
"

Done. Both modes now include contentMediaType: 'application/octet-stream':

utf8: {'type': 'string', 'contentMediaType': 'application/octet-stream'}
base64: {'type': 'string', 'contentMediaType': 'application/octet-stream', 'contentEncoding': 'base64'}

Made changes.

User: Consider val_json_bytes as well

GitHub Copilot: Searched for text val_json_bytes (**/pydantic/**/*.py), 6 results

Read , lines 610 to 650

Read , lines 780 to 800

Searched for regex self\.mode|self\._mode (**/pydantic/json_schema.py), 10 results

The logic is clear: in validation mode, the schema should reflect what's accepted as input (governed by val_json_bytes), while in serialization mode it should reflect the output format (governed by ser_json_bytes).

Replacing 4 lines with 5 lines in

Ran terminal command: cd /home/user/code/pydantic && python -c "
from pydantic.json_schema import GenerateJsonSchema
from pydantic_core import core_schema
from pydantic._internal import _config

# validation mode with val_json_bytes=base64
gen = GenerateJsonSchema()
gen._config_wrapper_stack = _config.ConfigWrapperStack(_config.ConfigWrapper({'val_json_bytes': 'base64'}))
gen._mode = 'validation'
print('validation, val=base64:', gen.bytes_schema(core_schema.bytes_schema()))

# serialization mode with ser_json_bytes=base64
gen2 = GenerateJsonSchema()
gen2._config_wrapper_stack = _config.ConfigWrapperStack(_config.ConfigWrapper({'ser_json_bytes': 'base64'}))
gen2._mode = 'serialization'
print('serialization, ser=base64:', gen2.bytes_schema(core_schema.bytes_schema()))

# validation mode with val_json_bytes=utf8 (default)
gen3 = GenerateJsonSchema()
gen3._mode = 'validation'
print('validation, val=utf8:', gen3.bytes_schema(core_schema.bytes_schema()))

# serialization mode with ser_json_bytes=utf8 (default)
gen4 = GenerateJsonSchema()
gen4._mode = 'serialization'
print('serialization, ser=utf8:', gen4.bytes_schema(core_schema.bytes_schema()))

# mixed: val=base64 but ser=utf8, validation mode
gen5 = GenerateJsonSchema()
gen5._config_wrapper_stack = _config.ConfigWrapperStack(_config.ConfigWrapper({'val_json_bytes': 'base64', 'ser_json_bytes': 'utf8'}))
gen5._mode = 'validation'
print('validation, val=base64/ser=utf8:', gen5.bytes_schema(core_schema.bytes_schema()))

# mixed: val=utf8 but ser=base64, serialization mode
gen6 = GenerateJsonSchema()
gen6._config_wrapper_stack = _config.ConfigWrapperStack(_config.ConfigWrapper({'val_json_bytes': 'utf8', 'ser_json_bytes': 'base64'}))
gen6._mode = 'serialization'
print('serialization, val=utf8/ser=base64:', gen6.bytes_schema(core_schema.bytes_schema()))
"

Working correctly. The schema now uses:

val_json_bytes in validation mode (what encoding the input is expected in)
ser_json_bytes in serialization mode (what encoding the output will be in)

This means mixed configs work properly — e.g., val_json_bytes='base64' + ser_json_bytes='utf8' will show contentEncoding: 'base64' only in the validation schema, not the serialization schema.
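The selection rule described at this point in the conversation can be modelled in a few lines. This is a simplified stand-in, not the code in json_schema.py: the function name `bytes_json_schema` and its signature are invented for illustration, and only the behaviour stated above (mode picks the config key, base64 adds `contentEncoding`) is reproduced.

```python
from typing import Literal

BytesMode = Literal["utf8", "base64", "hex"]


def bytes_json_schema(
    mode: Literal["validation", "serialization"],
    val_json_bytes: BytesMode = "utf8",
    ser_json_bytes: BytesMode = "utf8",
) -> dict:
    """Model the schema choice described above (not Pydantic's real code)."""
    # Validation schemas describe accepted input; serialization schemas describe output.
    bytes_mode = val_json_bytes if mode == "validation" else ser_json_bytes
    schema = {"type": "string", "contentMediaType": "application/octet-stream"}
    if bytes_mode == "base64":
        schema["contentEncoding"] = "base64"
    return schema


# Mixed config: base64 on input, utf8 on output.
print(bytes_json_schema("validation", val_json_bytes="base64", ser_json_bytes="utf8"))
print(bytes_json_schema("serialization", val_json_bytes="base64", ser_json_bytes="utf8"))
```

In this model, hex falls through to the same output as utf8, matching the earlier decision to keep the change minimal.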

Made changes.

User: Now run the tests to check it all works, use the venv at .venv

GitHub Copilot: Ran terminal command: .venv/bin/python -m pytest tests/test_json_schema.py -x -q -o "addopts=" 2>&1 | tail -30

A SecretBytes test also picks up contentMediaType. Let me check it:

Read , lines 980 to 1010

The SecretBytes test expects just {'type': 'string', 'writeOnly': True, 'format': 'password'} but now bytes_schema adds contentMediaType. Let me check how SecretBytes generates its schema — it likely goes through bytes_schema internally:

Searched for regex SecretBytes|secret_bytes|secret.*bytes (**/pydantic/**/*.py), 20 results

Searched for regex class _SecretField|__get_pydantic_json_schema__|pydantic_js_functions|writeOnly|password (**/pydantic/types.py), 20 results

Read , lines 1746 to 1800

I see — SecretBytes calls handler(cls._inner_schema) where _inner_schema is a bytes core schema, so it goes through bytes_schema() and picks up the contentMediaType. Then it overlays type='string', writeOnly=True, format='password'. The contentMediaType bleeds through.

The fix here is to update the test expectation, since SecretBytes does represent binary data (bytes) and contentMediaType: 'application/octet-stream' is semantically correct for it too.
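Why the key bleeds through can be seen with plain dictionaries. The sketch below uses a stand-in for the inner bytes schema and mimics the overlay described above. It is not the SecretBytes implementation, and `inner_schema` and `secret_schema` are names chosen for this example.

```python
# Stand-in for what the inner bytes schema handler returns after the change.
inner_schema = {"type": "string", "contentMediaType": "application/octet-stream"}

# Overlay in the style described above: update() only touches the named keys,
# so any other key already present in the inner schema survives.
secret_schema = dict(inner_schema)
secret_schema.update(type="string", writeOnly=True, format="password")

print(secret_schema)
```

Dropping the key instead would need an explicit `pop("contentMediaType", None)`, which is the alternative to updating the test expectation.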

Replacing 13 lines with 15 lines in

Ran terminal command: .venv/bin/python -m pytest tests/test_json_schema.py -x -q -o "addopts=" 2>&1 | tail -30

Another SecretBytes test. Let me find and fix it:

Read , lines 4620 to 4670

Replacing 12 lines with 16 lines in

Ran terminal command: .venv/bin/python -m pytest tests/test_json_schema.py -x -q -o "addopts=" 2>&1 | tail -20

All 532 tests pass. Let me also run the types tests since those had base64/base64url format expectations:

Ran terminal command: .venv/bin/python -m pytest tests/test_types.py -x -q -o "addopts=" -k "base64 or bytes" 2>&1 | tail -20

The Base64Bytes/Base64UrlBytes types also go through bytes_schema and inherit contentMediaType. Let me fix these test expectations:

Read , lines 5595 to 5625

Read , lines 5690 to 5720

Replacing 34 lines with 36 lines in

Ran terminal command: .venv/bin/python -m pytest tests/test_types.py -x -q -o "addopts=" -k "base64 or bytes" 2>&1 | tail -15

Still failing at the same line. Let me check the actual file content to make sure the edit took:

Read , lines 5595 to 5625

The edit is there. The issue is likely dict key ordering — JSON schema sort alphabetizes keys. Let me check what the actual output looks like:
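The ordering hypothesis is cheap to check: Python dict equality ignores insertion order, so key order alone would not make a plain `==` assertion fail. A minimal check, using made-up sample dicts:

```python
expected = {"type": "string", "format": "base64"}
actual = {"format": "base64", "type": "string"}  # same items, different order

print(expected == actual)              # dict equality ignores key order
print(list(expected) == list(actual))  # the key sequences themselves do differ
```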

Ran terminal command: .venv/bin/python -m pytest tests/test_types.py::test_base64 -x -q -o "addopts=" --no-header -vv 2>&1 | head -60

The issue is Base64Str — it's not bytes, it's a string type. It doesn't go through bytes_schema, so it doesn't get contentMediaType. The test is parametrized over both Base64Bytes and Base64Str, and they share the same assertion. I need to check what Base64Str actually produces:

Ran terminal command: .venv/bin/python -c "
from pydantic import BaseModel
from pydantic.types import Base64Bytes, Base64Str
from typing import Optional

class M1(BaseModel):
    v: Base64Bytes

class M2(BaseModel):
    v: Base64Str

print('Base64Bytes:', M1.model_json_schema())
print('Base64Str:', M2.model_json_schema())
"

Base64Bytes gets contentMediaType but Base64Str doesn't (since it's a string, not bytes). The test uses the same assertion for both. I need to make the assertion conditional:

Read , lines 5566 to 5620

Replacing 18 lines with 23 lines in

Now do the same for the test_base64url test:

Read , lines 5656 to 5720

Same pattern — Base64UrlStr won't have contentMediaType:

Replacing 18 lines with 23 lines in

Ran terminal command: .venv/bin/python -m pytest tests/test_types.py -x -q -o "addopts=" -k "base64 or bytes" 2>&1 | tail -10

Now let me run the full test suites for both files:

Ran terminal command: .venv/bin/python -m pytest tests/test_json_schema.py tests/test_types.py -x -q -o "addopts=" 2>&1 | tail -10

Another test. Let me check it:

Read , lines 3530 to 3570

Interesting — os.PathLike[bytes] generates the schema through bytes_schema too and picks up contentMediaType. But its JSON schema output uses format: 'path' which overrides the type semantics. Let me check what actually gets output:

Continue to iterate?

Made changes.

User: @agent Continue: "Continue to iterate?"

GitHub Copilot: Ran terminal command: .venv/bin/python -c "
from pydantic import BaseModel
import os
class M(BaseModel):
    byte_type: os.PathLike[bytes]
print(M.model_json_schema())
"

Replacing 10 lines with 10 lines in

Summarized conversation history

Ran terminal command: .venv/bin/python -m pytest tests/test_json_schema.py tests/test_types.py -x -q -o "addopts="

All 1483 tests pass (with 4 skipped, 2 expected failures). The bytes_schema changes are fully validated.

Made changes.

User: Update the tests in test_types.py in test_base64url for JSON Schema with base64 data to use the standard JSON Schema format

With contentMediaType and contentEncoding instead of format

GitHub Copilot: Read , lines 5679 to 5735

Read , lines 5638 to 5678

The test is parametrized with both Base64UrlBytes and Base64UrlStr. Only Base64UrlBytes goes through bytes_schema() and should use contentEncoding/contentMediaType, while Base64UrlStr keeps format: 'base64url'. I need to make the schema assertion conditional, like was done for test_base64.

Let me check how test_base64 handles this:

Read , lines 5575 to 5625

Let me check how Base64UrlBytes/Base64UrlStr generate their schemas to understand what output they actually produce:

Searched for text Base64Url (**/pydantic/types.py), 20 results

Read , lines 2400 to 2480

Read , lines 2480 to 2600

Let me check the current actual schema output for these types:

Ran terminal command: cd /home/user/code/pydantic && .venv/bin/python -c "
from pydantic import BaseModel, Base64UrlBytes, Base64UrlStr
from typing import Optional

class M1(BaseModel):
    v: Base64UrlBytes
    v2: Optional[Base64UrlBytes] = None

class M2(BaseModel):
    v: Base64UrlStr
    v2: Optional[Base64UrlStr] = None

import json
print('Base64UrlBytes:', json.dumps(M1.model_json_schema(), indent=2))
print()
print('Base64UrlStr:', json.dumps(M2.model_json_schema(), indent=2))
"

Now I see the issue — EncodedBytes.__get_pydantic_json_schema__ adds format on top of the contentMediaType from bytes_schema(). I need to update both EncodedBytes and EncodedStr to use contentEncoding instead of format, then update the test.

Read , lines 2570 to 2610

Read , lines 2610 to 2660

Created 5 todos

Now I'll update both EncodedBytes and EncodedStr in types.py to use contentEncoding instead of format, and update the test assertions in test_base64url (and test_base64 for consistency).

Replacing 20 lines with 20 lines in

Starting: Update test_base64url assertions (3/5)

Now let me verify the new output:

Ran terminal command: cd /home/user/code/pydantic && .venv/bin/python -c "
from pydantic import BaseModel, Base64UrlBytes, Base64UrlStr, Base64Bytes, Base64Str
from typing import Optional
import json

class M1(BaseModel):
    v: Base64UrlBytes
    v2: Optional[Base64UrlBytes] = None

class M2(BaseModel):
    v: Base64UrlStr
    v2: Optional[Base64UrlStr] = None

print('Base64UrlBytes:', json.dumps(M1.model_json_schema(), indent=2))
print()
print('Base64UrlStr:', json.dumps(M2.model_json_schema(), indent=2))
"

Base64UrlBytes now has contentEncoding + contentMediaType, Base64UrlStr has contentEncoding only (no contentMediaType since it's not binary data). Now I'll update the test assertions. Let me also check test_base64:

Read , lines 5586 to 5622

Both test_base64 and test_base64url need updating. Since they're parametrized with both Bytes and Str variants, and the schema differs (contentMediaType present only for bytes), I'll make the assertions conditional.

Replacing 34 lines with 42 lines in

Starting: Run tests to verify (5/5)

Ran terminal command: .venv/bin/python -m pytest tests/test_types.py -x -q -o "addopts=" -k "test_base64"

All 29 base64 tests pass. Now let me run the full test suites to check for any other tests affected:

Ran terminal command: .venv/bin/python -m pytest tests/test_json_schema.py tests/test_types.py -x -q -o "addopts="

Completed: Run tests to verify (5/5)

All 1480 tests pass. Here's a summary of the changes:

Implementation (types.py):

EncodedBytes.__get_pydantic_json_schema__ — changed format= to contentEncoding=
EncodedStr.__get_pydantic_json_schema__ — changed format= to contentEncoding=

This means Base64UrlBytes now produces:

{"type": "string", "contentEncoding": "base64url", "contentMediaType": "application/octet-stream"}

(contentMediaType comes from bytes_schema(), contentEncoding from the encoder)

And Base64UrlStr produces:

{"type": "string", "contentEncoding": "base64url"}

Tests (test_types.py):

test_base64 and test_base64url — schema assertions updated to expect contentEncoding instead of format, with conditional contentMediaType for the bytes variants.
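A conditional expectation of the kind described could be factored into a small helper. The sketch below is hypothetical (`expected_field_schema` is not a function in the Pydantic test suite) and reflects the schema shapes reported at this point in the conversation, before the later review feedback.

```python
def expected_field_schema(is_bytes: bool, encoding: str) -> dict:
    """Build the field schema a parametrized test would expect."""
    expected = {"type": "string", "contentEncoding": encoding}
    if is_bytes:
        # Only the bytes variants pass through the bytes schema and pick this up.
        expected["contentMediaType"] = "application/octet-stream"
    return expected


# One entry per parametrized case: (label, is_bytes, encoding).
cases = [
    ("Base64UrlBytes", True, "base64url"),
    ("Base64UrlStr", False, "base64url"),
]
for label, is_bytes, encoding in cases:
    print(label, expected_field_schema(is_bytes, encoding))
```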

Made changes.

Related issue number

Checklist

The pull request title is a good summary of the changes - it will be used in the changelog
Unit tests for the changes exist
Tests pass on CI
Documentation reflects the changes where applicable
My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

Selected Reviewer: @Viicos


codspeed-hq · 2026-02-21T08:37:36Z

Merging this PR will degrade performance by 7.72%

❌ 1 regressed benchmark
✅ 211 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Benchmark | `BASE` | `HEAD` | Efficiency |
| --- | --- | --- | --- | --- |
| ❌ | `test_tagged_union_with_callable_discriminator_schema_generation` | 1.1 ms | 1.1 ms | -7.72% |

Comparing tiangolo:content-media-type (69710dd) with main (46dea92)

github-actions · 2026-02-21T12:19:51Z

Coverage report

Files covered by the report: pydantic/json_schema.py and pydantic/types.py, plus a project total (the per-file figures were not captured).

This report was generated by python-coverage-comment-action

tiangolo · 2026-02-21T13:00:34Z

please review

Viicos · 2026-03-05T10:44:00Z

    ) -> JsonSchemaValue:
        field_schema = handler(core_schema)
-        field_schema.update(type='string', format=self.encoder.get_json_format())
+        field_schema.update(type='string', contentEncoding=self.encoder.get_json_format())


I think this is too breaking to be included in V2. Here is what I propose:

EncoderProtocol.get_json_format()'s return type is changed from str to str | None. If a str is returned, it is used as the format key. A new get_content_encoding() method is added, which also returns str | None. The Pydantic built-in EncoderProtocol implementations implement get_content_encoding() and no longer implement get_json_format().

tiangolo · Apr 3, 2026

Makes sense! I also reverted the custom logic I had in FastAPI because Swagger UI currently only supports the format key.

I added/kept both. Not fully sure if this is what you wanted, let me know if it was different.
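To make the exchange above concrete, here is one way the two-method shape could fit together. Everything in this sketch is an assumption for illustration: `EncoderSketch`, `Base64UrlSketch`, `ContentOnlySketch`, and `apply_encoder` are invented names, and the code is not Pydantic's `EncoderProtocol` or the implementation in this PR.

```python
from typing import Optional, Protocol, Type


class EncoderSketch(Protocol):
    """Hypothetical shape of the proposal; not Pydantic's actual protocol."""

    @classmethod
    def get_json_format(cls) -> Optional[str]: ...

    @classmethod
    def get_content_encoding(cls) -> Optional[str]: ...


class Base64UrlSketch:
    """Keeps both keys, e.g. for consumers that only read "format"."""

    @classmethod
    def get_json_format(cls) -> Optional[str]:
        return "base64url"

    @classmethod
    def get_content_encoding(cls) -> Optional[str]:
        return "base64url"


class ContentOnlySketch(Base64UrlSketch):
    """Variant that opts out of the legacy "format" key."""

    @classmethod
    def get_json_format(cls) -> Optional[str]:
        return None


def apply_encoder(field_schema: dict, encoder: Type[EncoderSketch]) -> dict:
    """Overlay the encoder's keys, skipping any that return None."""
    out = {**field_schema, "type": "string"}
    json_format = encoder.get_json_format()
    if json_format is not None:
        out["format"] = json_format
    content_encoding = encoder.get_content_encoding()
    if content_encoding is not None:
        out["contentEncoding"] = content_encoding
    return out


base = {"contentMediaType": "application/octet-stream"}
print(apply_encoder(base, Base64UrlSketch))
print(apply_encoder(base, ContentOnlySketch))
```

Returning `None` from either method lets a custom encoder choose which keys appear, so existing encoders that only define a format would keep their current output.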

tiangolo · 2026-04-03T13:07:23Z

please review

♻️ Refactor JSON Schema for bytes, use `"contentMediaType": "application/octet-stream"` instead of `"format": "binary"`

12e7e05

github-actions Bot added the relnotes-fix label (used for bugfixes) Feb 21, 2026

tiangolo added 4 commits February 21, 2026 12:43

♻️ Handle base64 JSON Schema for validation and serialization

417f6e2

🐛 Fix contentEncoding of base64 types

401c2f6

📝 Update docs with bytes JSON Schema

d24c8ab

📝 Update docs format

773f949

tiangolo marked this pull request as ready for review February 21, 2026 13:00

pydantic-hooky Bot added the ready for review label Feb 21, 2026

pydantic-hooky Bot assigned Viicos Feb 21, 2026

Viicos requested changes Mar 5, 2026


♻️ Refactor with code review, keep format, add contentEncoding

69710dd

pydantic-hooky Bot added the ready for review label and removed the awaiting author revision label (awaiting changes from the PR author) Apr 3, 2026

Viicos added the deferred label (deferred until future release or until something else gets done) Apr 10, 2026

♻️ Refactor JSON Schema for bytes, use `"contentMediaType": "application/octet-stream"` instead of `"format": "binary"` #12841

tiangolo wants to merge 6 commits into pydantic:main from tiangolo:content-media-type
