
Conversation

@mateolafalce
Contributor

Work related to #187

Looking forward to your thoughts!

@johannschopplich
Collaborator

Hey there, I really like this page – both the idea and the execution!

At a high level, this does exactly what I'd hope for from an "advanced" doc: it gives a clean, byte‑level model of JSON vs TOON for a few canonical shapes, and it lands on the same story the rest of the docs tell informally:

  • TOON's best case is tabular arrays (uniform arrays of objects).
  • Plain objects and primitive arrays also benefit.
  • Arrays‑of‑arrays are a real worst case where TOON adds overhead instead of saving it.

That's very on‑brand for TOON and fits nicely alongside the empirical Benchmarks page. I'd definitely like to merge this in once it's tightened a bit.
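For readers landing here first, a minimal sketch of the kind of comparison the page formalizes (my own example, not one from the page, counting the compact JSON and canonical TOON forms):

```python
# Tabular array (TOON's best case): keys appear once in the header,
# and each uniform object collapses to a CSV-like row.
json_compact = '{"users":[{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]}'
toon_canonical = 'users[2]{id,name}:\n  1,Alice\n  2,Bob'

print(len(json_compact.encode("utf-8")))    # 57 bytes
print(len(toon_canonical.encode("utf-8")))  # 36 bytes
```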

Here are the main things I'd tweak before calling it "done":

1. Make the modelling assumptions explicit

Right now the math implicitly assumes a few things that are reasonable, but they should be spelled out so readers don't overgeneralize:

  • Compact JSON vs pretty JSON
    All the formulas are for compact JSON (no spaces/newlines outside strings), but several examples are pretty‑printed (see the sketch after this list). I'd add a short note in "Methodology" like:

    In this analysis, JSON is assumed to be compact (no spaces or newlines outside strings). Examples are formatted for readability, but byte counts are computed on the compact form.

  • Canonical TOON formatting
    The formulas assume canonical TOON (indent=2, one space after `:`, no spaces after commas in arrays/field lists), while some examples currently include extra spaces (`foo, bar, baz`, `id, qty`). Either:

    • normalize the examples to canonical TOON, or
    • explicitly say "we ignore optional spaces that don't change the structure for the purpose of these counts".
  • ASCII + "nice" strings
    You define |x|_utf8, but later effectively treat it as "character count". I'd add something like:

    For simplicity, we assume keys and structural tokens are ASCII, so byte length equals character count. Non‑ASCII characters affect both formats similarly and don't change the structural conclusions.

    And maybe one sentence that we only look at "clean" keys/values that don't trigger extra quoting beyond the special "looks like number/boolean" case you already call out.

  • Indentation depth
    You mention that nesting gets more expensive due to indentation, but the displayed formula for nested objects doesn't have a depth parameter. I'd either:

    • explicitly say "this formula is for a single nesting level; deeper nesting adds 2 spaces per indented line if we assume indentSize=2", or
    • extend the formula to include depth d and factor 2d bytes of indentation per nested line.

These clarifications don't change your conclusions, they just make it clear what world the formulas live in.
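On the compact-vs-pretty point specifically, the gap is easy to demonstrate; a minimal sketch (the example data is mine):

```python
import json

data = {"tags": ["foo", "bar", "baz"]}

compact = json.dumps(data, separators=(",", ":"))  # no spaces outside strings
pretty = json.dumps(data, indent=2)                # what readable examples show

print(len(compact.encode("utf-8")))  # 28 bytes: what the formulas should count
print(len(pretty.encode("utf-8")))   # 51 bytes: newlines and indentation add up
```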

2. Tone down the "bytes → tokens" and "always better" claims

Two bits of wording I'd soften so they match the rest of the docs:

  • Bytes vs tokens
    Right now it sounds like fewer bytes always means fewer tokens. That's heuristically true, but tokenizers are a bit weird. I'd rephrase to something like:

    Modern LLM tokenizers operate over UTF‑8 bytes, so byte length is a good upper bound and first‑order proxy for token count, even though the mapping isn't exactly linear. For measured token counts, see the Benchmarks page.

  • "TOON objectively reduces the byte-size … except arrays of arrays"
    In practice we already admit there are structures (deeply nested, zero tabular eligibility, etc.) where compact JSON can win. This page is more constrained (nice keys, nice values, specific structure families), so I'd say:

    Within this simplified model, and for the structure families analyzed here, TOON is strictly shorter than compact JSON in all cases except arrays of arrays.

That keeps the math precise without over‑promising.
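On the bytes-vs-tokens point, the proxy is also easy to sanity-check empirically if you ever want to; a sketch assuming the `tiktoken` package is installed (the encoding choice is just an example):

```python
import tiktoken  # assumption: tiktoken is available

enc = tiktoken.get_encoding("cl100k_base")  # example encoding, not prescriptive

for s in ('{"tags":["foo","bar","baz"]}', "tags[3]: foo,bar,baz"):
    print(len(s.encode("utf-8")), "bytes ->", len(enc.encode(s)), "tokens:", s)
# Fewer bytes usually means fewer tokens, but the mapping isn't exactly linear.
```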

3. A couple of math / notation cleanups

Nothing major here, just tightening:

  • Separate "value length" vs "field line length"
    Earlier you define L_json(O) and L_toon(O) in terms of L_(…)(v_i) as value lengths. Later, in the array sections, L_toon(A) starts including the key, brackets and colon on the same line. That makes the recursion a bit hard to follow.

    Easiest fix is probably to say explicitly: "In this section, L_toon(A) refers to the length of the whole field line key[N]: …, not just the array value," and avoid plugging it back into the top‑level recursive equation. That way you don't have to fully refactor the notation.

  • Consistent symbols
    You sometimes write |x|_utf8 and sometimes ||x||_utf8. I'd pick one (the single bars you used in the definitions are fine).

  • Indentation term in tabular arrays
    You currently have 2n for indents, which assumes indentSize=2 and one indent level. I'd either explicitly set indentSize=2:

    Assuming 2‑space indentation, each of the n rows contributes 2 bytes of indentation, i.e., a total of 2n.

    or write it as indentSize * n (see the sketch below).

Again, this is more about readability than correctness.
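For the indentation term, a tiny sketch of the generalization (the helper name and parameters are illustrative only):

```python
# Indentation bytes contributed by n rows; indent_size=2 at depth 1 recovers
# the "2n" term. Assumes every row sits at the same nesting depth.
def indentation_bytes(n: int, indent_size: int = 2, depth: int = 1) -> int:
    return indent_size * depth * n

print(indentation_bytes(3))           # 6: the 2n term for n=3 rows
print(indentation_bytes(3, depth=2))  # 12: the same rows nested one level deeper
```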

4. Examples and numbers lining up

A few of the "Empirical Validation" snippets have pretty formatting that doesn't quite match the byte counts you quote (e.g. `tags[3]: foo, bar, baz` vs `tags[3]: foo,bar,baz`). For someone who tries to count characters by hand, that's a bit confusing (see the quick check after this list).

I'd either:

  • switch all the examples to the exact strings you use in the math (no extra spaces), or
  • explicitly say "formatted for readability, numbers are computed on the compact form".
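For what it's worth, a quick byte check of the two spellings (computed on the UTF-8 encodings):

```python
spaced = "tags[3]: foo, bar, baz"   # as currently rendered in the doc
canonical = "tags[3]: foo,bar,baz"  # the compact form the math counts

print(len(spaced.encode("utf-8")))     # 22 bytes
print(len(canonical.encode("utf-8")))  # 20 bytes
```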

5. Small editorial suggestion

Two small presentational tweaks that might make this page even more approachable:

  • Label it clearly as Informative / Advanced (non‑normative), and link it from the Benchmarks page as "formal byte‑level model" so people know where to find it.
  • Consider moving a short "Key findings" bullet list (the three–four bullets from your Conclusion) up into the Overview so busy readers can get the punchline without reading the derivations.

Overall: I'd really like to merge this. The structure is solid, the conclusions line up with how we already talk about TOON, and it's especially nice that you're honest about arrays‑of‑arrays being worse. With the assumptions, wording, and a couple of small math/formatting nits cleaned up, this will be a great reference for readers who want the "why" behind TOON's efficiency, not just the benchmark charts! 🚀

@mateolafalce
Contributor Author

Thanks for the detailed feedback! I really appreciate the effort you put into checking it.

I’ve updated the PR to address all points. Here is a summary:

  1. Make the modelling assumptions explicit
  • Added a section clarifying assumptions about compact JSON, canonical TOON, ASCII/UTF-8, and byte vs token counts.
  • Explicitly stated the indentation assumptions (2 spaces).
  2. Tone down claims
  • Adjusted the wording to be more precise regarding bytes vs tokens.
    • Clarified that byte length is a "good upper bound and first-order proxy" for tokens.
    • Specified that TOON is shorter "within this simplified model, and for the structure families analyzed here."
  3. Math notation cleanups
  • Standardized on $|x|_{utf8}$.
  • Explicitly defined $L_{toon}(A)$ in the array sections to avoid recursion confusion.
  4. Examples and numbers
  • Updated examples to remove extra spaces, ensuring they strictly match the byte counts derived in the formulas.
  5. Editorial suggestions
  • Added a bulleted list of key findings to the Overview.
  • Added a Related Resources section at the end of the page linking to the formal byte-level model.

Happy to iterate further. Let me know if you spot anything else I might have missed or if you'd like any more tweaks before merging.

@johannschopplich
Collaborator

The overhaul looks good so far. I think it's essentially ready. I have only a few minor polish tweaks before merging:

  1. Conclusion wording (bytes → tokens, scope)
    In the Conclusion you currently say:

    “Since UTF‑8 byte length serves as the base unit for tokenization, the reduction in $L_{\text{toon}}$ directly correlates to a decrease in token count. Therefore, excluding the 'arrays of arrays' edge case, TOON provides an objective improvement…”

    To keep this aligned with the assumptions note and the Benchmarks page, could you soften this slightly, e.g.:

    • "serves as a good upper bound and first‑order proxy for tokenization" instead of "directly correlates", and
    • "…typically leads to fewer tokens for the structure families analyzed in this model…" instead of an unconditional "objective improvemen".

    Also nice would be a brief reminder there that this is "within this simplified model and the structure families considered here," so expectations stay calibrated.

  2. Restrict L_num to non‑negative integers
    In the integer length definition, you define $L_{\text{num}}(n)$ for $n = 0$ and $n > 0$, but don't mention the sign case. Since all your uses are counts/lengths anyway, a one‑liner like:

    "We restrict (n) to non‑negative integers in this analysis."

    right under the formula would close that gap cleanly.
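For reference, the definition being discussed is presumably the standard decimal digit count; with the suggested restriction made explicit, it would read roughly:

```latex
L_{\text{num}}(n) =
\begin{cases}
1 & \text{if } n = 0,\\
\lfloor \log_{10} n \rfloor + 1 & \text{if } n > 0,
\end{cases}
\qquad n \in \mathbb{Z}_{\ge 0}.
```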

  3. Small notation cleanups
    There are just a couple of stray places where the norm bars \|x\| slipped back in:

    • JSON primitive table: all the \|v_i\|_{\text{utf8}} entries.
    • TOON primitive table: same.
    • Summary table, “Tabular Arrays” row: 3+\|k\|.

    Could you change those to |v_i|_{\text{utf8}} / |k| for consistency with the earlier definition?

    Also, in the tabular‑arrays formula:

    \sum_{i=1}^{m} L_{\text{toon}}(k_i)

    would be clearer as:

    \sum_{i=1}^{m} L_{\text{str}}(k_i)

    since you’ve already defined L_str specifically for keys/strings and L_toon is used recursively for full values/structures.

  4. Explicitly mark as informative / non‑normative
    At the top you already say "This is an advanced reference," which is great. 🙌 To avoid any confusion with the spec, a short addition like:

    "This page is informative and non‑normative; it does not change the TOON specification."

    in the NOTE block (or right below it) would make that completely explicit.

@mateolafalce
Contributor Author

Conclusion wording (bytes → tokens, scope)

Soften the conclusion: "Since UTF-8 byte length serves as a good upper bound and first-order proxy for tokenization, the reduction in $L_{\text{toon}}$ typically leads to fewer tokens for the structure families analyzed in this model. Therefore, excluding the 'arrays of arrays' edge case, TOON provides a significant efficiency improvement within this simplified model and the structure families considered here, lowering inference and serialization costs."

Restrict L_num to non‑negative integers

"Let $n \in \mathbb{Z}_{\ge 0}$ be a non-negative integer. The number of bytes required to represent $n$ in decimal format is:"

Small notation cleanups

Fixed $\sum_{i=1}^{m} L_{\text{toon}}(k_i)$ → $\sum_{i=1}^{m} L_{\text{str}}(k_i)$.

I've changed the JSON primitive table, the TOON primitive table, and the Tabular Arrays row of the Summary table to use $|v_i|_{\text{utf8}}$ and $|k|$.

Explicitly mark as informative / non‑normative

Note

This page presents formal mathematical comparisons between TOON and JSON. For practical benchmarks and token counts, see Benchmarks. This is an advanced reference.

This page is informative and non-normative; it does not change the TOON specification.

Thanks for the detailed feedback! 💪

@mateolafalce
Contributor Author

(screenshot: the JSON, TOON, and Summary tables rendering incorrectly)

The rendering seems broken for the JSON, TOON, and Summary tables after the update.

@johannschopplich
Copy link
Collaborator

I see! I think that's coming from the Markdown table parser, not from the math feature itself.

When you changed \|v_i\|_{\text{utf8}} to |v_i|_{\text{utf8}}, the literal | characters are now being interpreted as column separators inside the table. Can you try using $L_{\text{str}}(v_i) = \lvert v_i\rvert_{\text{utf8}} + 2$ and the same for the other formulas?

  • JSON: \lvert v_i\rvert_{\text{utf8}}
  • etc.

@mateolafalce
Contributor Author

Fixed

| Type    | Formula                                                    |
| ------- | ---------------------------------------------------------- |
| String  | $L_{\text{str}}(v_i) = \lvert v_i\rvert_{\text{utf8}} + 2$ |
| Number  | $L_{\text{num}}(v_i) = \lvert v_i\rvert_{\text{utf8}}$     |
| Boolean | $L_{\text{bool}}(v_i) = \lvert v_i\rvert_{\text{utf8}}$    |
| Null    | $L_{\text{null}}(v_i) = \lvert v_i\rvert_{\text{utf8}}$    |

@johannschopplich
Collaborator

@mateolafalce Awesome work. Thank you! I have done some touch-ups to the wording. Could you check out my latest changes, please? Do they fit what you had in mind?

I've also slightly updated the title to: "TOON vs JSON: Byte-Level Efficiency Model" (it is still called "Efficiency Formalization" in the sidebar). I think that captures your core idea better, do you agree?

Also, I've added you as the author of the page. Is that OK for you?

@mateolafalce
Contributor Author

I’ve checked the changes, and they look great. I also agree with the new title; it captures the core concept much more precisely.

Being listed as the author is perfectly fine with me. Thanks @johannschopplich for your help and feedback! 🙌

@johannschopplich
Collaborator

Thank you for your contribution. 🙂

@johannschopplich johannschopplich merged commit 412ebcb into toon-format:main Nov 28, 2025