
Conversation

@mateolafalce
Contributor

Work related to #187

Looking forward to your thoughts!

@johannschopplich
Collaborator

Hey there, I really like this page – both the idea and the execution!

At a high level, this does exactly what I'd hope for from an "advanced" doc: it gives a clean, byte‑level model of JSON vs TOON for a few canonical shapes, and it lands on the same story the rest of the docs tell informally:

  • TOON's best case is tabular arrays (uniform arrays of objects).
  • Plain objects and primitive arrays also benefit.
  • Arrays‑of‑arrays are a real worst case where TOON adds overhead instead of saving it.

That's very on‑brand for TOON and fits nicely alongside the empirical Benchmarks page. I'd definitely like to merge this in once it's tightened a bit.
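For readers landing here first, a minimal sketch of the kind of comparison the page formalizes (my own example, not one from the page, counting the compact JSON and canonical TOON forms):

```python
# Tabular array (TOON's best case): keys appear once in the header,
# and each uniform object collapses to a CSV-like row.
json_compact = '{"users":[{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]}'
toon_canonical = 'users[2]{id,name}:\n  1,Alice\n  2,Bob'

print(len(json_compact.encode("utf-8")))    # 57 bytes
print(len(toon_canonical.encode("utf-8")))  # 36 bytes
```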

Here are the main things I'd tweak before calling it "done":

1. Make the modelling assumptions explicit

Right now the math implicitly assumes a few things that are reasonable, but they should be spelled out so readers don't overgeneralize:

  • Compact JSON vs pretty JSON
    All the formulas are for compact JSON (no spaces/newlines outside strings), but several examples are pretty‑printed (see the sketch after this list). I'd add a short note in "Methodology" like:

    In this analysis, JSON is assumed to be compact (no spaces or newlines outside strings). Examples are formatted for readability, but byte counts are computed on the compact form.

  • Canonical TOON formatting
    The formulas assume canonical TOON (indent=2, one space after `:`, no spaces after commas in arrays/field lists), while some examples currently include extra spaces (`foo, bar, baz`, `id, qty`). Either:

    • normalize the examples to canonical TOON, or
    • explicitly say "we ignore optional spaces that don't change the structure for the purpose of these counts".
  • ASCII + "nice" strings
    You define |x|_utf8, but later effectively treat it as "character count". I'd add something like:

    For simplicity, we assume keys and structural tokens are ASCII, so byte length equals character count. Non‑ASCII characters affect both formats similarly and don't change the structural conclusions.

    And maybe one sentence that we only look at "clean" keys/values that don't trigger extra quoting beyond the special "looks like number/boolean" case you already call out.

  • Indentation depth
    You mention that nesting gets more expensive due to indentation, but the displayed formula for nested objects doesn't have a depth parameter. I'd either:

    • explicitly say "this formula is for a single nesting level; deeper nesting adds 2 spaces per indented line if we assume indentSize=2", or
    • extend the formula to include depth d and factor 2d bytes of indentation per nested line.

These clarifications don't change your conclusions, they just make it clear what world the formulas live in.
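On the compact-vs-pretty point specifically, the gap is easy to demonstrate; a minimal sketch (the example data is mine):

```python
import json

data = {"tags": ["foo", "bar", "baz"]}

compact = json.dumps(data, separators=(",", ":"))  # no spaces outside strings
pretty = json.dumps(data, indent=2)                # what readable examples show

print(len(compact.encode("utf-8")))  # 28 bytes: what the formulas should count
print(len(pretty.encode("utf-8")))   # 51 bytes: newlines and indentation add up
```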

2. Tone down the "bytes → tokens" and "always better" claims

Two bits of wording I'd soften so they match the rest of the docs:

  • Bytes vs tokens
    Right now it sounds like fewer bytes always means fewer tokens. That's heuristically true, but tokenizers are a bit weird. I'd rephrase to something like:

    Modern LLM tokenizers operate over UTF‑8 bytes, so byte length is a good upper bound and first‑order proxy for token count, even though the mapping isn't exactly linear. For measured token counts, see the Benchmarks page.

  • "TOON objectively reduces the byte-size … except arrays of arrays"
    In practice we already admit there are structures (deeply nested, zero tabular eligibility, etc.) where compact JSON can win. This page is more constrained (nice keys, nice values, specific structure families), so I'd say:

    Within this simplified model, and for the structure families analyzed here, TOON is strictly shorter than compact JSON in all cases except arrays of arrays.

That keeps the math precise without over‑promising.
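On the bytes-vs-tokens point, the proxy is also easy to sanity-check empirically if you ever want to; a sketch assuming the `tiktoken` package is installed (the encoding choice is just an example):

```python
import tiktoken  # assumption: tiktoken is available

enc = tiktoken.get_encoding("cl100k_base")  # example encoding, not prescriptive

for s in ('{"tags":["foo","bar","baz"]}', "tags[3]: foo,bar,baz"):
    print(len(s.encode("utf-8")), "bytes ->", len(enc.encode(s)), "tokens:", s)
# Fewer bytes usually means fewer tokens, but the mapping isn't exactly linear.
```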

3. A couple of math / notation cleanups

Nothing major here, just tightening:

  • Separate "value length" vs "field line length"
    Earlier you define L_json(O) and L_toon(O) in terms of L_(…)(v_i) as value lengths. Later, in the array sections, L_toon(A) starts including the key, brackets and colon on the same line. That makes the recursion a bit hard to follow.

    Easiest fix is probably to say explicitly: "In this section, L_toon(A) refers to the length of the whole field line key[N]: …, not just the array value," and avoid plugging it back into the top‑level recursive equation. That way you don't have to fully refactor the notation.

  • Consistent symbols
    You sometimes write |x|_utf8 and sometimes ||x||_utf8. I'd pick one (the single bars you used in the definitions are fine).

  • Indentation term in tabular arrays
    You currently have 2n for indents, which assumes indentSize=2 and one indent level. I'd either explicitly set indentSize=2:

    Assuming 2‑space indentation, each of the n rows contributes 2 bytes of indentation, i.e., a total of 2n.

    or write it as indentSize * n (see the sketch below).

Again, this is more about readability than correctness.
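For the indentation term, a tiny sketch of the generalization (the helper name and parameters are illustrative only):

```python
# Indentation bytes contributed by n rows; indent_size=2 at depth 1 recovers
# the "2n" term. Assumes every row sits at the same nesting depth.
def indentation_bytes(n: int, indent_size: int = 2, depth: int = 1) -> int:
    return indent_size * depth * n

print(indentation_bytes(3))           # 6: the 2n term for n=3 rows
print(indentation_bytes(3, depth=2))  # 12: the same rows nested one level deeper
```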

4. Examples and numbers lining up

A few of the "Empirical Validation" snippets have pretty formatting that doesn't quite match the byte counts you quote (e.g. `tags[3]: foo, bar, baz` vs `tags[3]: foo,bar,baz`). For someone who tries to count characters by hand, that's a bit confusing (see the quick check after this list).

I'd either:

  • switch all the examples to the exact strings you use in the math (no extra spaces), or
  • explicitly say "formatted for readability, numbers are computed on the compact form".
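For what it's worth, a quick byte check of the two spellings (computed on the UTF-8 encodings):

```python
spaced = "tags[3]: foo, bar, baz"   # as currently rendered in the doc
canonical = "tags[3]: foo,bar,baz"  # the compact form the math counts

print(len(spaced.encode("utf-8")))     # 22 bytes
print(len(canonical.encode("utf-8")))  # 20 bytes
```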

5. Small editorial suggestion

Two small presentational tweaks that might make this page even more approachable:

  • Label it clearly as Informative / Advanced (non‑normative), and link it from the Benchmarks page as "formal byte‑level model" so people know where to find it.
  • Consider moving a short "Key findings" bullet list (the three–four bullets from your Conclusion) up into the Overview so busy readers can get the punchline without reading the derivations.

Overall: I'd really like to merge this. The structure is solid, the conclusions line up with how we already talk about TOON, and it's especially nice that you're honest about arrays‑of‑arrays being worse. With the assumptions, wording, and a couple of small math/formatting nits cleaned up, this will be a great reference for readers who want the "why" behind TOON's efficiency, not just the benchmark charts! 🚀

@mateolafalce
Contributor Author

Thanks for the detailed feedback! I really appreciate the effort you put into checking it.

I’ve updated the PR to address all points. Here is a summary:

  1. Make the modelling assumptions explicit
  • Added a section clarifying assumptions about compact JSON, canonical TOON, ASCII/UTF-8, and byte vs token counts.
  • Explicitly stated the indentation assumptions (2 spaces).
  2. Tone down claims
  • Adjusted the wording to be more precise regarding bytes vs tokens.
    • Clarified that byte length is a "good upper bound and first-order proxy" for tokens.
    • Specified that TOON is shorter "within this simplified model, and for the structure families analyzed here."
  3. Math notation cleanups
  • Standardized on $|x|_{utf8}$.
  • Explicitly defined $L_{toon}(A)$ in the array sections to avoid recursion confusion.
  4. Examples and numbers
  • Updated examples to remove extra spaces, ensuring they strictly match the byte counts derived in the formulas.
  5. Editorial suggestions
  • Added a bulleted list of key findings to the Overview.
  • Added a Related Resources section at the end of the page linking to the formal byte-level model.

Happy to iterate further. Let me know if you spot anything else I might have missed or if you'd like any more tweaks before merging.

@johannschopplich
Collaborator

The overhaul looks good so far. I think it's essentially ready. I have only a few minor polish tweaks before merging:

  1. Conclusion wording (bytes → tokens, scope)
    In the Conclusion you currently say:

    “Since UTF‑8 byte length serves as the base unit for tokenization, the reduction in $L_{\text{toon}}$ directly correlates to a decrease in token count. Therefore, excluding the 'arrays of arrays' edge case, TOON provides an objective improvement…”

    To keep this aligned with the assumptions note and the Benchmarks page, could you soften this slightly, e.g.:

    • "serves as a good upper bound and first‑order proxy for tokenization" instead of "directly correlates", and
    • "…typically leads to fewer tokens for the structure families analyzed in this model…" instead of an unconditional "objective improvemen".

    Also nice would be a brief reminder there that this is "within this simplified model and the structure families considered here," so expectations stay calibrated.

  2. Restrict L_num to non‑negative integers
    In the integer length definition, you define $L_{\text{num}}(n)$ for $n = 0$ and $n > 0$, but don't mention the sign case. Since all your uses are counts/lengths anyway, a one‑liner like:

    "We restrict (n) to non‑negative integers in this analysis."

    right under the formula would close that gap cleanly.
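For reference, the definition being discussed is presumably the standard decimal digit count; with the suggested restriction made explicit, it would read roughly:

```latex
L_{\text{num}}(n) =
\begin{cases}
1 & \text{if } n = 0,\\
\lfloor \log_{10} n \rfloor + 1 & \text{if } n > 0,
\end{cases}
\qquad n \in \mathbb{Z}_{\ge 0}.
```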

  3. Small notation cleanups
    There are just a couple of stray places where the norm bars \|x\| slipped back in:

    • JSON primitive table: all the \|v_i\|_{\text{utf8}} entries.
    • TOON primitive table: same.
    • Summary table, “Tabular Arrays” row: 3+\|k\|.

    Could you change those to |v_i|_{\text{utf8}} / |k| for consistency with the earlier definition?

    Also, in the tabular‑arrays formula:

    \sum_{i=1}^{m} L_{\text{toon}}(k_i)

    would be clearer as:

    \sum_{i=1}^{m} L_{\text{str}}(k_i)

    since you’ve already defined L_str specifically for keys/strings and L_toon is used recursively for full values/structures.

  4. Explicitly mark as informative / non‑normative
    At the top you already say "This is an advanced reference," which is great. 🙌 To avoid any confusion with the spec, a short addition like:

    "This page is informative and non‑normative; it does not change the TOON specification."

    in the NOTE block (or right below it) would make that completely explicit.

@mateolafalce
Contributor Author

Conclusion wording (bytes → tokens, scope)

Soften the conclusion: "Since UTF-8 byte length serves as a good upper bound and first-order proxy for tokenization, the reduction in $L_{\text{toon}}$ typically leads to fewer tokens for the structure families analyzed in this model. Therefore, excluding the 'arrays of arrays' edge case, TOON provides a significant efficiency improvement within this simplified model and the structure families considered here, lowering inference and serialization costs."

Restrict L_num to non‑negative integers

"Let $n \in \mathbb{Z}_{\ge 0}$ be a non-negative integer. The number of bytes required to represent $n$ in decimal format is:"

Small notation cleanups

Fixed $\sum_{i=1}^{m} L_{\text{toon}}(k_i)$ → $\sum_{i=1}^{m} L_{\text{str}}(k_i)$.

I've changed the JSON primitive table, the TOON primitive table, and the Tabular Arrays row of the Summary table to use $|v_i|_{\text{utf8}}$ and $|k|$.

Explicitly mark as informative / non‑normative

Note

This page presents formal mathematical comparisons between TOON and JSON. For practical benchmarks and token counts, see Benchmarks. This is an advanced reference.

This page is informative and non-normative; it does not change the TOON specification.

Thanks for the detailed feedback! 💪

@mateolafalce
Contributor Author

(screenshot: the JSON, TOON, and Summary tables rendering incorrectly)

The rendering seems broken for the JSON, TOON, and Summary tables after the update.

@johannschopplich
Copy link
Collaborator

I see! I think that's coming from the Markdown table parser, not from the math feature itself.

When you changed \|v_i\|_{\text{utf8}} to |v_i|_{\text{utf8}}, the literal | characters are now being interpreted as column separators inside the table. Can you try using $L_{\text{str}}(v_i) = \lvert v_i\rvert_{\text{utf8}} + 2$ and the same for the other formulas?

  • JSON: \lvert v_i\rvert_{\text{utf8}}
  • etc.

@mateolafalce
Contributor Author

Fixed

| Type    | Formula                                                    |
| ------- | ---------------------------------------------------------- |
| String  | $L_{\text{str}}(v_i) = \lvert v_i\rvert_{\text{utf8}} + 2$ |
| Number  | $L_{\text{num}}(v_i) = \lvert v_i\rvert_{\text{utf8}}$     |
| Boolean | $L_{\text{bool}}(v_i) = \lvert v_i\rvert_{\text{utf8}}$    |
| Null    | $L_{\text{null}}(v_i) = \lvert v_i\rvert_{\text{utf8}}$    |

@johannschopplich
Collaborator

@mateolafalce Awesome work. Thank you! I have done some touch-ups to the wording. Could you check out my latest changes, please? Do they fit what you had in mind?

I've also slightly updated the title to: "TOON vs JSON: Byte-Level Efficiency Model" (it is still called "Efficiency Formalization" in the sidebar). I think that captures your core idea better, do you agree?

Also, I've added you as the author of the page. Is that OK for you?

@mateolafalce
Contributor Author

I’ve checked the changes, and they look great. I also agree with the new title; it captures the core concept much more precisely.

Being listed as the author is perfectly fine with me. Thanks @johannschopplich for your help and feedback! 🙌

@johannschopplich
Collaborator

Thank you for your contribution. 🙂

@johannschopplich johannschopplich merged commit 412ebcb into toon-format:main Nov 28, 2025