Add Efficiency Formalization #221
Conversation
Hey there, I really like this page – both the idea and the execution! At a high level, this does exactly what I'd hope for from an "advanced" doc: it gives a clean, byte‑level model of JSON vs TOON for a few canonical shapes, and it lands on the same story the rest of the docs tell informally.
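Just to make sure we're talking about the same model, here's the kind of formula I have in mind for the canonical "uniform array of objects" shape (my own notation, not necessarily the page's; it assumes minified JSON, UTF-8 throughout, and a named TOON tabular block with a `key[N]{...}:` header, 2-space indentation, and LF newlines):

```latex
% Illustrative byte model for a uniform array of N rows, each with the same F fields.
%   K    = total bytes of the F field names (unquoted)
%   V_i  = serialized bytes of row i's values in the respective format
%   d(N) = number of decimal digits in the length marker N
\begin{align*}
% Minified JSON [{"k_1":v,...},...]: per row, 2 braces + (K + 3F) for quoted keys
% and colons + (F - 1) commas + V_i; plus 2 outer brackets and (N - 1) commas.
B_{\text{JSON}} &= 2 + (N - 1) + \sum_{i=1}^{N} \bigl(K + 4F + 1 + V_i\bigr)
                 = N (K + 4F + 1) + \sum_{i=1}^{N} V_i + N + 1 \\
% TOON tabular block "key[N]{k_1,...,k_F}:" plus one 2-space-indented row per element.
B_{\text{TOON}} &\approx \underbrace{\lvert \text{key} \rvert + d(N) + K + F + 5}_{\text{header line}}
                 + \sum_{i=1}^{N} \bigl(V_i + F + 2\bigr)
\end{align*}
```

The leading-order story is that JSON repeats the K + 4F + 1 bytes of keys, quotes, colons, and braces on every row, while TOON pays K once in the header plus roughly F + 2 bytes of commas, indentation, and newline per row. Note that V_i stays format-specific (JSON quotes strings, TOON often doesn't), which is exactly the kind of assumption I'd like spelled out in point 1 below.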
That's very on‑brand for TOON and fits nicely alongside the empirical Benchmarks page. I'd definitely like to merge this in once it's tightened a bit. Here are the main things I'd tweak before calling it "done":

1. Make the modelling assumptions explicit

Right now the math implicitly assumes a few things that are reasonable, but they should be spelled out so readers don't overgeneralize:
These clarifications don't change your conclusions; they just make it clear what world the formulas live in.

2. Tone down the "bytes → tokens" and "always better" claims

Two bits of wording I'd soften so they match the rest of the docs:
That keeps the math precise without over‑promising.

3. A couple of math / notation cleanups

Nothing major here, just tightening:
Again, this is more about readability than correctness.

4. Examples and numbers lining up

A few of the "Empirical Validation" snippets have pretty formatting that doesn't quite match the byte counts you quote (e.g. …); a quick way to recompute the exact counts is sketched below. I'd either:
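Whichever option we go with, here's a minimal sketch of how the quoted counts could be regenerated straight from the snippets so they can't drift; the `encode` import from `@toon-format/toon` and the sample payload are illustrative, so adjust them to whatever the page's snippets actually use:

```ts
import { encode } from "@toon-format/toon"; // assumed export name; adjust to the docs' actual import

// Illustrative payload matching the "uniform array of objects" shape the page models.
const data = {
  items: [
    { id: 1, name: "Alice", role: "admin" },
    { id: 2, name: "Bob", role: "user" },
  ],
};

// UTF-8 byte length, which is the quantity the page's formulas count.
const utf8Bytes = (s: string) => new TextEncoder().encode(s).length;

const json = JSON.stringify(data); // minified JSON, as the model assumes
const toon = encode(data);         // TOON serialization of the same data

console.log("JSON bytes:", utf8Bytes(json));
console.log("TOON bytes:", utf8Bytes(toon));
```

Printing the numbers this way (or generating them at build time) keeps the quoted counts and the pretty-printed examples in sync.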
5. Small editorial suggestions

Two small presentational tweaks that might make this page even more approachable:
Overall: I'd really like to merge this. The structure is solid, the conclusions line up with how we already talk about TOON, and it's especially nice that you're honest about arrays‑of‑arrays being worse. With the assumptions, wording, and a couple of small math/formatting nits cleaned up, this will be a great reference for readers who want the "why" behind TOON's efficiency, not just the benchmark charts! 🚀
Thanks for the detailed feedback! I really appreciate the effort you put into checking it. I’ve updated the PR to address all points. Here is a summary:
Happy to iterate further. Let me know if you spot anything else I might have missed or if you'd like any more tweaks before merging.
The overhaul looks good so far. I think it's essentially ready. I have only a few minor polish tweaks before merging:
Soften the conclusion:

> Since UTF-8 byte length serves as a good upper bound and first-order proxy for tokenization, the reduction in …

> Let …

Fix: I've changed the JSON primitive table, the TOON primitive table, and Tabular Arrays in the Summary table with …

Note:

> This page presents formal mathematical comparisons between TOON and JSON. For practical benchmarks and token counts, see Benchmarks. This is an advanced reference. This page is informative and non-normative; it does not change the TOON specification.

Thanks for the detailed feedback! 💪
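For what it's worth, here is the kind of quick sanity check I had in mind for the softened "first-order proxy" wording. It is only a sketch; both the `@toon-format/toon` export and the `gpt-tokenizer` dependency are illustrative rather than something this PR adds:

```ts
import { encode as toonEncode } from "@toon-format/toon"; // assumed export name
import { encode as tokenize } from "gpt-tokenizer";       // illustrative tokenizer dependency

const data = {
  users: [
    { id: 1, name: "Alice", role: "admin" },
    { id: 2, name: "Bob", role: "user" },
  ],
};

const utf8Bytes = (s: string) => new TextEncoder().encode(s).length;

const json = JSON.stringify(data);
const toon = toonEncode(data);

// Compare the byte reduction with the token reduction for the same payload.
const byteReduction = 1 - utf8Bytes(toon) / utf8Bytes(json);
const tokenReduction = 1 - tokenize(toon).length / tokenize(json).length;

console.log(`bytes:  ${(byteReduction * 100).toFixed(1)}% smaller`);
console.log(`tokens: ${(tokenReduction * 100).toFixed(1)}% smaller`);
```

If the two percentages roughly track each other, that is all the softened wording needs to claim.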
I see! I think that's coming from the Markdown table parser, not from the math feature itself. When you changed
Fixed
@mateolafalce Awesome work. Thank you! I have done some touch-ups to the wording. Could you check out my latest changes, please? Do they match your idea? I've also slightly updated the title to: "TOON vs JSON: Byte-Level Efficiency Model" (it is still called "Efficiency Formalization" in the sidebar). I think that captures your core idea better; do you agree? Also, I've added you as the author of the page. Is that OK with you?
I've checked the changes, and they look great. I also agree with the new title; it captures the core concept much more precisely. Being listed as the author is perfectly fine with me. Thanks @johannschopplich for your help and feedback! 🙌
Thank you for your contribution. 🙂 |
Work related to #187
Looking forward to your thoughts!