🔢 Tokenizer — Universal AI Token Counter

Fast, accurate, 100% client‑side token counting for GPT, Llama, Mistral, Qwen, DeepSeek, Phi, and more — all in your browser. No API keys. No backend. Just drop text or files and get counts, context usage, and optional cost estimates instantly.


🔗 Live Demo: https://nepomuceno.github.io/tokenizer/

Deploys automatically on push to main.


✨ Features

| Area | Highlights |
| --- | --- |
| Tokenization | GPT (tiktoken), Llama, Mistral, Qwen, DeepSeek, and Phi via HuggingFace tokenizers |
| File Ingestion | PDF, Markdown, TXT, DOCX, CSV — fully client‑side extraction |
| Performance | Web Worker offload + lazy WASM loading; handles large files in slices |
| Context Awareness | Shows the model context limit, a progress bar, and an over‑limit warning |
| Cost Estimation | Editable per‑1K price; totals by file and in aggregate |
| Batch Mode | Drag & drop multiple files with per‑file and total counts |
| Token Preview | Color‑coded token splits for insight & debugging |
| Offline‑Ready | Static build suitable for GitHub Pages; no server dependency |
| UX & Accessibility | Keyboard friendly, ARIA labels, dark mode planned |

🧠 How It Works

  1. User pastes text or drops files.
  2. Parsers extract plain text (PDF via pdfjs-dist, DOCX via mammoth, Markdown via remark, raw readers for TXT/CSV).
  3. Text is chunked (100–500KiB) and sent to a Web Worker.
  4. The worker loads the appropriate tokenizer adapter:
    • @dqbd/tiktoken for OpenAI‑style models (WASM)
    • @huggingface/tokenizers JSON for Llama/Mistral/Qwen/Phi/DeepSeek
  5. The adapter counts tokens and streams progress back.
  6. UI displays per‑file stats, model context usage, optional price math, and token preview.

All processing stays in the browser. Nothing is uploaded.
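
As a simplified sketch of steps 3–5, the main thread might slice the text and accumulate partial counts streamed back from the worker. The message shapes and `CHUNK_SIZE` below are illustrative assumptions, not the app's actual protocol:

```ts
// Illustrative chunk-and-count flow; field names are assumptions.
const CHUNK_SIZE = 256 * 1024; // ~256K characters per slice (the app slices by KiB)

export function countTokens(text: string, model: string): Promise<number> {
  const worker = new Worker(new URL('./tokenize.worker.ts', import.meta.url), {
    type: 'module',
  });
  return new Promise((resolve, reject) => {
    let total = 0;
    worker.onmessage = (e: MessageEvent) => {
      if (e.data.type === 'progress') total += e.data.count; // streamed partial count
      if (e.data.type === 'done') {
        worker.terminate();
        resolve(total);
      }
    };
    worker.onerror = reject;
    // Send the text in slices so the worker can report progress incrementally.
    for (let i = 0; i < text.length; i += CHUNK_SIZE) {
      worker.postMessage({ type: 'chunk', model, text: text.slice(i, i + CHUNK_SIZE) });
    }
    worker.postMessage({ type: 'flush' }); // ask the worker to emit 'done'
  });
}
```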


🗂 Model Registry

Models are defined centrally in src/modelRegistry.ts with: key, adapter type, resource (internal tokenizer name or JSON path), context window, and optional pricing reference. This keeps logic declarative and avoids hard‑coding throughout the UI.
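
For illustration, an entry might look like the following. The field names and sample values are assumptions based on the description above; check src/modelRegistry.ts for the real shape:

```ts
// Hypothetical shape of a registry entry, following the fields listed above.
interface ModelEntry {
  key: string;                  // e.g. 'gpt-4o'
  adapter: 'tiktoken' | 'hf';   // which tokenizer backend to use
  resource: string;             // internal encoding name or tokenizer JSON path
  contextWindow: number;        // model context limit in tokens
  pricePer1K?: number;          // optional cost per 1K tokens
}

const registry: Record<string, ModelEntry> = {
  'gpt-4o': { key: 'gpt-4o', adapter: 'tiktoken', resource: 'o200k_base', contextWindow: 128_000 },
  'llama-3': { key: 'llama-3', adapter: 'hf', resource: 'tokenizers/llama-3.json', contextWindow: 8_192 },
};
```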


🚀 Quick Start (Local)

Prereqs: Bun (https://bun.sh) installed.

```sh
git clone https://github.com/Nepomuceno/tokenizer.git
cd tokenizer
bun install
bun run dev
```

Other scripts:

```sh
bun run build     # production build
bun run preview   # preview the production build
bun run test      # run the test suite
bun run lint      # lint the codebase
```

🧪 Testing Philosophy

Minimal, focused Vitest tests cover tokenizer adapters and integration boundaries. When adding a new model or parser:

  1. Add adapter tests (src/test/*).
  2. Add a small fixture (avoid large binaries in repo).
  3. Ensure a deterministic count against a known sample (see the sketch below).
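
As an example of point 3, a minimal Vitest sketch might look like this. The import path and the expected count are placeholders, not verified values from the repo:

```ts
import { describe, expect, it } from 'vitest';
// Hypothetical adapter entry point; the real module lives under src/.
import { countTokens } from '../tokenizer';

describe('gpt tokenizer adapter', () => {
  it('returns a stable count for a known fixture', async () => {
    const fixture = 'The quick brown fox jumps over the lazy dog.';
    // Pin the known-good value once it has been verified by hand.
    await expect(countTokens(fixture, 'gpt-4o')).resolves.toBe(10);
  });
});
```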

📐 Architecture Overview

```
React UI (components)  --->  Tokenizer Factory  --->  Adapter (tiktoken | hf)  --->  WASM / JSON tokenizer data
        |                        ^
        |                        |
File Parsers (pdf/markdown/docx/text) ----> Web Worker (tokenize.worker.ts) <---- Model Registry (ctx + meta)
```

Key principles:

  • Keep UI pure & declarative.
  • Lazy load heavyweight assets (WASM / tokenizer JSON).
  • Use transferable data (ArrayBuffer) when scaling (see the sketch below).
  • Avoid blocking the main thread.
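
As a minimal illustration of the transferable-data point (the worker path and message shape are assumptions):

```ts
// Sketch: hand a large payload to the worker without copying it.
const worker = new Worker(new URL('./tokenize.worker.ts', import.meta.url), {
  type: 'module',
});
const extractedText = '...'; // imagine several MB of parser output
const bytes = new TextEncoder().encode(extractedText); // Uint8Array over an ArrayBuffer
// Listing the buffer as a transferable moves ownership to the worker;
// bytes.buffer is detached on the main thread after this call.
worker.postMessage({ type: 'chunk', buffer: bytes.buffer }, [bytes.buffer]);
```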

🛠 Tech Stack

| Category | Tooling |
| --- | --- |
| Framework | React + TypeScript + Vite |
| Runtime / PM | Bun |
| Styling | Tailwind CSS |
| Tokenizers | @dqbd/tiktoken, HuggingFace tokenizer JSONs |
| Parsing | pdfjs-dist, remark + remark-parse, mammoth |
| Testing | Vitest + Testing Library |
| Deployment | GitHub Pages via Actions |

🧩 Adding a New Model

  1. Drop tokenizer JSON in public/tokenizers/ (if HF style).
  2. Add an entry to modelRegistry.ts with the key + context size (example below).
  3. (Optional) Add pricing metadata.
  4. Write/extend adapter tests.
  5. Build & verify counts.
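
Reusing the hypothetical ModelEntry shape sketched in the Model Registry section, step 2 might look like this. The key, path, context size, and price are illustrative, not verified values:

```ts
// Assumes the `registry` and ModelEntry sketched earlier in this README.
registry['qwen-2.5'] = {
  key: 'qwen-2.5',
  adapter: 'hf',                        // HuggingFace-style tokenizer JSON
  resource: 'tokenizers/qwen-2.5.json', // dropped into public/tokenizers/ (step 1)
  contextWindow: 32_768,                // illustrative context size
  pricePer1K: 0.0004,                   // optional pricing metadata (step 3)
};
```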

📦 Deployment Notes (GitHub Pages)

The app is deployed under a subpath (https://<user>.github.io/tokenizer/). To keep static assets resolving correctly:

  • vite.config.ts uses base: './' so built asset references are relative.
  • All favicon / manifest links in index.html use ./ prefixes (no leading /).
  • public/manifest.json sets icons[].src, start_url, and scope to ./ forms.
  • The tiktoken adapter builds its local encodings path from import.meta.env.BASE_URL (see the sketch below), so it fetches ./encodings/*.tiktoken instead of hitting the domain root (/encodings); this avoids 404 and CORS errors when the app is hosted in a subfolder.
  • Avoid introducing absolute leading-slash asset paths unless you change deployment to the domain root.

If you fork and deploy at another subpath, no changes are needed; if you deploy at root, everything still works because relative paths resolve there as well.
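
A minimal sketch of that path-building; the helper name is illustrative:

```ts
// With base: './' in vite.config.ts, import.meta.env.BASE_URL is './',
// so the fetch resolves relative to wherever the app is served from.
function encodingUrl(name: string): string {
  return `${import.meta.env.BASE_URL}encodings/${name}.tiktoken`;
}

// e.g. encodingUrl('cl100k_base') -> './encodings/cl100k_base.tiktoken'
```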

🤝 Contributing

Contributions welcome! See CONTRIBUTING.md for detailed setup, guidelines, and the PR checklist.

Please follow Conventional Commits where practical (e.g., feat: add qwen tokenizer).


🔐 Security

Report vulnerabilities privately via GitHub Security Advisories or the email listed in SECURITY.md. Avoid opening public issues for sensitive disclosures.


☕️ Support

If this project saves you time, you can buy me a coffee:

https://buymeacoffee.com/gabrielbici

Sharing the repo also helps a ton! 💙


📜 License

Licensed under the MIT License.


🙏 Acknowledgements

Inspired by community token counters and upstream tokenizer libraries. Thanks to:

  • @dqbd/tiktoken
  • HuggingFace tokenizers
  • Open‑source parser ecosystems (pdfjs, remark, mammoth)

🗺 Roadmap (High‑Level)

  • Dark mode toggle
  • PWA / offline caching
  • Custom user‑uploaded tokenizer JSON
  • Cost presets per vendor
  • CSV Export of batch results

Have an idea? Open a discussion or PR!


Built with ❤️ for the AI developer community.
