Fast, accurate, 100% client‑side token counting for GPT, Llama, Mistral, Qwen, DeepSeek, Phi, and more — all in your browser. No API keys. No backend. Just drop text or files and get counts, context usage, and optional cost estimates instantly.
🔗 Live Demo: https://nepomuceno.github.io/tokenizer/
Deploys automatically on push to main.
| Area | Highlights |
|---|---|
| Tokenization | Supports GPT (tiktoken), Llama, Mistral, Qwen, DeepSeek, Phi via HuggingFace tokenizers |
| File Ingestion | PDF, Markdown, TXT, DOCX, CSV — fully client‑side extraction |
| Performance | Web Worker offload + lazy WASM loading; handles large files in slices |
| Context Awareness | Shows model context limit + progress bar + over‑limit warning |
| Cost Estimation | Editable per‑1K price; totals by file and aggregate |
| Batch Mode | Drag & drop multiple files with per‑file + total counts |
| Token Preview | Color‑coded token splits for insight & debugging |
| Offline‑Ready | Static build suitable for GitHub Pages; no server dependency |
| UX & Accessibility | Keyboard friendly, ARIA labels, dark mode planned |
- User pastes text or drops files.
- Parsers extract plain text (PDF via `pdfjs-dist`, DOCX via `mammoth`, Markdown via `remark`, raw readers for TXT/CSV).
- Text is chunked (100–500 KiB) and sent to a Web Worker.
- The worker loads the appropriate tokenizer adapter:
  - `@dqbd/tiktoken` for OpenAI‑style models (WASM)
  - `@huggingface/tokenizers` JSON for Llama/Mistral/Qwen/Phi/DeepSeek
- The adapter counts tokens and streams progress back.
- UI displays per‑file stats, model context usage, optional price math, and token preview.
All processing stays in the browser. Nothing is uploaded.
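As a rough sketch of that hand‑off (the message shapes and helper below are illustrative, not the app's actual API), the main thread slices the text and streams chunks to the worker:

```ts
// Sketch only: message shapes and names are illustrative, not the app's real API.
const CHUNK_SIZE = 256 * 1024; // within the 100–500 KiB slicing range

function countTokensInWorker(text: string, model: string): Promise<number> {
  const worker = new Worker(new URL('./tokenize.worker.ts', import.meta.url), {
    type: 'module',
  });
  return new Promise((resolve, reject) => {
    let total = 0;
    worker.onmessage = (e: MessageEvent) => {
      if (e.data.type === 'progress') total += e.data.count; // partial counts stream back
      if (e.data.type === 'done') {
        resolve(total);
        worker.terminate();
      }
    };
    worker.onerror = reject;
    // Naive slicing by UTF-16 code units; a chunk boundary can split a token,
    // which a real implementation would compensate for.
    for (let i = 0; i < text.length; i += CHUNK_SIZE) {
      worker.postMessage({ type: 'chunk', model, text: text.slice(i, i + CHUNK_SIZE) });
    }
    worker.postMessage({ type: 'flush' });
  });
}
```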
Models are defined centrally in `src/modelRegistry.ts` with: key, adapter type, resource (internal tokenizer name or JSON path), context window, and optional pricing reference. This keeps the logic declarative and avoids hard‑coding model details throughout the UI.
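A minimal sketch of the shape such an entry might take (the exact field names in `src/modelRegistry.ts` may differ):

```ts
// Illustrative only; consult src/modelRegistry.ts for the real definitions.
type AdapterKind = 'tiktoken' | 'hf';

interface ModelEntry {
  key: string;           // unique model id used by the UI
  adapter: AdapterKind;  // which tokenizer adapter handles this model
  resource: string;      // internal tokenizer name or tokenizer JSON path
  contextWindow: number; // max tokens the model accepts
  pricePer1K?: number;   // optional pricing reference, editable in the UI
}

const gpt4o: ModelEntry = {
  key: 'gpt-4o',
  adapter: 'tiktoken',
  resource: 'o200k_base',
  contextWindow: 128_000,
  pricePer1K: 0.005,
};
```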
Prereqs: Bun (https://bun.sh) installed.
```bash
git clone https://github.com/Nepomuceno/tokenizer.git
cd tokenizer
bun install
bun run dev
```

Build:

```bash
bun run build
```

Preview production build:

```bash
bun run preview
```

Run tests:

```bash
bun run test
```

Lint:

```bash
bun run lint
```

Minimal, focused Vitest tests cover tokenizer adapters and integration boundaries. When adding a new model or parser:
- Add adapter tests (`src/test/*`).
- Add a small fixture (avoid large binaries in the repo).
- Ensure a deterministic count against a known sample (see the example below).
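A deterministic-count test might look like the following (the `countTokens` helper and its import path are assumptions for illustration):

```ts
import { describe, expect, it } from 'vitest';
// Hypothetical import path; the real adapter factory lives under src/.
import { countTokens } from '../tokenizers/factory';

describe('gpt tokenizer adapter', () => {
  it('yields a stable count for a known sample', async () => {
    const sample = 'The quick brown fox jumps over the lazy dog.';
    // Pin whatever the adapter yields today so a tokenizer upgrade that
    // changes counts fails loudly instead of drifting silently.
    expect(await countTokens(sample, 'gpt-4o')).toBe(10);
  });
});
```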
```
React UI (components) ---> Tokenizer Factory ---> Adapter (tiktoken | hf) ---> WASM / JSON tokenizer data
      |                                                ^
      |                                                |
File Parsers (pdf/markdown/docx/text) ----> Web Worker (tokenize.worker.ts) <---- Model Registry (ctx + meta)
```
Key principles:
- Keep UI pure & declarative.
- Lazy load heavyweight assets (WASM / tokenizer JSON).
- Use transferable data (ArrayBuffer) when scaling (see the sketch below).
- Avoid blocking the main thread.
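The last two points combine naturally: load the WASM module on demand and hand large payloads to the worker as transferables so they are moved rather than copied. A hedged sketch (names illustrative):

```ts
// Lazy-load the tiktoken WASM module only the first time it is needed.
let tiktokenPromise: Promise<typeof import('@dqbd/tiktoken')> | null = null;
const loadTiktoken = () => (tiktokenPromise ??= import('@dqbd/tiktoken'));

// Transfer a large payload to the worker without structured-clone copying.
function sendToWorker(worker: Worker, largeText: string): void {
  const bytes = new TextEncoder().encode(largeText);
  // The transfer list moves ownership: bytes.buffer is detached here afterwards.
  worker.postMessage({ type: 'chunk', buffer: bytes.buffer }, [bytes.buffer]);
}
```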
| Category | Tooling |
|---|---|
| Framework | React + TypeScript + Vite |
| Runtime / PM | Bun |
| Styling | Tailwind CSS |
| Tokenizers | @dqbd/tiktoken, HuggingFace tokenizer JSONs |
| Parsing | pdfjs-dist, remark + remark-parse, mammoth |
| Testing | Vitest + Testing Library |
| Deployment | GitHub Pages via Actions |
- Drop the tokenizer JSON in `public/tokenizers/` (if HF style).
- Add an entry to `modelRegistry.ts` with key + ctx size (example below).
- (Optional) Add pricing metadata.
- Write/extend adapter tests.
- Build & verify counts.
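Concretely, a new HF-style entry might look like this (field names follow the registry sketch above; values are illustrative):

```ts
// Hypothetical new entry; adjust values to the model you are adding.
const llama31: ModelEntry = {
  key: 'llama-3.1-8b',
  adapter: 'hf',
  resource: './tokenizers/llama-3.1.json', // dropped into public/tokenizers/
  contextWindow: 131_072,
  // pricePer1K intentionally omitted; add vendor pricing if known
};
```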
The app is deployed under a subpath (https://<user>.github.io/tokenizer/). To keep static assets resolving correctly:
- `vite.config.ts` uses `base: './'` so built asset references are relative.
- All favicon / manifest links in `index.html` use `./` prefixes (no leading `/`).
- `public/manifest.json` sets `icons[].src`, `start_url`, and `scope` to `./` forms.
- The tiktoken adapter builds its local encodings path from `import.meta.env.BASE_URL` so it fetches `./encodings/*.tiktoken` instead of hitting the domain root (`/encodings`), avoiding 404 + CORS errors when hosted in a subfolder (see the sketch below).
- Avoid introducing absolute leading‑slash asset paths unless you change deployment to the domain root.
If you fork and deploy at another subpath, no changes are needed; if you deploy at root, everything still works because relative paths resolve there as well.
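The `BASE_URL` trick in the tiktoken adapter boils down to something like this (the helper below is a sketch; the exact code in the repo may differ):

```ts
// Resolve encoding files relative to the deployed base instead of the domain root.
// With Vite's base set to './', import.meta.env.BASE_URL is './', so the fetch
// works whether the app lives at the root or under /tokenizer/.
async function fetchEncoding(name: string): Promise<Uint8Array> {
  const url = `${import.meta.env.BASE_URL}encodings/${name}.tiktoken`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Failed to load encoding: ${url}`);
  return new Uint8Array(await res.arrayBuffer());
}
```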
Contributions welcome! See CONTRIBUTING.md for detailed setup, guidelines, and PR checklist.
Please follow conventional commits where practical (e.g., feat: add qwen tokenizer).
Report vulnerabilities privately via GitHub Security Advisories or the email listed in SECURITY.md. Avoid opening public issues for sensitive disclosures.
If this project saves you time, you can buy me a coffee:
https://buymeacoffee.com/gabrielbici
Sharing the repo also helps a ton! 💙
Licensed under the MIT License.
Inspired by community token counters and upstream tokenizer libraries. Thanks to:
- `@dqbd/tiktoken`
- HuggingFace `tokenizers`
- Open‑source parser ecosystems (pdfjs, remark, mammoth)
- Dark mode toggle
- PWA / offline caching
- Custom user‑uploaded tokenizer JSON
- Cost presets per vendor
- CSV Export of batch results
Have an idea? Open a discussion or PR!