tokenx

Fast and lightweight token count estimation for any LLM without requiring a full tokenizer. This library provides quick approximations that are good enough for most use cases while keeping your bundle size minimal.

For advanced use cases requiring precise token counts, please use a full tokenizer like gpt-tokenizer.

Benchmarks

The following table shows the accuracy of the token count approximation for different input texts:

Description	Actual GPT Token Count	Estimated Token Count	Token Count Deviation
Short English text	10	11	10.00%
German text with umlauts	48	49	2.08%
Metamorphosis by Franz Kafka (English)	31796	35705	12.29%
Die Verwandlung by Franz Kafka (German)	35309	35069	0.68%
道德經 by Laozi (Chinese)	11712	12059	2.96%
TypeScript ES5 Type Declarations (~ 4000 loc)	49304	52615	6.72%

Features

⚡ 94% accuracy compared to full tokenizers (see benchmarks below)
📦 Just 2kB bundle size with zero dependencies
🌍 Multi-language support with configurable language rules
🗣️ Built-in support for accented characters (German, French, Spanish, etc.)
🔧 Configurable and extensible

Installation

Run the following command to add tokenx to your project.

# npm
npm install tokenx

# pnpm
pnpm add tokenx

# yarn
yarn add tokenx

Usage

import { estimateTokenCount, isWithinTokenLimit, splitByTokens } from 'tokenx'

const text = 'Your text goes here.'

// Estimate the number of tokens in the text
const estimatedTokens = estimateTokenCount(text)
console.log(`Estimated token count: ${estimatedTokens}`)

// Check if text is within a specific token limit
const tokenLimit = 1024
const withinLimit = isWithinTokenLimit(text, tokenLimit)
console.log(`Is within token limit: ${withinLimit}`)

// Split text into token-based chunks
const chunks = splitByTokens(text, 100)
console.log(`Split into ${chunks.length} chunks`)

// Use custom options for different languages or models
const customOptions = {
  defaultCharsPerToken: 4, // More conservative estimation
  languageConfigs: [
    { pattern: /[你我他]/g, averageCharsPerToken: 1.5 }, // Custom Chinese rule
  ]
}

const customEstimate = estimateTokenCount(text, customOptions)
console.log(`Custom estimate: ${customEstimate}`)

API

`estimateTokenCount`

Estimates the number of tokens in a given input string using heuristic rules that work across multiple languages and text types.

Usage:

const estimatedTokens = estimateTokenCount('Hello, world!')

// With custom options
const customEstimate = estimateTokenCount('Bonjour le monde!', {
  defaultCharsPerToken: 4,
  languageConfigs: [
    { pattern: /[éèêëàâîï]/i, averageCharsPerToken: 3 }
  ]
})

Type Declaration:

function estimateTokenCount(
  text?: string,
  options?: TokenEstimationOptions
): number

interface TokenEstimationOptions {
  /** Default average characters per token when no language-specific rule applies */
  defaultCharsPerToken?: number
  /** Custom language configurations to override defaults */
  languageConfigs?: LanguageConfig[]
}

interface LanguageConfig {
  /** Regular expression to detect the language */
  pattern: RegExp
  /** Average number of characters per token for this language */
  averageCharsPerToken: number
}

`isWithinTokenLimit`

Checks if the estimated token count of the input is within a specified token limit.

Usage:

const withinLimit = isWithinTokenLimit('Check this text against a limit', 100)
// With custom options
const customCheck = isWithinTokenLimit('Text', 50, { defaultCharsPerToken: 3 })

Type Declaration:

function isWithinTokenLimit(
  text: string,
  tokenLimit: number,
  options?: TokenEstimationOptions
): boolean

`sliceByTokens`

Extracts a portion of text based on token positions, similar to Array.prototype.slice(). Supports both positive and negative indices.

Usage:

const text = 'Hello, world! This is a test sentence.'

const firstThree = sliceByTokens(text, 0, 3)
const fromSecond = sliceByTokens(text, 2)
const lastTwo = sliceByTokens(text, -2)
const middle = sliceByTokens(text, 1, -1)

// With custom options
const customSlice = sliceByTokens(text, 0, 5, {
  defaultCharsPerToken: 4,
  languageConfigs: [
    { pattern: /[éèêëàâîï]/i, averageCharsPerToken: 3 }
  ]
})

Type Declaration:

function sliceByTokens(
  text: string,
  start?: number,
  end?: number,
  options?: TokenEstimationOptions
): string

Parameters:

text - The input text to slice
start - The start token index (inclusive). If negative, treated as offset from end. Default: 0
end - The end token index (exclusive). If negative, treated as offset from end. If omitted, slices to the end
options - Token estimation options (same as estimateTokenCount)

Returns:

The sliced text portion corresponding to the specified token range.

`splitByTokens`

Splits text into chunks based on token count. Useful for chunking documents for RAG, batch processing, or staying within context windows.

Usage:

const text = 'Long text that needs to be split into smaller chunks...'

// Basic splitting
const chunks = splitByTokens(text, 100)
console.log(`Split into ${chunks.length} chunks`)

// With overlap for semantic continuity
const overlappedChunks = splitByTokens(text, 100, { overlap: 10 })

// With custom options
const customChunks = splitByTokens(text, 50, {
  defaultCharsPerToken: 4,
  overlap: 5
})

Type Declaration:

interface SplitByTokensOptions extends TokenEstimationOptions {
  /** Number of tokens to overlap between consecutive chunks (default: 0) */
  overlap?: number
}

function splitByTokens(
  text: string,
  tokensPerChunk: number,
  options?: SplitByTokensOptions
): string[]

Parameters:

text - The input text to split
tokensPerChunk - Maximum number of tokens per chunk
options - Token estimation options with optional overlap

Returns:

An array of text chunks, each containing approximately tokensPerChunk tokens.

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
.github/workflows		.github/workflows
.vscode		.vscode
docs		docs
scripts		scripts
src		src
test		test
.editorconfig		.editorconfig
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
eslint.config.mjs		eslint.config.mjs
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
pnpm-workspace.yaml		pnpm-workspace.yaml
tsconfig.json		tsconfig.json
tsdown.config.ts		tsdown.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

tokenx

Benchmarks

Features

Installation

Usage

API

`estimateTokenCount`

`isWithinTokenLimit`

`sliceByTokens`

`splitByTokens`

License

About

Uh oh!

Releases 15

Packages

Used by 395

Contributors 2

Uh oh!

Languages

License

johannschopplich/tokenx

Folders and files

Latest commit

History

Repository files navigation

tokenx

Benchmarks

Features

Installation

Usage

API

estimateTokenCount

isWithinTokenLimit

sliceByTokens

splitByTokens

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 15

Packages 0

Used by 395

Contributors 2

Uh oh!

Languages

`estimateTokenCount`

`isWithinTokenLimit`

`sliceByTokens`

`splitByTokens`

Packages