Word2Vec Explorer

A from-scratch TypeScript implementation of Word2Vec Skip-Gram with Negative Sampling (Mikolov et al., 2013).


What is Word2Vec?

Word2Vec learns a dense vector (embedding) for every word in a corpus so that words used in similar contexts end up close together in vector space.
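"Close together" is usually measured with cosine similarity. The sketch below is a minimal standalone version, not the module's own `cosineSimilarity` from `math.ts`; the three toy vectors are made-up numbers chosen only for illustration, not trained embeddings.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // define similarity with a zero vector as 0
}

// Toy 3-dimensional embeddings (hypothetical values, not trained).
const king = [0.9, 0.1, 0.3];
const queen = [0.8, 0.2, 0.4];
const banana = [-0.2, 0.9, -0.5];

const simKingQueen = cosineSimilarity(king, queen);   // close to 1
const simKingBanana = cosineSimilarity(king, banana); // negative
```

Words that share contexts should end up with a similarity near 1, while unrelated words drift toward 0 or below.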

The skip-gram variant works like this: for every word in the text, look at the surrounding words (the "context window") and train a model to predict those neighbors. A full softmax over the entire vocabulary is expensive because its cost grows with vocabulary size, so we use negative sampling instead: for each real (word, context) pair, we sample k random "negative" words and train the model to distinguish the real pair from the fake ones.
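To make the update concrete, here is a minimal sketch of one stochastic gradient step for a single (word, context) pair and its negatives. It follows the standard SGNS loss, -log σ(w·c_pos) - Σ log σ(-w·c_neg). The names `sgnsStep`, `W`, `C`, and `Matrix` are assumptions for this example and may not match how `word2vec.ts` organizes its training loop.

```typescript
type Matrix = Float64Array[]; // one embedding row per vocabulary word

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

function dot(a: Float64Array, b: Float64Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// One SGD step. W holds target embeddings, C holds context embeddings.
// Returns the loss measured before the update is applied.
function sgnsStep(
  W: Matrix, C: Matrix,
  target: number, context: number, negatives: number[],
  eta: number,
): number {
  const w = W[target];
  const gradW = new Float64Array(w.length);
  // The real context word gets label 1, every sampled negative gets label 0.
  const samples = [{ id: context, label: 1 }, ...negatives.map((id) => ({ id, label: 0 }))];
  let loss = 0;
  for (const { id, label } of samples) {
    const c = C[id];
    const p = sigmoid(dot(w, c));
    loss -= Math.log(Math.max(label === 1 ? p : 1 - p, 1e-12));
    const g = p - label; // derivative of the loss with respect to w·c
    for (let i = 0; i < w.length; i++) {
      gradW[i] += g * c[i];   // accumulate the gradient for the target vector
      c[i] -= eta * g * w[i]; // update the context vector immediately
    }
  }
  for (let i = 0; i < w.length; i++) w[i] -= eta * gradW[i];
  return loss;
}

// Toy run: 4 words, 8 dimensions, word 0 with true context 1 and negatives 2, 3.
const d = 8;
const W: Matrix = [];
const C: Matrix = [];
for (let r = 0; r < 4; r++) {
  const row = new Float64Array(d);
  for (let j = 0; j < d; j++) row[j] = 0.1 * Math.sin(r * d + j + 1); // small deterministic init
  W.push(row);
  C.push(new Float64Array(d)); // context vectors start at zero
}
const losses: number[] = [];
for (let step = 0; step < 50; step++) losses.push(sgnsStep(W, C, 0, 1, [2, 3], 0.5));
```

Repeating the step on the same pair drives the loss down from its starting value of 3·ln 2, since the model learns to score the real pair high and the two negatives low.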

Implementation

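Tokenize the text and lowercase every word.
Build the vocabulary, discarding words that occur fewer than minCount times.
Build an alias table (Walker 1977) for O(1) negative sampling.
Initialize the embeddings.
Train with skip-gram and negative sampling.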
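The alias-table step can be sketched as follows. This is a minimal standalone version of Walker's alias method, not the code in `alias.ts`; `buildAliasTable` and `sampleAlias` are names made up for this example, and weighting each word by count^0.75 is an assumption based on the `alpha` default listed under Config Defaults.

```typescript
interface AliasTable {
  prob: Float64Array; // probability of keeping slot i rather than jumping to its alias
  alias: Int32Array;  // fallback index for slot i
}

function buildAliasTable(weights: number[]): AliasTable {
  const n = weights.length;
  const total = weights.reduce((sum, w) => sum + w, 0);
  if (n === 0 || total <= 0) throw new Error("weights must be non-empty and positive");
  // Scale so the average weight is exactly 1.
  const scaled = weights.map((w) => (w * n) / total);
  const prob = new Float64Array(n);
  const alias = new Int32Array(n);
  const small: number[] = [];
  const large: number[] = [];
  for (let i = 0; i < n; i++) (scaled[i] < 1 ? small : large).push(i);
  // Pair each under-full slot with an over-full one that tops it up.
  while (small.length > 0 && large.length > 0) {
    const s = small.pop()!;
    const l = large.pop()!;
    prob[s] = scaled[s];
    alias[s] = l;
    scaled[l] = scaled[l] + scaled[s] - 1;
    (scaled[l] < 1 ? small : large).push(l);
  }
  // Whatever is left is full up to rounding error.
  for (const i of [...small, ...large]) {
    prob[i] = 1;
    alias[i] = i;
  }
  return { prob, alias };
}

// O(1) draw: pick a slot uniformly, then keep it or jump to its alias.
function sampleAlias(table: AliasTable, rand: () => number = Math.random): number {
  const i = Math.floor(rand() * table.prob.length);
  return rand() < table.prob[i] ? i : table.alias[i];
}

// Smoothed unigram weights: count^alpha (toy counts, assumed alpha = 0.75).
const counts = [100, 10, 1];
const alpha = 0.75;
const table = buildAliasTable(counts.map((c) => Math.pow(c, alpha)));
const negativeId = sampleAlias(table);
```

Building the table costs O(n) once, after which every negative sample needs only two random numbers and one comparison, regardless of vocabulary size.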

Getting Started

pnpm install
pnpm dev          # starts the UI at http://localhost:5173
pnpm test         # runs all module unit tests
pnpm typecheck    # typechecks both module and ui

Project Structure

word2vec/
├── module/              @word2vec/module — core library, zero dependencies
│   └── src/
│       ├── word2vec.ts    training loop (generator), mostSimilar, analogy
│       ├── math.ts        sigmoid, dot, cosineSimilarity
│       ├── tokenizer.ts   tokenize, countWords
│       ├── alias.ts       Walker's alias method for O(1) sampling
│       ├── projection.ts  PCA and t-SNE for 2D visualization
│       ├── types.ts       all type definitions
│       └── __tests__/     32 unit tests (vitest)
└── ui/                  @word2vec/ui — Vite + React 19 + Tailwind v4
    └── src/
        ├── workers/       Web Worker for off-thread training
        ├── hooks/         useTraining (state machine via useReducer)
        ├── pages/         TrainPage, ExplorePage
        └── components/    FileUpload, ConfigPanel, LossChart, EmbeddingMap, etc.

Module API

import {
  tokenize, countWords,
  train, mostSimilar, analogy, getEmbedding,
  pcaProject2D, tsneProject2D,
} from "@word2vec/module";

Training

const tokens = tokenize("the king rules the kingdom ...");
const wordCounts = countWords(tokens);

// train() is a generator — iterate to drive training
const gen = train(tokens, wordCounts, {
  d: 100,       // embedding dimensions
  L: 5,         // max context window half-size
  k: 10,        // negative samples per positive pair
  eta: 0.025,   // initial learning rate
  epochs: 5,    // full passes over the corpus
  minCount: 2,  // discard words with count below this
  debug: true,  // yield per-pair debug events
});

let model;
for (const event of gen) {
  if (event.type === "init")  console.log(`Vocab: ${event.vocabSize}`);
  if (event.type === "epoch") console.log(`Epoch ${event.epoch} — loss: ${event.avgLoss.toFixed(4)}`);
  if (event.type === "done")  model = event.model;
}

Querying

// Top 10 most similar words
mostSimilar(model, "king", 10);
// → [{ word: "queen", similarity: 0.69 }, ...]

// Analogy: king is to queen as man is to ???
analogy(model, "king", "queen", "man", 5);
// → [{ word: "woman", similarity: 0.58 }, ...]

// Raw embedding vector
getEmbedding(model, "king");
// → number[100]

// Sum of target + context embeddings (sometimes better)
getEmbedding(model, "king", true);
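For intuition, the analogy query is commonly computed as vector arithmetic followed by a nearest-neighbor search: for "a is to b as c is to ?", rank words by cosine similarity to b - a + c. The sketch below illustrates that idea on hand-picked toy vectors. `analogySketch` and `Embeddings` are hypothetical names, and both the formula and the exclusion of the input words are assumptions about how `analogy` behaves internally.

```typescript
type Embeddings = Record<string, number[]>;

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA * normB);
  return denom === 0 ? 0 : dot / denom;
}

// "a is to b as c is to ?": rank candidates by similarity to b - a + c.
function analogySketch(emb: Embeddings, a: string, b: string, c: string, topN: number) {
  const va = emb[a];
  const vb = emb[b];
  const vc = emb[c];
  if (!va || !vb || !vc) throw new Error("unknown word");
  const query = vb.map((x, i) => x - va[i] + vc[i]);
  return Object.keys(emb)
    .filter((word) => word !== a && word !== b && word !== c) // skip the query words themselves
    .map((word) => ({ word, similarity: cosine(query, emb[word]) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topN);
}

// Toy vectors (made up): dimension 0 ~ royalty, dimension 1 ~ gender, dimension 2 ~ "is a person".
const vocab: Embeddings = {
  king: [0.9, 0.1, 0.5],
  queen: [0.9, 0.9, 0.5],
  man: [0.1, 0.1, 0.5],
  woman: [0.1, 0.9, 0.5],
  apple: [0.5, 0.5, -0.9],
};

const results = analogySketch(vocab, "king", "queen", "man", 2);
```

With these toy vectors, queen - king + man lands on the vector for woman, so it ranks first and apple trails far behind.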

2D Projection

// PCA — fast, deterministic
const points: [number, number][] = pcaProject2D(model.W);

// t-SNE — slower, better local structure
const tsnePoints = tsneProject2D(model.W, 30, 500, 10);
//                                        perplexity, iterations, learningRate
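As a rough illustration of what a 2D PCA projection involves, the sketch below centers the rows and finds the top two principal directions by power iteration. It is a stand-in written for this README, not the code in `projection.ts`; `pcaSketch2D` is a hypothetical name, and the module may compute its projection differently.

```typescript
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function pcaSketch2D(rows: number[][], iterations = 100): [number, number][] {
  const n = rows.length;
  const d = rows[0].length;

  // Center the data so each column has zero mean.
  const mean: number[] = new Array(d).fill(0);
  for (const row of rows) for (let j = 0; j < d; j++) mean[j] += row[j] / n;
  const X = rows.map((row) => row.map((v, j) => v - mean[j]));

  const components: number[][] = [];
  for (let c = 0; c < 2; c++) {
    // Fixed start vector keeps the result deterministic.
    let v: number[] = Array.from({ length: d }, (_, j) => Math.cos(j + c + 1));
    for (let it = 0; it < iterations; it++) {
      // next = X^T (X v), i.e. the scatter matrix applied to v.
      const next: number[] = new Array(d).fill(0);
      for (const row of X) {
        const proj = dotProduct(row, v);
        for (let j = 0; j < d; j++) next[j] += proj * row[j];
      }
      // Remove any part that lies along components already found.
      for (const prev of components) {
        const overlap = dotProduct(next, prev);
        for (let j = 0; j < d; j++) next[j] -= overlap * prev[j];
      }
      const norm = Math.sqrt(dotProduct(next, next));
      if (norm < 1e-12) { // no variance left in any remaining direction
        v = new Array(d).fill(0);
        break;
      }
      v = next.map((x) => x / norm);
    }
    components.push(v);
  }
  return X.map((row): [number, number] => [
    dotProduct(row, components[0]),
    dotProduct(row, components[1]),
  ]);
}

// Toy data: most variance along axis 0, less along axis 1, none along axis 2.
const points = pcaSketch2D([[2, 0, 0], [-2, 0, 0], [0, 1, 0], [0, -1, 0]]);
```

The sign of each axis is arbitrary in PCA, so a plot may come out mirrored relative to another implementation while showing the same structure.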

Config Defaults

Parameter	Default	Description
`d`	100	Embedding dimensions
`L`	5	Max context window half-size
`k`	10	Negative samples per positive pair
`alpha`	0.75	Smoothing exponent for the negative-sampling distribution P_alpha
`eta`	0.025	Initial learning rate
`epochs`	5	Full passes over the corpus
`minCount`	2	Min word frequency to keep
`debug`	false	Emit per-pair debug events

References

Mikolov et al. — Efficient Estimation of Word Representations in Vector Space (2013)
Mikolov et al. — Distributed Representations of Words and Phrases and their Compositionality (2013)
Goldberg & Levy — word2vec Explained (2014)
Walker — An Efficient Method for Generating Discrete Random Variables with General Distributions (1977)
van der Maaten & Hinton — Visualizing Data using t-SNE (2008)
