alexgarabt/word2vec

Word2Vec Explorer

A from-scratch TypeScript implementation of Word2Vec Skip-Gram with Negative Sampling (Mikolov et al. 2013)

[Demo video: demo_word2vec-2026-02-20_12.36.33.mp4]

What is Word2Vec?

Word2Vec learns a dense vector (embedding) for every word in a corpus so that words used in similar contexts end up close together in vector space.

The skip-gram variant works like this: for every word in the text, look at the surrounding words (the "context window") and train a model to predict those neighbors. A full softmax over the entire vocabulary is expensive, so we use negative sampling instead: for each real (word, context) pair, we sample k random "negative" words and train the model to distinguish the real pair from the fake ones.
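To make the paragraph above concrete, here is a minimal sketch of pair extraction and a single negative-sampling update. The names `skipGramPairs` and `sgnsStep` are hypothetical and are not the module's API; the sketch assumes a fixed window of half-size L and plain SGD, which may differ from what `word2vec.ts` actually does.

```typescript
// Illustrative helpers only; the repository's math.ts may differ.
const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));
const dot = (a: number[], b: number[]): number =>
  a.reduce((s, ai, i) => s + ai * b[i], 0);

// All (center, context) pairs within a fixed half-window of size L.
function skipGramPairs(tokens: string[], L: number): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < tokens.length; i++) {
    const lo = Math.max(0, i - L);
    const hi = Math.min(tokens.length - 1, i + L);
    for (let j = lo; j <= hi; j++) {
      if (j !== i) pairs.push([tokens[i], tokens[j]]);
    }
  }
  return pairs;
}

// One SGD step for a positive pair and its k sampled negatives.
// w: target vector of the center word; cPos: context vector of the true
// neighbor; cNegs: context vectors of the negatives.
// Mutates all vectors in place and returns the loss before the update:
//   -log sigmoid(w . cPos) - sum over negatives of log sigmoid(-w . cNeg)
function sgnsStep(
  w: number[],
  cPos: number[],
  cNegs: number[][],
  eta: number,
): number {
  const gradW = new Array<number>(w.length).fill(0);
  let loss = 0;
  const update = (c: number[], label: 0 | 1): void => {
    const p = sigmoid(dot(w, c));
    const g = p - label; // derivative of the loss w.r.t. the score w . c
    loss += label === 1 ? -Math.log(p) : -Math.log(1 - p);
    for (let i = 0; i < w.length; i++) {
      gradW[i] += g * c[i]; // accumulate; w is updated once at the end
      c[i] -= eta * g * w[i];
    }
  };
  update(cPos, 1);
  for (const c of cNegs) update(c, 0);
  for (let i = 0; i < w.length; i++) w[i] -= eta * gradW[i];
  return loss;
}
```

Repeated calls on the same vectors should drive the loss down, since each step raises the score of the real pair and lowers the scores of the sampled negatives.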

[Figure: embeddings in a vector space]

Implementation

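  1. Tokenize the text and lowercase every word.
  2. Build the vocabulary, discarding words that occur fewer than minCount times.
  3. Build an alias table (Walker 1977) for O(1) negative sampling.
  4. Initialize the embeddings.
  5. Train.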
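Step 3 in the list above can be sketched as follows. This is a generic version of Walker's alias method (built with the usual small/large worklists), not a copy of `alias.ts`; `buildAliasTable` and `sampleAlias` are hypothetical names, and the weights are assumed to be word counts raised to the alpha exponent (0.75 by default).

```typescript
interface AliasTable {
  prob: number[];  // probability of keeping slot i
  alias: number[]; // fallback index used when slot i is rejected
}

// Build the table from unnormalized, non-negative weights in O(n).
function buildAliasTable(weights: number[]): AliasTable {
  const n = weights.length;
  const total = weights.reduce((s, w) => s + w, 0);
  // Scale so the average slot holds exactly 1 unit of mass.
  const scaled = weights.map((w) => (w / total) * n);
  const prob = new Array<number>(n).fill(1);
  const alias = Array.from({ length: n }, (_, i) => i);
  const small: number[] = [];
  const large: number[] = [];
  scaled.forEach((p, i) => (p < 1 ? small : large).push(i));
  while (small.length > 0 && large.length > 0) {
    const s = small.pop()!;
    const l = large.pop()!;
    prob[s] = scaled[s];
    alias[s] = l;
    // The large slot donates the mass needed to top up the small one.
    scaled[l] -= 1 - scaled[s];
    (scaled[l] < 1 ? small : large).push(l);
  }
  // Any leftovers keep prob = 1 (they differ from 1 only by rounding).
  return { prob, alias };
}

// Draw one index in O(1): pick a slot uniformly, then keep it or take its alias.
function sampleAlias(
  table: AliasTable,
  rand: () => number = Math.random,
): number {
  const i = Math.floor(rand() * table.prob.length);
  return rand() < table.prob[i] ? i : table.alias[i];
}

// Usage: smooth raw counts with alpha = 0.75 before building the table.
const counts = [8, 4, 2, 1];
const table = buildAliasTable(counts.map((c) => Math.pow(c, 0.75)));
const negativeIndex = sampleAlias(table);
```

The table costs O(n) to build once, after which every negative sample is two random numbers and one comparison, regardless of vocabulary size.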

Getting Started

pnpm install
pnpm dev          # starts the UI at http://localhost:5173
pnpm test         # runs all module unit tests
pnpm typecheck    # typechecks both module and ui

Project Structure

word2vec/
├── module/              @word2vec/module — core library, zero dependencies
│   └── src/
│       ├── word2vec.ts    training loop (generator), mostSimilar, analogy
│       ├── math.ts        sigmoid, dot, cosineSimilarity
│       ├── tokenizer.ts   tokenize, countWords
│       ├── alias.ts       Walker's alias method for O(1) sampling
│       ├── projection.ts  PCA and t-SNE for 2D visualization
│       ├── types.ts       all type definitions
│       └── __tests__/     32 unit tests (vitest)
└── ui/                  @word2vec/ui — Vite + React 19 + Tailwind v4
    └── src/
        ├── workers/       Web Worker for off-thread training
        ├── hooks/         useTraining (state machine via useReducer)
        ├── pages/         TrainPage, ExplorePage
        └── components/    FileUpload, ConfigPanel, LossChart, EmbeddingMap, etc.

Module API

import {
  tokenize, countWords,
  train, mostSimilar, analogy, getEmbedding,
  pcaProject2D, tsneProject2D,
} from "@word2vec/module";

Training

const tokens = tokenize("the king rules the kingdom ...");
const wordCounts = countWords(tokens);

// train() is a generator — iterate to drive training
const gen = train(tokens, wordCounts, {
  d: 100,       // embedding dimensions
  L: 5,         // context window half-size
  k: 10,        // negative samples per positive pair
  eta: 0.025,   // initial learning rate
  epochs: 5,    // full passes over the corpus
  minCount: 2,  // discard words with count below this
  debug: true,  // yield per-pair debug events
});

let model;
for (const event of gen) {
  if (event.type === "init")  console.log(`Vocab: ${event.vocabSize}`);
  if (event.type === "epoch") console.log(`Epoch ${event.epoch} — loss: ${event.avgLoss.toFixed(4)}`);
  if (event.type === "done")  model = event.model;
}

Querying

// Top 10 most similar words
mostSimilar(model, "king", 10);
// → [{ word: "queen", similarity: 0.69 }, ...]

// Analogy: king is to queen as man is to ???
analogy(model, "king", "queen", "man", 5);
// → [{ word: "woman", similarity: 0.58 }, ...]

// Raw embedding vector
getEmbedding(model, "king");
// → number[100]

// Sum of target + context embeddings (sometimes better)
getEmbedding(model, "king", true);
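
The queries above can be understood through a small self-contained sketch. It assumes the common formulation (cosine similarity for ranking, and the vector b - a + c for analogies); whether `mostSimilar` and `analogy` are implemented exactly this way is not confirmed by this README. The `Embeddings` type, the `*Sketch` functions, and the toy vectors are all hypothetical.

```typescript
type Embeddings = Map<string, number[]>;
interface Neighbor { word: string; similarity: number; }

function cosineSimilarity(a: number[], b: number[]): number {
  let ab = 0, aa = 0, bb = 0;
  for (let i = 0; i < a.length; i++) {
    ab += a[i] * b[i];
    aa += a[i] * a[i];
    bb += b[i] * b[i];
  }
  const denom = Math.sqrt(aa) * Math.sqrt(bb);
  return denom === 0 ? 0 : ab / denom;
}

// Rank every word (except the excluded ones) by cosine similarity to `query`.
function nearest(
  emb: Embeddings,
  query: number[],
  exclude: Set<string>,
  topN: number,
): Neighbor[] {
  const scored: Neighbor[] = [];
  emb.forEach((vec, word) => {
    if (!exclude.has(word)) {
      scored.push({ word, similarity: cosineSimilarity(query, vec) });
    }
  });
  return scored.sort((x, y) => y.similarity - x.similarity).slice(0, topN);
}

function mostSimilarSketch(emb: Embeddings, word: string, topN: number): Neighbor[] {
  const v = emb.get(word);
  return v ? nearest(emb, v, new Set([word]), topN) : [];
}

// "a is to b as c is to ?": search near b - a + c, skipping the inputs.
function analogySketch(
  emb: Embeddings,
  a: string,
  b: string,
  c: string,
  topN: number,
): Neighbor[] {
  const va = emb.get(a);
  const vb = emb.get(b);
  const vc = emb.get(c);
  if (!va || !vb || !vc) return [];
  const query = vb.map((x, i) => x - va[i] + vc[i]);
  return nearest(emb, query, new Set([a, b, c]), topN);
}

// Hand-made 2D vectors: axis 0 is "royalty", axis 1 is "gender".
const toy: Embeddings = new Map<string, number[]>([
  ["king", [1, 1]],
  ["queen", [1, -1]],
  ["man", [0.1, 1]],
  ["woman", [0.1, -1]],
  ["apple", [-1, 0.1]],
]);
const top = analogySketch(toy, "king", "queen", "man", 1);
// top[0].word is "woman"
```

Excluding the three input words matters in practice: without it, the nearest neighbor of b - a + c is often one of the inputs themselves.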

2D Projection

// PCA — fast, deterministic
const points: [number, number][] = pcaProject2D(model.W);

// t-SNE — slower, better local structure
const tsnePoints = tsneProject2D(model.W, 30, 500, 10);
//                                        perplexity, iterations, learningRate
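
For readers curious what a 2D PCA projection involves, the following is a rough sketch using power iteration with deflation. It is an assumption-laden illustration, not the code in `projection.ts`: `pcaSketch2D` and `topEigenvector` are hypothetical names, the input is assumed to be an array of at least two equal-length rows, and a production implementation might use a different eigen-solver.

```typescript
const dotProduct = (a: number[], b: number[]): number =>
  a.reduce((s, x, i) => s + x * b[i], 0);

// Dominant eigenvector of a symmetric matrix via power iteration.
function topEigenvector(m: number[][], iters = 200): number[] {
  // A fixed, non-uniform start vector keeps the result deterministic.
  const start = m.map((_, i) => 1 / (i + 1));
  const startNorm = Math.sqrt(dotProduct(start, start));
  let v = start.map((x) => x / startNorm);
  for (let t = 0; t < iters; t++) {
    const next = m.map((row) => dotProduct(row, v));
    const norm = Math.sqrt(dotProduct(next, next));
    if (norm === 0) break; // no variance left in this matrix
    v = next.map((x) => x / norm);
  }
  return v;
}

function pcaSketch2D(rows: number[][]): [number, number][] {
  const n = rows.length;
  const d = rows[0].length;
  // Center each column on its mean.
  const mean = Array.from({ length: d }, (_, j) =>
    rows.reduce((s, r) => s + r[j], 0) / n);
  const X = rows.map((r) => r.map((x, j) => x - mean[j]));
  // Sample covariance matrix (d x d).
  const cov = Array.from({ length: d }, (_, i) =>
    Array.from({ length: d }, (_, j) =>
      X.reduce((s, r) => s + r[i] * r[j], 0) / (n - 1)));
  const pc1 = topEigenvector(cov);
  const lambda1 = dotProduct(pc1, cov.map((row) => dotProduct(row, pc1)));
  // Deflate: remove the first component so the next one dominates.
  const deflated = cov.map((row, i) =>
    row.map((x, j) => x - lambda1 * pc1[i] * pc1[j]));
  const pc2 = topEigenvector(deflated);
  return X.map((r): [number, number] => [dotProduct(r, pc1), dotProduct(r, pc2)]);
}
```

Because the start vector is fixed, the same input always yields the same projection, which matches the "deterministic" property noted above; the sign of each axis is still arbitrary, as with any PCA.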

Config Defaults

| Parameter | Default | Description |
|-----------|---------|-------------|
| d | 100 | Embedding dimensions |
| L | 5 | Max context window half-size |
| k | 10 | Negative samples per positive pair |
| alpha | 0.75 | Smoothing exponent for P_alpha |
| eta | 0.025 | Initial learning rate |
| epochs | 5 | Full passes over the corpus |
| minCount | 2 | Min word frequency to keep |
| debug | false | Emit per-pair debug events |
