Small Large Concept Model (LCM)

This repository is a pure, lightweight implementation of a Large Concept Model-style pipeline in PyTorch. It is designed to be small and fast (about 5M parameters in this implementation profile), making it practical for experimentation on limited hardware.

Instead of predicting the next word token directly, the model works in a sentence-level concept space (dense embeddings), then learns to predict the next concept representation.

Concept Overview

The core idea is:

map text to concept embeddings (sentence vectors),
model concept transitions autoregressively,
optionally decode or compare predicted concepts downstream.

Your project illustration:

This figure highlights the same high-level design: words are mapped into concept space, the concept model predicts future concepts, and outputs can be mapped back to text-level behavior.

How This Repo Relates to Official LCM

From the official LCM project, the relevant takeaway is that LCMs operate in a sentence representation space (SONAR concept space) and can be trained as concept-level sequence models rather than token-level LMs.
The official work also explores multiple variants (including MSE and diffusion-style approaches) at very large scale.

This repository keeps only the practical, minimal core:

compact architecture,
easier training loop,
faster iteration cycle,
small parameter budget.

Reference: facebookresearch/large_concept_model

Why a Small (5M) Concept Model?

Fast experimentation: quicker training and debugging.
Lower compute cost: easier to run on a single GPU or CPU for prototypes.
Simple baseline: clear foundation before scaling to larger concept models.
Educational clarity: code is short enough to inspect end to end.

Architecture at a Glance

Implemented in src/base_lcm.py:

SonarEncoder: converts sentences into dense concept embeddings.
PreNet: projects input embedding dimension into model hidden space.
TransformerDecoder: autoregressive concept transition model (causal mask).
PostNet: projects hidden states back to output concept embedding space.
BaseLCM: wraps PreNet -> TransformerDecoder -> PostNet.

Diagram:

flowchart LR
    A[Input Sentences] --> B[SONAR Encoder]
    B --> C[Concept Embeddings]
    C --> D[PreNet]
    D --> E[Transformer Decoder<br/>causal concept modeling]
    E --> F[PostNet]
    F --> G[Predicted Next Concept Embedding]

Training Pipeline

Training logic is in src/train.py.

Steps

load text samples from a Hugging Face dataset,
split text into sentences with spaCy,
encode sentences into concept embeddings (SonarEncoder),
add controlled noise to create targets,
optimize MSE loss between predicted and target concept embeddings,
save checkpoint to saved_models/base_lcm_model.pth.

flowchart TD
    A[HF Dataset] --> B[Sentence Split]
    B --> C[SONAR Encoding]
    C --> D[Input Embeddings]
    D --> E[BaseLCM]
    C --> F[Noise Injection]
    F --> G[Target Embeddings]
    E --> H[MSE Loss with Targets]
    H --> I[Backprop + AdamW]
    I --> J[Checkpoint Saved]

Repository Structure

large_concept_model_tt/
├── img/
│   └── LCM.gif
├── src/
│   ├── base_lcm.py
│   ├── train.py
│   ├── test.py
│   └── utils.py
├── saved_models/
├── requirements.txt
└── README.md

Installation

Requirements:

Python 3.12+
PyTorch-compatible environment (GPU optional, recommended for speed)

Install dependencies:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Run Training

Minimal example:

PYTHONPATH=src python -m src.train --hf_data "wikitext" --text_column "text" --data_sample 1000 --epoch 3

Extended example:

PYTHONPATH=src python -m src.train \
  --hf_data "wikitext" \
  --text_column "text" \
  --lang "en" \
  --batch_size 8 \
  --sequence_length 10 \
  --input_dim 256 \
  --hidden_dim 512 \
  --num_heads 8 \
  --num_layers 6 \
  --ff_dim 2048 \
  --output_dim 256 \
  --lr 0.001 \
  --weight_decay 1e-4 \
  --noise_level 0.05 \
  --data_sample 1000

Checkpoint output:

saved_models/base_lcm_model.pth

Inference / Quick Test

Run:

PYTHONPATH=src python src/test.py

src/test.py will:

load a saved checkpoint,
infer model config from state dict shapes,
encode a sample prompt into concept space,
produce and print predicted concept embedding output.

Results (Template)

Use this section as a lightweight experiment log for reproducibility.

Training Loss

Run ID	Dataset	Samples	Epochs	Batch Size	LR	Final Train Loss	Notes
baseline-001	wikitext	1000	3	8	1e-3	0.0418	first stable run
exp-002	wikitext	2000	5	8	5e-4	0.0297	lower LR + more data improved stability

Sample Outputs

Input Prompt	Expected Semantic Direction	Predicted Output (short)	Comment
"What are the causes of climate change?"	science / environment explanation	fill from model output	check topical coherence
"How can I learn Python quickly?"	practical step-by-step advice	fill from model output	check actionable structure

Quick Interpretation

Lower MSE generally indicates better embedding-space reconstruction.
Also validate semantic quality manually (not just loss), since embedding-space fit does not always guarantee best text-level usefulness.

Optional W&B Logging

Enable in training with:

--wandb True

Then authenticate once:

wandb login

Known Limitations

Small model capacity: the ~5M parameter setup is fast, but may underfit complex long-range concept dynamics.
No official benchmark parity: this repo is a compact educational/experimental implementation, not a reproduction of large-scale LCM training results.
Embedding-space objective gap: MSE on embeddings may not perfectly align with downstream text generation quality.
Single checkpoint workflow: training currently saves one primary checkpoint path (saved_models/base_lcm_model.pth) without richer checkpoint management.
Device selection behavior: CUDA device handling in src/train.py is hardcoded (CUDA_VISIBLE_DEVICES="1,2,3"), which may not match all environments.
Data preprocessing simplicity: sentence splitting and sampling are basic; performance can vary significantly with better curation and preprocessing.

Practical Notes

--hf_data is required for dataset loading.
Tune --batch_size and --data_sample for your memory budget.
The code currently sets CUDA_VISIBLE_DEVICES="1,2,3" when CUDA is available.
This implementation is intentionally compact and optimized for speed and iteration, not for reproducing large-scale training runs.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
img		img
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Small Large Concept Model (LCM)

Concept Overview

How This Repo Relates to Official LCM

Why a Small (5M) Concept Model?

Architecture at a Glance

Training Pipeline

Steps

Repository Structure

Installation

Run Training

Inference / Quick Test

Results (Template)

Training Loss

Sample Outputs

Quick Interpretation

Optional W&B Logging

Known Limitations

Practical Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Small Large Concept Model (LCM)

Concept Overview

How This Repo Relates to Official LCM

Why a Small (5M) Concept Model?

Architecture at a Glance

Training Pipeline

Steps

Repository Structure

Installation

Run Training

Inference / Quick Test

Results (Template)

Training Loss

Sample Outputs

Quick Interpretation

Optional W&B Logging

Known Limitations

Practical Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages