This repository is a pure, lightweight implementation of a Large Concept Model-style pipeline in PyTorch. It is designed to be small and fast (about 5M parameters in this implementation profile), making it practical for experimentation on limited hardware.
Instead of predicting the next word token directly, the model works in a sentence-level concept space (dense embeddings), then learns to predict the next concept representation.
The core idea is:
- map text to concept embeddings (sentence vectors),
- model concept transitions autoregressively,
- optionally decode or compare predicted concepts downstream.
Project illustration:
This figure highlights the same high-level design: words are mapped into concept space, the concept model predicts future concepts, and outputs can be mapped back to text-level behavior.
From the official LCM project, the relevant takeaway is that LCMs operate in a sentence representation space (SONAR concept space) and can be trained as concept-level sequence models rather than token-level LMs.
The official work also explores multiple variants (including MSE and diffusion-style approaches) at very large scale.
This repository keeps only the practical, minimal core:
- compact architecture,
- easier training loop,
- faster iteration cycle,
- small parameter budget.
Reference: facebookresearch/large_concept_model
- Fast experimentation: quicker training and debugging.
- Lower compute cost: easier to run on a single GPU or CPU for prototypes.
- Simple baseline: clear foundation before scaling to larger concept models.
- Educational clarity: code is short enough to inspect end to end.
Implemented in src/base_lcm.py:
- SonarEncoder: converts sentences into dense concept embeddings.
- PreNet: projects the input embedding dimension into the model's hidden space.
- TransformerDecoder: autoregressive concept transition model (causal mask).
- PostNet: projects hidden states back to the output concept embedding space.
- BaseLCM: wraps PreNet -> TransformerDecoder -> PostNet.
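The wiring listed above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code in src/base_lcm.py: the class name TinyConceptModel, the attribute names, and the small default sizes are assumptions chosen so the example runs quickly, and nn.TransformerEncoder with a causal mask stands in for the repository's TransformerDecoder.

```python
import torch
from torch import nn


class TinyConceptModel(nn.Module):
    """Hypothetical stand-in for BaseLCM: PreNet -> causal Transformer -> PostNet."""

    def __init__(self, input_dim=256, hidden_dim=64, num_heads=4,
                 num_layers=2, ff_dim=128, output_dim=256):
        super().__init__()
        self.pre_net = nn.Linear(input_dim, hidden_dim)    # concept dim -> hidden dim
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dim_feedforward=ff_dim, batch_first=True,
        )
        # Self-attention layers plus a causal mask act as a decoder-only stack.
        self.decoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.post_net = nn.Linear(hidden_dim, output_dim)  # hidden dim -> concept dim

    def forward(self, concepts):
        # concepts: (batch, sequence_length, input_dim)
        seq_len = concepts.size(1)
        # -inf above the diagonal blocks attention to later positions.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=concepts.device),
            diagonal=1,
        )
        hidden = self.decoder(self.pre_net(concepts), mask=mask)
        return self.post_net(hidden)


model = TinyConceptModel().eval()   # eval() disables dropout
x = torch.randn(2, 10, 256)         # 2 sequences of 10 concept vectors
y = model(x)
print(y.shape)                      # torch.Size([2, 10, 256])
```

Because of the mask, the output at position t depends only on concepts 0..t, which is what allows each position to be trained as a prediction made without seeing later concepts.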
Diagram:
flowchart LR
A[Input Sentences] --> B[SONAR Encoder]
B --> C[Concept Embeddings]
C --> D[PreNet]
D --> E[Transformer Decoder<br/>causal concept modeling]
E --> F[PostNet]
F --> G[Predicted Next Concept Embedding]
Training logic is in src/train.py.
- load text samples from a Hugging Face dataset,
- split text into sentences with spaCy,
- encode sentences into concept embeddings (SonarEncoder),
- add controlled noise to create targets,
- optimize MSE loss between predicted and target concept embeddings,
- save checkpoint to saved_models/base_lcm_model.pth.
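The noise and loss steps above can be sketched as follows. This is an illustration under assumptions rather than the contents of src/train.py: the helper names make_noisy_targets and train_step are hypothetical, the noise is assumed to be Gaussian and scaled by the noise level, and an nn.Linear layer stands in for the real model so the example stays small.

```python
import torch
from torch import nn


def make_noisy_targets(embeddings, noise_level):
    """Targets = embeddings plus Gaussian noise scaled by noise_level (assumed form)."""
    return embeddings + noise_level * torch.randn_like(embeddings)


def train_step(model, optimizer, inputs, targets):
    """One optimization step on the MSE between predictions and targets."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = nn.Linear(16, 16)            # placeholder for the real concept model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=1e-4)

inputs = torch.randn(8, 10, 16)      # (batch, sequence_length, embedding_dim)
targets = make_noisy_targets(inputs, noise_level=0.05)

# Repeating the step on one fixed batch should drive the loss down.
losses = [train_step(model, optimizer, inputs, targets) for _ in range(50)]
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

In a real run the inputs would come from the sentence encoder, and the model's state dict would be saved once training finishes.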
flowchart TD
A[HF Dataset] --> B[Sentence Split]
B --> C[SONAR Encoding]
C --> D[Input Embeddings]
D --> E[BaseLCM]
C --> F[Noise Injection]
F --> G[Target Embeddings]
E --> H[MSE Loss with Targets]
H --> I[Backprop + AdamW]
I --> J[Checkpoint Saved]
large_concept_model_tt/
├── img/
│ └── LCM.gif
├── src/
│ ├── base_lcm.py
│ ├── train.py
│ ├── test.py
│ └── utils.py
├── saved_models/
├── requirements.txt
└── README.md
Requirements:
- Python 3.12+
- PyTorch-compatible environment (GPU optional, recommended for speed)
Install dependencies:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
Minimal example:
PYTHONPATH=src python -m src.train --hf_data "wikitext" --text_column "text" --data_sample 1000 --epoch 3
Extended example:
PYTHONPATH=src python -m src.train \
--hf_data "wikitext" \
--text_column "text" \
--lang "en" \
--batch_size 8 \
--sequence_length 10 \
--input_dim 256 \
--hidden_dim 512 \
--num_heads 8 \
--num_layers 6 \
--ff_dim 2048 \
--output_dim 256 \
--lr 0.001 \
--weight_decay 1e-4 \
--noise_level 0.05 \
--data_sample 1000
Checkpoint output:
saved_models/base_lcm_model.pth
Run:
PYTHONPATH=src python src/test.py
src/test.py will:
- load a saved checkpoint,
- infer model config from state dict shapes,
- encode a sample prompt into concept space,
- produce and print predicted concept embedding output.
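Inferring the model config from state dict shapes can be sketched like this. The function infer_config and every parameter key name below are hypothetical; the real keys depend on how the modules in src/base_lcm.py are named. The sketch works on plain shape tuples so it runs without a checkpoint file.

```python
def infer_config(shapes):
    """Recover layer sizes from parameter shapes.

    `shapes` maps parameter names to shape tuples. The key names used here
    are hypothetical placeholders, not the repository's actual keys.
    """
    hidden_dim, input_dim = shapes["pre_net.weight"]    # Linear weight is (out, in)
    output_dim, _ = shapes["post_net.weight"]
    ff_dim, _ = shapes["decoder.layers.0.linear1.weight"]
    # Count the distinct layer indices that follow "decoder.layers.".
    layer_ids = {
        name.split(".")[2] for name in shapes if name.startswith("decoder.layers.")
    }
    return {
        "input_dim": input_dim,
        "hidden_dim": hidden_dim,
        "output_dim": output_dim,
        "ff_dim": ff_dim,
        "num_layers": len(layer_ids),
    }


# With a real checkpoint that stores a plain state dict, the shapes could be
# collected along these lines:
#   state = torch.load("saved_models/base_lcm_model.pth", map_location="cpu")
#   shapes = {name: tuple(tensor.shape) for name, tensor in state.items()}
example_shapes = {
    "pre_net.weight": (512, 256),
    "pre_net.bias": (512,),
    "decoder.layers.0.linear1.weight": (2048, 512),
    "decoder.layers.1.linear1.weight": (2048, 512),
    "post_net.weight": (256, 512),
}
print(infer_config(example_shapes))
```

One caveat of this approach: with standard multi-head attention the number of heads does not change any tensor shape, so it has to be supplied separately or fixed by convention.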
Use this section as a lightweight experiment log for reproducibility.
| Run ID | Dataset | Samples | Epochs | Batch Size | LR | Final Train Loss | Notes |
|---|---|---|---|---|---|---|---|
| baseline-001 | wikitext | 1000 | 3 | 8 | 1e-3 | 0.0418 | first stable run |
| exp-002 | wikitext | 2000 | 5 | 8 | 5e-4 | 0.0297 | lower LR + more data improved stability |
| Input Prompt | Expected Semantic Direction | Predicted Output (short) | Comment |
|---|---|---|---|
| "What are the causes of climate change?" | science / environment explanation | fill from model output | check topical coherence |
| "How can I learn Python quickly?" | practical step-by-step advice | fill from model output | check actionable structure |
- Lower MSE generally indicates better embedding-space reconstruction.
- Also validate semantic quality manually (not just loss), since embedding-space fit does not always guarantee best text-level usefulness.
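A small numeric example shows why loss alone can mislead. Cosine similarity is used here only as one possible complementary check; it is an assumption of this sketch, not something the repository is stated to compute. The vectors are toy values.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


reference = [1.0, 0.0, 0.0, 0.0]
scaled = [3.0, 0.0, 0.0, 0.0]     # same direction, larger norm
rotated = [0.0, 1.0, 0.0, 0.0]    # same norm, orthogonal direction

print(mse(reference, scaled), cosine_similarity(reference, scaled))    # 1.0 1.0
print(mse(reference, rotated), cosine_similarity(reference, rotated))  # 0.5 0.0
```

The rotated vector has the lower MSE even though it points in an unrelated direction, while the scaled vector has a higher MSE but the same direction as the reference. Which of the two matters more depends on how the downstream decoder or comparison uses the embedding.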
Enable in training with:
--wandb True
Then authenticate once:
wandb login
- Small model capacity: the ~5M parameter setup is fast, but may underfit complex long-range concept dynamics.
- No official benchmark parity: this repo is a compact educational/experimental implementation, not a reproduction of large-scale LCM training results.
- Embedding-space objective gap: MSE on embeddings may not perfectly align with downstream text generation quality.
- Single checkpoint workflow: training currently saves one primary checkpoint path (saved_models/base_lcm_model.pth) without richer checkpoint management.
- Device selection behavior: CUDA device handling in src/train.py is hardcoded (CUDA_VISIBLE_DEVICES="1,2,3"), which may not match all environments.
- Data preprocessing simplicity: sentence splitting and sampling are basic; performance can vary significantly with better curation and preprocessing.
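For the single-checkpoint limitation, one lightweight option is to write a separate file per epoch. The helper checkpoint_path and its naming scheme are hypothetical and not part of the repository; the sketch only builds the path.

```python
from pathlib import Path


def checkpoint_path(directory, epoch, loss):
    """Build a per-epoch file name that records the epoch and the training loss."""
    return Path(directory) / f"base_lcm_epoch{epoch:03d}_loss{loss:.4f}.pth"


path = checkpoint_path("saved_models", epoch=3, loss=0.0418)
print(path.name)   # base_lcm_epoch003_loss0.0418.pth

# In a training loop, torch.save(model.state_dict(), path) would then keep
# one file per epoch instead of overwriting a single checkpoint.
```

Zero-padding the epoch keeps the files in training order when sorted by name.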
- --hf_data is required for dataset loading.
- Tune --batch_size and --data_sample for your memory budget.
- The code currently sets CUDA_VISIBLE_DEVICES="1,2,3" when CUDA is available.
- This implementation is intentionally compact and optimized for speed and iteration, not for reproducing large-scale training runs.
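If the hardcoded device list does not fit your machine, one possible adjustment is to set CUDA_VISIBLE_DEVICES only when it is not already defined. The helper configure_visible_devices and its fallback value are hypothetical and not taken from src/train.py; the env parameter exists only so the behavior can be shown without touching the real environment.

```python
import os


def configure_visible_devices(fallback="0", env=None):
    """Set CUDA_VISIBLE_DEVICES only when the caller has not set it already.

    Call this before the first CUDA call, since the variable is read when
    CUDA is initialized.
    """
    env = os.environ if env is None else env
    env.setdefault("CUDA_VISIBLE_DEVICES", fallback)
    return env["CUDA_VISIBLE_DEVICES"]


print(configure_visible_devices(env={}))                                 # 0
print(configure_visible_devices(env={"CUDA_VISIBLE_DEVICES": "1,2,3"}))  # 1,2,3
```

With this pattern a value exported in the shell takes precedence, so the same script can run on a single-GPU laptop and on a multi-GPU server without edits.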
