This repository contains an educational PyTorch implementation of the BDH-GPU architecture proposed in the paper:
A. Kosowski, P. Uznański, J. Chorowski, Z. Stamirowska, M. Bartoszkiewicz. The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain, arXiv (2025).
BDH is a novel Large Language Model architecture based on a scale-free, biologically-inspired network of locally-interacting neurons.
I find the paper particularly fascinating for its elegant synthesis of concepts from neuroscience, distributed computing, dynamical systems, and formal logic into a single, GPU-friendly architecture.
The model is trained on a pathfinding task: given an N×N board with obstacles, find the shortest path from START to END.
| Left Panel: Board Predictions | Right Panel: Neuron Dynamics (Gx = E @ Dx) |
|---|---|
| The model's output, refined layer by layer. Legend: FLOOR (white), WALL (black), START (red), END (green), PATH (gold). | Signal flow through the learned "causal circuit", the neuron-to-neuron connectivity graph. Blue rings: source neurons (`y_{l-1}`). Red fill: destination neurons (`x_l`). Edge darkness: signal flow, `y_{l-1} × Gx × x_l`. Activations are averaged across all board cells to produce one value per neuron. |
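For reference, the quantity visualized in the right panel can be computed directly from the weights and activations. Here is a minimal sketch; the shapes and variable names are assumptions for illustration, not necessarily those used in `boardpath.py`:

```python
import torch

# Assumed sizes for illustration (the repo uses ~8k neurons): n neurons, latent dim d.
n, d = 1024, 64
E = torch.randn(n, d)    # encoder weights (assumed layout)
Dx = torch.randn(d, n)   # decoder weights for x (assumed layout)

Gx = E @ Dx              # (n, n) neuron-to-neuron "causal circuit"

# Per-edge signal flow between consecutive layers: flow[i, j] is large when
# source neuron i was active (y_prev), the learned edge Gx[i, j] is strong,
# and destination neuron j fires (x_curr).
y_prev = torch.rand(n)   # y activations from layer l-1 (averaged over board cells)
x_curr = torch.rand(n)   # x activations at layer l
flow = y_prev[:, None] * Gx * x_curr[None, :]   # (n, n) edge-flow matrix
```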
BDH's architecture enables direct visualization of its internal computation. The challenge is that this computation superimposes multiple topologies: fixed learned circuits (the model weights) and dynamic activations that change at inference time.
The model has 8,000+ neurons, but for clarity I render only the hub subgraph, selected by connectivity degree (see the sketch below). Specifically: neurons are ranked by their degree in `Gx` (counting edges where `|Gx[i,j]| > threshold`), the top candidates are selected, and small disconnected components are pruned. Remarkably, the sparse, modular organization you see is emergent: the model was not hard-coded to have hubs, but spontaneously organized itself this way from random initialization, replicating the paper's empirical findings.
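A minimal sketch of that selection step (the `top_k` value, threshold, and function name are illustrative defaults, not the repo's actual settings):

```python
import torch

def hub_subgraph(Gx: torch.Tensor, top_k: int = 200, threshold: float = 0.05):
    """Rank neurons by degree in Gx and return the induced hub subgraph.

    `top_k` and `threshold` are illustrative; the final pruning of small
    disconnected components (e.g. via networkx) is omitted here.
    """
    adj = Gx.abs() > threshold                 # binarize: keep edges with |Gx[i, j]| > threshold
    degree = adj.sum(dim=0) + adj.sum(dim=1)   # in-degree + out-degree per neuron
    hubs = degree.topk(top_k).indices          # highest-connectivity candidates
    return hubs, adj[hubs][:, hubs]            # induced adjacency on the hub set
```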
| Left Panel: Board Attention | Right Panel: Sparsity Dynamics |
|---|---|
| The model's output, refined layer by layer, with extra detail. Blue arrows: the top 30 strongest cell-to-cell attentions. Red dots: proportion of active neurons (`x`) per cell. PATH cells in gold, with confidence shown via alpha. | Percentage of neurons active per layer. Red (`x`): ~20%. Blue (`y`): ~3-5%. |
Blue arrows show attention initially radiating from START and END toward neighboring cells. As the path extends from both endpoints, attention shifts to the newly predicted cells, flowing outward to discover the remaining route until the path connects in the middle.
Red dots show more neurons firing at START, END, and WALL cells, with PATH cells activating progressively as predictions solidify.
The chart confirms that y activations are indeed very sparse throughout inference.
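The per-layer percentages in that chart boil down to counting nonzero activations. A sketch of how such a measurement could look (the helper name is hypothetical):

```python
import torch

def active_fraction(act: torch.Tensor) -> float:
    """Hypothetical helper: fraction of strictly positive entries in an
    activation tensor (activations are non-negative, so > 0 means 'firing')."""
    return (act > 0).float().mean().item()
```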
The BDH architecture introduces several design choices that distinguish it from conventional Transformers and enable the causal interpretability shown above.
- Neuron-Centric Scaling: The model scales primarily along the high-dimensional neuron dimension `n`, rather than the dense latent dimension of Transformers. Parameters and state are localized to specific neuron pairs, mirroring biological structure.
- Fixed Topologies as "Learned Programs": The model weights define sparse, scale-free graphs that act as the system's fixed ruleset:
  - The Causal Circuit (`Gx = E @ Dx`): Implements signal propagation from `y` to `x`, a probabilistic form of Modus Ponens reasoning ("if concept A is active, trigger concept B"). The paper calls these the "wires".
  - The Output Circuit (`Gy = Dy @ E`): Determines which neurons (`y`) should fire based on the attention-weighted context. The paper calls these the "prods".
- Dynamic Synaptic State (Edge-Reweighting): Instead of a vector-based KV-cache, the model maintains "fast weights" on the edges between neurons (the matrix σ). This state is updated via a Hebbian learning rule ("neurons that fire together, wire together"), allowing the model to dynamically re-weight its own reasoning circuits over the duration of the context (see the sketch after this list).
- Sparse & Positive Activations: The architecture enforces all activation vectors to be non-negative and sparse. As noted in the paper, `y` activations are observed to be "extremely sparse" in practice (~3-5%). This design prevents the polysemantic "superposition" common in dense models, effectively filtering noise and isolating distinct logical paths.
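As promised above, here is a minimal sketch of a Hebbian edge-reweighting step. The decay factor, shapes, and outer-product form are simplified assumptions, not the paper's exact update rule:

```python
import torch

def hebbian_step(sigma: torch.Tensor, x: torch.Tensor, y: torch.Tensor,
                 decay: float = 0.99) -> torch.Tensor:
    """One simplified fast-weight update on the (n, n) synaptic state `sigma`.

    "Neurons that fire together, wire together": edge (i, j) is strengthened
    in proportion to the co-activation y[i] * x[j], while older state decays.
    Both `decay` and the outer-product form are illustrative assumptions.
    """
    return decay * sigma + torch.outer(y, x)
```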
Install the dependencies:

```bash
pip install -r requirements.txt
```

To train a new model from scratch, run:

```bash
python3 boardpath.py --mode train
```

Optional: you can ensure reproducibility by setting a fixed random seed:

```bash
python3 boardpath.py --mode train --seed 42
```

The trained model will be saved to `boardpath.pt`.

To load a trained model and run it on a randomly generated board:

```bash
python3 boardpath.py --mode inference
```

Optional: if you have a specific checkpoint file you wish to load:

```bash
python3 boardpath.py --mode inference --model my_model.pt
```

This will print the input, target, and predicted boards to the console and generate the visualizations:

- `combined_board_neuron.gif`: board predictions + neuron dynamics (shown in the demo above)
- `combined_attention_sparsity.gif`: board attention + sparsity animation (shown in the demo above)
- `sparsity_chart.png`: static sparsity summary
To adjust the model architecture or task parameters (e.g., board size, number of neurons), edit the `get_config()` function in `boardpath.py`.
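For orientation, a hypothetical shape of that function; the actual keys and defaults in `boardpath.py` may differ:

```python
def get_config():
    # Hypothetical keys and values for illustration only; check boardpath.py
    # for the real configuration.
    return {
        "board_size": 8,     # N for the N x N board
        "n_neurons": 8192,   # neuron dimension n
        "n_layers": 6,       # number of BDH layers
    }
```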