percolation-synthetic-data

Generates statistically self-similar synthetic datasets based on a percolation cluster model.

About

Ambitious mechanistic interpretability requires understanding the structure that neural networks uncover from data. A quantitative theoretical model of natural data's organizing structure could help AI safety researchers build interpretability tools that decompose neural networks along their natural scales of abstraction. This project works towards this goal by developing a synthetic data model that reproduces qualitative features of natural data. The model is based on high-dimensional percolation theory and describes data distributions whose structure is statistically self-similar, sparse, and power-law distributed.

This repository provides code to generate synthetic datasets based on this data model. In particular, it employs a newly developed algorithm to construct a dataset in a way that explicitly and iteratively reveals its innate hierarchical structure. Increasing the number of data points corresponds to representing the same dataset at a more fine-grained level of abstraction.

Percolation Theory

The branch of physics concerned with analyzing the properties of clusters of randomly occupied units on a lattice is called percolation theory (Stauffer & Aharony, 1994). In this framework, sites (or bonds) are occupied independently at random with probability $p$, and connected sites form clusters. While direct numerical simulation of percolation on a high-dimensional lattice is intractable due to the curse of dimensionality, the high-dimensional problem is exactly solvable analytically. Clusters are vanishingly unlikely to have loops, and the problem can be approximated by modeling the lattice as an infinite tree. This can be viewed as the mean-field approximation for percolation. In particular, percolation clusters on a high-dimensional lattice (at or above the upper critical dimension, $d \geq 6$) that are at or near criticality can be modeled using the Bethe lattice, an infinite treelike graph in which each node has identical degree $z$. For site or bond percolation on the Bethe lattice, the percolation threshold is $p_c = 1/(z - 1)$. Using the Bethe lattice as an approximate model of a hypercubic lattice of dimension $d$ gives $z = 2d$ and $p_c = 1/(2d - 1)$. A brief self-contained review based on standard references can be found in Brill (2025, App. A).
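As a quick numerical illustration of the threshold formulas above, the following minimal sketch evaluates $p_c$ for a few lattice dimensions. The function names `bethe_threshold` and `hypercubic_threshold_estimate` are hypothetical helpers introduced here for illustration and are not part of this repository.

```python
def bethe_threshold(z):
    """Percolation threshold p_c = 1 / (z - 1) on a Bethe lattice of degree z.

    Intuition: away from the root, each occupied node has z - 1 onward
    neighbors, each occupied with probability p. The expected number of
    occupied onward neighbors is (z - 1) * p, which equals 1 at threshold.
    """
    if z < 2:
        raise ValueError("the degree z must be at least 2")
    return 1.0 / (z - 1)


def hypercubic_threshold_estimate(d):
    """Bethe-lattice estimate for a d-dimensional hypercubic lattice (z = 2d)."""
    return bethe_threshold(2 * d)


if __name__ == "__main__":
    for d in (6, 10, 100):
        print(f"d = {d}: p_c ~ {hypercubic_threshold_estimate(d):.5f}")
```

The estimate decreases roughly as $1/(2d)$, so in high dimensions only a small fraction of sites needs to be occupied for the system to be critical.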

Algorithm

This repository implements an algorithm to simulate a data distribution modeled as a critical percolation cluster distribution on a large high-dimensional lattice, using an explicitly hierarchical approach. The algorithm consists of two stages. First, in the generation stage, a set of percolation clusters is generated. The generation stage produces a set of undirected, treelike graphs representing the clusters, and a forest of binary latent features that denote each point's membership in a cluster or subcluster. These graphs are generated using a cyclic coalescent algorithm that jointly samples a random tree and its hierarchical latent decomposition in almost linear time. Each point has an associated value that is a function of its latent subcluster membership features. Second, in the embedding stage, the graphs are embedded into a vector space following a branching random walk.
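To make the two stages concrete, here is a simplified sketch. It is not the algorithm implemented in this repository: instead of the cyclic coalescent, it grows a single tree as a Galton-Watson branching process at the Bethe-lattice threshold, and it does not produce the latent feature hierarchy. The function names, the unit step length, and the `max_nodes` cap are all assumptions made for illustration.

```python
import numpy as np


def sample_tree(z, max_nodes, rng):
    """Grow a random tree at the Bethe-lattice threshold p_c = 1 / (z - 1).

    Simplification: every node, including the root, has z - 1 potential
    children. Returns an array where parent[i] is the parent of node i
    (the root has parent -1). Growth stops at max_nodes nodes.
    """
    p_c = 1.0 / (z - 1)
    parent = [-1]
    frontier = [0]
    while frontier and len(parent) < max_nodes:
        node = frontier.pop()
        n_children = rng.binomial(z - 1, p_c)
        for _ in range(n_children):
            if len(parent) >= max_nodes:
                break
            parent.append(node)
            frontier.append(len(parent) - 1)
    return np.array(parent)


def embed_branching_random_walk(parent, dim, rng):
    """Place each node one random unit step away from its parent."""
    positions = np.zeros((len(parent), dim))
    # Children always have a larger index than their parent, so a single
    # forward pass visits every parent before its children.
    for i in range(1, len(parent)):
        step = rng.standard_normal(dim)
        step /= np.linalg.norm(step)
        positions[i] = positions[parent[i]] + step
    return positions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    parent = sample_tree(z=12, max_nodes=1000, rng=rng)
    X = embed_branching_random_walk(parent, dim=100, rng=rng)
    print("nodes:", len(parent), "embedding shape:", X.shape)
```

Because the branching process is critical, most sampled trees are small and a few are very large, which is the kind of power-law size distribution the data model relies on.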

Usage

A synthetic dataset is generated using PercolationDataset.construct_embed(). Ground-truth latent features for each sample can be generated as a sparse matrix using GroundTruthFeatures.get_latent_features(). Because the dataset generation is stochastic, it is recommended to train and test models using multiple datasets generated with different random seeds. An example script to generate multiple datasets is provided in generate_data.py.

Datasets generated in the format of generate_data.py can be loaded with:

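import numpy as np
from scipy import sparse

# Replace <SIZE>, <DIM>, and <SEED> with the values used to generate the dataset.
res = np.load("percolation_dataset_size<SIZE>_dim<DIM>_seed<SEED>.npz")
X, y = res["X"], res["y"]  # arrays stored under the keys "X" and "y"
# Ground-truth latent features, stored as a SciPy sparse matrix.
features = sparse.load_npz("percolation_dataset_size<SIZE>_dim<DIM>_seed<SEED>_gt_features.npz")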
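Once loaded, the arrays can be subsampled together to form a small training set. The sketch below uses randomly generated stand-in arrays in place of a real dataset, and it assumes that row `i` of the sparse feature matrix describes the same sample as row `i` of `X` and `y`; that layout is an assumption and should be checked against the generated files. The helper `subsample` is hypothetical and not part of this repository.

```python
import numpy as np
from scipy import sparse


def subsample(X, y, features, n_train, seed=0):
    """Draw n_train rows without replacement, keeping all three arrays aligned."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_train, replace=False)
    idx.sort()
    # Convert to CSR so that rows of the sparse matrix can be indexed.
    return X[idx], y[idx], features.tocsr()[idx]


if __name__ == "__main__":
    # Stand-in data with the assumed layout: one row per sample.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 16))
    y = rng.standard_normal(1000)
    features = sparse.random(1000, 50, density=0.05, format="csr", random_state=0)

    X_train, y_train, f_train = subsample(X, y, features, n_train=20)
    print(X_train.shape, y_train.shape, f_train.shape)
```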

Caveats

This repository is under active development and subject to ongoing changes.
Because the data generation and embedding procedures are stochastic, any studies should be repeated using multiple datasets generated using different random seeds.
The embedding procedure relies on the statistical tendency of random vectors to be approximately orthogonal in high dimensions. An embedding dimension of O(100) or greater is recommended to avoid rare discrepancies between the nearest neighbors in the percolation graph structure and embedded data points.
A generated dataset represents a data distribution, i.e. the set of all possible data points that could theoretically be observed. To obtain a realistic analog of a machine learning dataset, only a tiny subset of a generated dataset should be used for training.
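The near-orthogonality of random vectors that the embedding caveat relies on can be checked numerically. The following sketch is independent of this repository's code: it draws random unit vectors and reports the mean absolute cosine similarity between distinct pairs for several dimensions. The function name and the choice of 200 vectors are assumptions made for illustration.

```python
import numpy as np


def mean_abs_cosine(dim, n_vectors=200, seed=0):
    """Mean |cosine similarity| between distinct random unit vectors."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = v @ v.T
    # Exclude the diagonal, where each vector is compared with itself.
    off_diagonal = cos[~np.eye(n_vectors, dtype=bool)]
    return np.abs(off_diagonal).mean()


if __name__ == "__main__":
    for dim in (3, 10, 100, 1000):
        print(f"dim = {dim}: mean |cos| = {mean_abs_cosine(dim):.4f}")
```

The typical overlap shrinks roughly as the inverse square root of the dimension, which is why low embedding dimensions are more likely to distort the neighbor structure of the graph.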

More information

This project is led by Ari Brill. Contact information is available on the author's website.

This research program is part of the PIRAMID Project at Principles of Intelligence.

For more information on the percolation cluster model of data structure, see the papers:

Brill (2024), Neural Scaling Laws Rooted in the Data Distribution

Brill (2025), Representation Learning on a Random Lattice
