Generates statistically self-similar synthetic datasets based on a percolation cluster model.
Ambitious mechanistic interpretability requires understanding the structure that neural networks uncover from data. A quantitative theoretical model of natural data's organizing structure could help AI safety researchers build interpretability tools that decompose neural networks along their natural scales of abstraction. This project works towards this goal by developing a synthetic data model that reproduces qualitative features of natural data. The model is based on high-dimensional percolation theory and describes data distributional structure that is statistically self-similar, sparse, and power-law-distributed.
This repository provides code to generate synthetic datasets based on this data model. In particular, it employs a newly developed algorithm to construct a dataset in a way that explicitly and iteratively reveals its innate hierarchical structure. Increasing the number of data points corresponds to representing the same dataset at a more fine-grained level of abstraction.
The branch of physics concerned with analyzing the properties of clusters of randomly occupied units on a lattice is called percolation theory (Stauffer & Aharony, 1994). In this framework, sites (or bonds) are occupied independently at random with probability
This repository implements an algorithm to simulate a data distribution modeled as a critical percolation cluster distribution on a large high-dimensional lattice, using an explicitly hierarchical approach. The algorithm consists of two stages. First, in the generation stage, a set of percolation clusters is generated. The generation stage produces a set of undirected, treelike graphs representing the clusters, and a forest of binary latent features that denote each point's membership in a cluster or subcluster. These graphs are generated using a cyclic coalescent algorithm that jointly samples a random tree and its hierarchical latent decomposition in almost linear time. Each point has an associated value that is a function of its latent subcluster membership features. Second, in the embedding stage, the graphs are embedded into a vector space following a branching random walk.
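The embedding stage described above can be illustrated with a minimal sketch. This is not the repository's implementation: the function name `embed_tree_branching_walk`, the parent-array tree representation, and the choice of unit-length Gaussian steps are all assumptions made for illustration. Each node is placed at its parent's position plus an independent random step, so that neighbors in the graph are one step apart in the embedding space.

```python
import numpy as np


def embed_tree_branching_walk(parents, dim, seed=0):
    """Embed a rooted tree via a branching random walk (illustrative sketch).

    parents[i] is the index of node i's parent, with parents[0] == -1 for the
    root. Nodes are assumed to be ordered so that a parent precedes its children.
    """
    rng = np.random.default_rng(seed)
    n = len(parents)
    # One independent random direction per node, normalized to unit length.
    steps = rng.standard_normal((n, dim))
    steps /= np.linalg.norm(steps, axis=1, keepdims=True)
    X = np.zeros((n, dim))
    for i in range(1, n):
        # A child sits one random unit step away from its parent.
        X[i] = X[parents[i]] + steps[i]
    return X


# Hypothetical toy tree: node 0 is the root, nodes 1 and 2 are its children,
# and nodes 3 and 4 are children of node 1.
parents = np.array([-1, 0, 0, 1, 1])
X = embed_tree_branching_walk(parents, dim=128)
print(X.shape)
```

In this sketch the root sits at the origin and every edge of the tree has length one in the embedding; the actual step distribution used by the repository may differ.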
A synthetic dataset is generated using PercolationDataset.construct_embed(). Ground-truth latent features for each sample can be generated as a sparse matrix using GroundTruthFeatures.get_latent_features(). Because the dataset generation is stochastic, it is recommended to train and test models using multiple datasets generated with different random seeds. An example script to generate multiple datasets is provided in generate_data.py.
Datasets generated in the format of generate_data.py can be loaded with:
import numpy as np
from scipy import sparse
res = np.load("percolation_dataset_size<SIZE>_dim<DIM>_seed<SEED>.npz")
X, y = res['X'], res['y']
features = sparse.load_npz("percolation_dataset_size<SIZE>_dim<DIM>_seed<SEED>_gt_features.npz")
- This repository is under active development and subject to ongoing changes.
- Because the data generation and embedding procedures are stochastic, any study should be repeated across multiple datasets generated with different random seeds.
- The embedding procedure relies on the statistical tendency of random vectors to be approximately orthogonal in high dimensions. An embedding dimension of O(100) or greater is recommended to avoid rare discrepancies between the nearest neighbors in the percolation graph structure and embedded data points.
- A generated dataset represents a data distribution, i.e. the set of all possible data points that could theoretically be observed. To obtain a realistic analog of a machine learning dataset, only a tiny subset of a generated dataset should be used for training.
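The near-orthogonality property behind the embedding-dimension recommendation above can be checked numerically. The sketch below is independent of the repository's code; the helper name `mean_abs_cosine` and the sample sizes are illustrative assumptions. It draws random Gaussian vectors and measures the typical absolute cosine similarity between distinct pairs, which shrinks roughly as 1/sqrt(d) as the dimension d grows.

```python
import numpy as np


def mean_abs_cosine(dim, n_vectors=200, seed=0):
    """Mean absolute cosine similarity between distinct random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n_vectors, dim))
    # Normalize rows so that dot products are cosine similarities.
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    cos = V @ V.T
    # Keep only distinct pairs (strict upper triangle).
    iu = np.triu_indices(n_vectors, k=1)
    return float(np.abs(cos[iu]).mean())


# Compare a few dimensions; the overlap decreases as the dimension grows.
for dim in (3, 10, 100, 1000):
    print(dim, round(mean_abs_cosine(dim), 3))
```

In dimension 100 the typical overlap is approximately 0.08, so independent random steps are close to orthogonal, whereas in very low dimensions the overlaps are large.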
This project is led by Ari Brill. Contact information is available on his website.
This research program is part of the PIRAMID Project at Principles of Intelligence.
For more information on the percolation cluster model of data structure, see the papers:
Brill (2024), Neural Scaling Laws Rooted in the Data Distribution