ARC Lang is an asynchronous pipeline for tackling Abstraction and Reasoning Corpus (ARC) puzzles with language models. It iteratively prompts models to write instructions, tests those instructions on the training grids, revises the best ideas, and finally applies the strongest instructions to produce two candidate outputs for each test grid.
- Dataset loading – `src/run.py` parses ARC challenge JSON files (see `data/arc-prize-20XX/`). Challenges are processed in batches with a monitored semaphore so multiple tasks can run in parallel without exceeding API limits.
- Instruction generation – For each `Step` in a `RunConfig`, `get_instruction_scores` prompts an LLM (defined by `step.instruction_model`) with the training grids via `src/main.py`. Each response is scored by leave-one-out cross validation: the instructions are applied to every training example using `output_grid_from_instructions`, which is another LLM call that follows the instructions.
- Scoring – `score_instructions_on_challenge` records per-example results, calculates a simple cell-wise similarity score (a sketch of the idea follows this list), writes attempts to Postgres if `NEON_DSN` is set, and keeps the top instructions in memory.
- Revision and pooling – `StepRevision` asks the model to repair its own instructions using a rich feedback prompt that highlights wrong outputs. `StepRevisionPool` synthesizes a new plan from the best previous instructions and their scores. Both feed back into the scoring loop.
- Final predictions – `return_answer` replays the strongest instructions with `final_follow_model` to generate multiple outputs per test grid. The system picks up to two diverse guesses per grid and writes them to `attempts/arc-prize-20XX/...`. If ground-truth solutions are supplied, `evaluate_solutions` computes accuracy; otherwise the guesses are ready for competition submission.
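The "simple cell-wise similarity score" can be read as the fraction of matching cells between the predicted and expected grids. Here is a minimal, purely illustrative sketch of that idea (the real logic lives in `score_instructions_on_challenge` and may differ, e.g. in how shape mismatches are handled):

```python
def cell_similarity(predicted: list[list[int]], expected: list[list[int]]) -> float:
    """Fraction of cells that match; 0.0 when the grid shapes differ.

    Illustration only -- not the exact scoring used by score_instructions_on_challenge.
    """
    if len(predicted) != len(expected) or any(
        len(p_row) != len(e_row) for p_row, e_row in zip(predicted, expected)
    ):
        return 0.0
    total = sum(len(row) for row in expected)
    matches = sum(
        p == e
        for p_row, e_row in zip(predicted, expected)
        for p, e in zip(p_row, e_row)
    )
    return matches / total if total else 0.0
```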
- `src/run.py` – async entry point and orchestration of the entire solve loop.
- `src/main.py` – prompt builders for instruction creation, revision, and grid execution.
- `src/configs/` – ready-to-use `RunConfig` presets (`grok_config_prod`, `mini_config`, `oss_config`, etc.).
- `src/llms/` – provider wrappers and structured output helpers (`get_next_structure`).
- `src/models.py` – Pydantic models for ARC challenges, helper utilities, and visualization support.
- `src/async_utils/semaphore_monitor.py` – concurrency guard that logs semaphore saturation.
- `attempts/` – JSON outputs for submissions plus `temp_solutions/` scratch files written during a run.
- `data/` – ARC datasets (training, evaluation, and ground-truth solutions where available).
- Python 3.12+ (project targets 3.12 via Ruff configuration).
- Access tokens for the model providers you intend to use. The default config uses xAI Grok, but other presets rely on OpenAI, Anthropic, Gemini, DeepSeek, or OpenRouter.
- `MAX_CONCURRENCY` environment variable – required; sets the global API semaphore inside `src/llms/structured.py`.
Install dependencies with either uv or pip:
```bash
uv sync
```

Environment variables are loaded automatically from a `.env` file courtesy of python-dotenv inside `src/logging_config.py`. A typical configuration looks like:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...
GEMINI_API_KEY=...
DEEPSEEK_API_KEY=...
OPENROUTER_API_KEY=...
XAI_API_KEY=key1,key2 # multiple keys allowed for grok
MAX_CONCURRENCY=20
LOGFIRE_API_KEY=... # optional remote logging
LOCAL_LOGS_ONLY=1 # skip sending logs upstream
LOG_LEVEL=INFO
NEON_DSN=postgresql://... # optional; enables result persistence
USE_TASK_ID=0 # set to 1 to send true task ids to logs/LLMs
VIZ=0 # set to 1 to open matplotlib previews during scoring
LOG_GRIDS=0 # set to 1 to log mismatched grids verbosely
```

Only the API keys for the models you select and `MAX_CONCURRENCY` are strictly required. If a variable is missing, the related feature simply falls back (for example, no Postgres writes when `NEON_DSN` is unset).
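The required/optional split follows the usual environment-variable pattern; the sketch below is illustrative of the behaviour described above, not the repo's exact code:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # pulls the .env file into the process environment

# Required: a missing key raises KeyError before any work starts.
MAX_CONCURRENCY = int(os.environ["MAX_CONCURRENCY"])

# Optional: absence simply disables the related feature.
NEON_DSN = os.getenv("NEON_DSN")                 # None -> no Postgres writes
VIZ = os.getenv("VIZ", "0") == "1"               # default off -> no matplotlib previews
LOG_GRIDS = os.getenv("LOG_GRIDS", "0") == "1"   # default off -> quiet grid logging
```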
The main entry point is `src/run.py`. It currently wires up the 2025 evaluation challenges and defaults to `grok_config_prod` with `limit=1` so you can smoke-test the pipeline quickly.
```bash
python src/run.py
```

What happens:
- A run id is generated (`logging_config.generate_run_id`) and stored in log context.
- The evaluation challenges JSON is loaded, optionally filtered by `limit`, `offset`, or `task_ids`.
- Each challenge is solved asynchronously via `solve_challenges`, constrained by `config.max_concurrent_tasks`.
- Intermediate guesses are written to `attempts/arc-prize-2025/` and mirrored under `temp_solutions/` per task.
- If `solutions_path` points to ground truth, final accuracy is printed; otherwise you can submit the attempt JSON directly.
To run on more tasks, adjust the call inside `run()`, for example by setting `limit=None` to sweep the full evaluation set or swapping in another config:
```python
await run_from_json(
    challenges_path=challenges_path,
    truth_solutions_path=None,   # disable automatic scoring
    config=mini_config,          # use the faster preset
    attempts_path=attempts_path,
    temp_attempts_dir=temp_attempts_path,
    limit=None,
    offset=0,
)
```

You can also call `run_from_json` from your own script, supplying custom JSON paths or a bespoke `RunConfig` instance.
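For instance, a standalone driver could look like the sketch below. The import paths and file names are assumptions (check `src/run.py` and your local `data/` layout before using them):

```python
import asyncio

from src.run import run_from_json      # assumed import path
from src.configs import mini_config    # assumed import path

async def main() -> None:
    await run_from_json(
        challenges_path="data/arc-prize-2025/arc-agi_evaluation_challenges.json",  # example file name
        truth_solutions_path=None,
        config=mini_config,
        attempts_path="attempts/arc-prize-2025/arc-agi_evaluation_attempts.json",  # example file name
        temp_attempts_dir="attempts/arc-prize-2025/temp_solutions",
        limit=5,    # solve only the first five tasks
        offset=0,
    )

if __name__ == "__main__":
    asyncio.run(main())
```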
`src/configs/models.py` defines three step types:
- `Step` – generate `times` instruction candidates with `instruction_model`, score them by following the instructions with `follow_model`.
- `StepRevision` – take the top `top_scores_used` candidates, ask the LLM to revise each `times_per_top_score` times, then rescore using the revision's `follow_model`.
- `StepRevisionPool` – build a synthesis prompt that shows multiple instruction sets, including per-example scores, and request a brand-new instruction `times` times.
A `RunConfig` bundles a sequence of these steps plus:
- `final_follow_model` and `final_follow_times` for the last pass when we produce answers for the hidden test grids.
- `max_concurrent_tasks` to bound how many challenges are solved at once.
By editing or creating a `RunConfig` you control which providers are called, whether images or diff notations are included, and how aggressively the system revises its plans. The presets in `src/configs/` illustrate lightweight (`mini_config`), production Grok (`grok_config_prod`), GPT-5 (`gpt_config_prod`), and fully open-source (`oss_config`) strategies.
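A hypothetical custom preset might combine the three step types like this. The field names follow the descriptions above, but the model identifiers and exact constructor signatures are assumptions; treat the presets in `src/configs/` as the authoritative reference:

```python
from src.configs.models import RunConfig, Step, StepRevision, StepRevisionPool  # assumed import path

my_config = RunConfig(
    steps=[
        # Draft 8 instruction candidates and score each on the training grids.
        Step(times=8, instruction_model="grok-4", follow_model="grok-4-mini"),
        # Revise each of the top 3 candidates twice, then rescore the revisions.
        StepRevision(top_scores_used=3, times_per_top_score=2, follow_model="grok-4-mini"),
        # Synthesize 2 fresh instruction sets from the best scored attempts so far.
        StepRevisionPool(times=2, instruction_model="grok-4", follow_model="grok-4-mini"),
    ],
    final_follow_model="grok-4",
    final_follow_times=4,
    max_concurrent_tasks=5,
)
```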
- Attempt JSON: `attempts/arc-prize-20XX/arc-agi_<split>_attempts.json` contains both guesses per task in the format expected by ARC competition submissions (a quick way to inspect it follows this list).
- Checkpoint files: `attempts/.../temp_solutions/<task_id>.json` holds the current task's guesses so you can inspect intermediate results.
- Optional database writes: when `NEON_DSN` is present, each `InstructionsScore` and final `Guess` is inserted into Postgres for analysis.
- Local logs: `logs/arc.log` receives structured spans and key-value metadata. Remote Logfire emission is enabled unless `LOCAL_LOGS_ONLY=1`.
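To spot-check an attempts file without relying on its exact internal layout, a short snippet like this is enough (the path is just an example):

```python
import json
from pathlib import Path

attempts_path = Path("attempts/arc-prize-2025/arc-agi_evaluation_attempts.json")  # example path
attempts = json.loads(attempts_path.read_text())

# The file is keyed by task id in the ARC submission format; print a quick summary.
for task_id, guesses in list(attempts.items())[:3]:
    print(task_id, len(guesses), "test grid(s)")
```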
- Set `VIZ=1` to open matplotlib comparisons whenever a training grid prediction differs from the target (requires a display or X forwarding).
- Toggle `LOG_GRIDS=1` to dump expected vs. actual grids in logs. `generate_grid_diff` (exposed in `src/run.py`) can be reused in notebooks to produce ASCII diffs of grid pairs; a standalone sketch of the idea follows this list.
- Use the `attempts/.../temp_solutions` files to rapidly inspect what the model produced for each test grid.
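If you just want a quick diff in a notebook, a rough equivalent of the idea (not the repo's `generate_grid_diff`) could be:

```python
def ascii_grid_diff(expected: list[list[int]], actual: list[list[int]]) -> str:
    """Render two same-shaped grids side by side, flagging mismatched cells with '*'.

    Standalone illustration -- not the generate_grid_diff exposed by src/run.py.
    """
    lines = []
    for e_row, a_row in zip(expected, actual):
        marks = " ".join("*" if e != a else "." for e, a in zip(e_row, a_row))
        lines.append(
            f"{' '.join(map(str, e_row))} | {' '.join(map(str, a_row))} | {marks}"
        )
    return "\n".join(lines)

print(ascii_grid_diff([[1, 2], [3, 4]], [[1, 0], [3, 4]]))
```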
- Ruff and mypy configs live in `pyproject.toml`. Run `uv run ruff check src` or `uv run mypy src` to lint/type-check.
- The project is asyncio-first; if you extend it, prefer async functions and reuse `MonitoredSemaphore` to avoid overload (see the sketch after this list).
- `src/llms/structured.py` centralizes provider-specific settings (retries, pricing metadata, structured output formats). Extend this module if you add a new model family.
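A minimal sketch of that pattern, assuming `MonitoredSemaphore` exposes the same async context-manager interface as `asyncio.Semaphore` (check `src/async_utils/semaphore_monitor.py` for its real constructor):

```python
import asyncio

from src.async_utils.semaphore_monitor import MonitoredSemaphore  # assumed import path

# Assumption: behaves like asyncio.Semaphore but logs saturation.
semaphore = MonitoredSemaphore(5)

async def call_provider(prompt: str) -> str:
    async with semaphore:          # keep concurrent provider calls bounded
        await asyncio.sleep(0.1)   # stand-in for a real API request
        return f"response to {prompt!r}"

async def main() -> None:
    results = await asyncio.gather(*(call_provider(f"task {i}") for i in range(20)))
    print(len(results), "responses")

if __name__ == "__main__":
    asyncio.run(main())
```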
- Missing `MAX_CONCURRENCY` will raise a `KeyError` on import; define it before running.
- Authentication errors typically surface inside `get_next_structure`; double-check API keys and provider quotas.
- If you see repeated `retry_failed` logs, the provider may be rate-limiting you; lower `max_concurrent_tasks` or `MAX_CONCURRENCY`.
- When running on headless servers, keep `VIZ=0` to avoid matplotlib backend errors.
With the README and code as reference, you can adapt ARC Lang to new model providers, tweak scoring heuristics, or plug in alternative instruction synthesis strategies.