Data Workflows

Processing workflows for producing cloud-native geospatial datasets on the NRP Nautilus Kubernetes cluster.

Published Datasets

Browse the full catalog in STAC Browser:

radiantearth.github.io/stac-browser → Boettiger Lab Datasets

Datasets are hosted on NRP Nautilus S3 storage (s3-west.nrp-nautilus.io).

This repo contains no data-processing code, just configuration (k8s YAML), documentation (STAC metadata), and instructions. All processing is done by the cng-datasets CLI tool running inside Kubernetes pods.

How It Works

1. You run cng-datasets workflow on your laptop; it generates Kubernetes Job YAML files.
2. You kubectl apply those files; the cluster does all the processing.
3. Outputs land on S3: GeoParquet, PMTiles, and H3-indexed hex parquet.

You never process data locally. Your laptop just generates YAML and talks to kubectl.

Quick Start

# Install the CLI (one-time)
pip install cng-datasets

# Generate a processing pipeline for a dataset
cng-datasets workflow \
  --dataset my-dataset \
  --source-url https://example.com/data.gdb \
  --bucket public-mydata \
  --layer MyLayer \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --hex-memory 32Gi \
  --max-completions 200 \
  --max-parallelism 50 \
  --output-dir catalog/mydata/k8s/mylayer

# One-time RBAC setup (only needed once per cluster/namespace, likely already done)
kubectl apply -f catalog/mydata/k8s/mylayer/workflow-rbac.yaml

# Apply workflow (per dataset)
kubectl apply -f catalog/mydata/k8s/mylayer/configmap.yaml \
              -f catalog/mydata/k8s/mylayer/workflow.yaml

# Monitor
kubectl get jobs | grep my-dataset
kubectl logs job/my-dataset-workflow

That's it. The workflow orchestrates: bucket setup → convert to GeoParquet → PMTiles + H3 hex (parallel) → repartition.
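
The stage ordering above can be pictured with a small local sketch. This is only an illustration of the dependency order, not how the workflow is implemented: the real stages run as Kubernetes Jobs, and the stage names and the stage helper below are hypothetical stubs that just print.

```shell
#!/bin/sh
# Illustration only: each stage is a stub that prints its (hypothetical) name.
# In the real workflow these stages are Kubernetes Jobs, not local functions.
stage() { echo "$1"; }

run_pipeline() {
  stage "bucket-setup"          # 1. bucket setup
  stage "convert-geoparquet"    # 2. convert the source to GeoParquet
  stage "pmtiles" &             # 3a. PMTiles ...
  stage "h3-hex" &              # 3b. ... and H3 hex run side by side
  wait                          # block until both parallel stages finish
  stage "repartition"           # 4. repartition, only after both are done
}

run_pipeline
```

The two middle stages may print in either order; the first two lines and the last line are always the same.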

Detailed Instructions

AGENTS.md — Complete step-by-step guide for processing datasets (for humans and LLM agents)
DATASET_DOCUMENTATION_WORKFLOW.md — How to create README and STAC metadata after processing
todo.md — Tracking status of all datasets

Repository Structure

catalog/
  <dataset>/
    k8s/           # Generated Kubernetes job YAML
    stac/          # README.md and stac-collection.json for the dataset
    *.ipynb        # Any exploratory notebooks (optional)

Each dataset gets a directory under catalog/. The k8s YAML is generated by cng-datasets workflow and applied with kubectl. STAC metadata is created after processing completes.
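
If you want the directory skeleton in place before generating anything, a minimal sketch follows. The scaffold_dataset helper is hypothetical and not part of cng-datasets, and cng-datasets workflow may well create its own --output-dir, so treat this as optional.

```shell
# Hypothetical helper: create the per-dataset directories described above.
# $1 = dataset name, $2 = repository root (defaults to the current directory).
scaffold_dataset() {
  dataset="$1"
  root="${2:-.}"
  mkdir -p "$root/catalog/$dataset/k8s" "$root/catalog/$dataset/stac"
}
```

For example, scaffold_dataset mydata run from the repository root creates catalog/mydata/k8s and catalog/mydata/stac.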

CLI Reference

See the cng-datasets README for full CLI documentation.

Key commands:

Command	What it does	Where it runs
`cng-datasets workflow`	Generates k8s job YAML	Your laptop
`kubectl apply -f ...`	Submits jobs to the cluster	Your laptop
`kubectl get jobs`	Monitors job status	Your laptop
Everything else	Processing, S3 uploads, etc.	Kubernetes pods

Sync, Backup & Public Mirror

NRP S3 is canonical. Two off-NRP destinations are kept up to date by k8s Jobs under catalog/sync/:

Destination	Purpose	Scope	Mechanism
MinIO (`minio.carlboettiger.info`)	private backup of every public bucket (and the only off-NRP copy for license-restricted data)	all `public-*`	per-bucket Jobs `catalog/sync/k8s/sync-public-*.yaml`
Source Cooperative (`source.coop`)	public mirror for discoverability	catalogued and license-clear datasets only	weekly `source-sync` CronJob (+ per-bucket `source-sync-*.yaml` for backfill)

# MinIO backup (private)
kubectl apply -f catalog/sync/k8s/sync-public-census.yaml      # one bucket
kubectl apply -f catalog/sync/k8s/                             # all

# source.coop public mirror — see the campaign docs first (scope is license-gated)
catalog/sync/source-coop/dry-run-local.sh census               # preview (no writes)
catalog/sync/source-coop/run-source-sync.sh census             # one repo (manual/backfill)
kubectl -n biodiversity create job --from=cronjob/source-sync source-sync-manual  # run the weekly job now

kubectl get jobs | grep -E 'sync-public|source-sync'           # monitor

Each job runs rclone sync with bandwidth throttling. When adding a new NRP bucket, add a MinIO sync job (see AGENTS.md Step 7).
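
As a rough picture of what such a job runs, the sketch below prints (but does not execute) an rclone command for one bucket. It assumes the remotes are named nrp and minio, that the MinIO bucket has the same name as the NRP bucket, and that 10M is a placeholder bandwidth cap; the actual job YAML may use different flags and values. The minio_sync_cmd helper is hypothetical.

```shell
# Sketch: print (not run) an rclone command for backing up one bucket.
# Assumptions: remotes "nrp" and "minio", identical bucket names on both
# sides, and a placeholder bandwidth limit of 10M unless one is given.
minio_sync_cmd() {
  bucket="$1"
  bwlimit="${2:-10M}"
  echo "rclone sync nrp:$bucket minio:$bucket --bwlimit $bwlimit"
}

minio_sync_cmd public-census
```

Appending rclone's --dry-run flag to the printed command previews a sync without writing anything.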

source.coop is a standing automated mirror, not a one-shot. A weekly source-sync CronJob (Sundays 08:00 UTC, catalog/sync/k8s/source-sync-cron.yaml) re-syncs every in-scope repo and then re-applies the Phase 2 STAC href rewrite. That rewrite (rewrite-stac-hrefs.py) repoints mirrored STAC hrefs to data.source.coop/…, and it has to be re-applied because a plain sync would clobber it. The CronJob does not auto-discover new NRP buckets: its scope is the generated source-sync-scope ConfigMap. To add a bucket, follow the add-a-repo loop in the campaign README, which includes manually creating the cboettig/<repo> product in the source.coop web UI first (the create API is disabled). A repo added to scope before its product exists fails its own sync visibly (continue-on-error) rather than auto-creating anything.

source.coop mirror — read before touching it: catalog/sync/source-coop/README.md is the campaign plan (scope policy, the add-a-repo loop, the weekly CronJob, account-wide-credentials safety, the Phase 2 STAC rewrite). Scope is the REPOS/MODE/EXCLUDES arrays in gen-source-sync.sh; per-collection license verdicts are in license-inventory.md. Some datasets cannot be mirrored (license forbids redistribution — WDPA/IUCN/ICCA/HydroBASINS) and stay MinIO-only. Tracking + status: issue #158.

Infrastructure

Cluster: NRP Nautilus, namespace biodiversity
S3: Ceph object storage (S3-compatible, not AWS)
Public endpoint: https://s3-west.nrp-nautilus.io/<bucket>/<path>
MinIO backup: minio.carlboettiger.info (private backup; synced via catalog/sync/k8s/sync-public-*.yaml)
Source Cooperative: us-west-2.opendata.source.coop/cboettig/<repo> (public mirror, license-gated; see catalog/sync/source-coop/)
Secrets: aws and rclone-config are pre-configured in the namespace (rclone-config has nrp, minio, and source remotes)

See .github/copilot-instructions.md for detailed infrastructure context.
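
The public endpoint pattern above can be turned into a small helper for building object URLs. This is a minimal sketch: nrp_public_url is a hypothetical function, and the bucket and object path in the example are placeholders, not real published files.

```shell
# Build the public HTTPS URL for an object, following the pattern
# https://s3-west.nrp-nautilus.io/<bucket>/<path>.
nrp_public_url() {
  bucket="$1"
  path="${2#/}"   # drop one leading slash so the URL has no "//"
  echo "https://s3-west.nrp-nautilus.io/$bucket/$path"
}

# Placeholder bucket and object path, for illustration only.
nrp_public_url public-mydata mylayer/example.parquet
```

Passing the result to curl -I is a quick way to check that an object is publicly reachable.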
