Processing workflows for producing cloud-native geospatial datasets on the NRP Nautilus Kubernetes cluster.
Browse the full catalog in STAC Browser:
radiantearth.github.io/stac-browser → Boettiger Lab Datasets
Datasets are hosted on NRP Nautilus S3 storage (s3-west.nrp-nautilus.io).
This repo contains no code — just configuration (k8s YAML), documentation (STAC metadata), and instructions. All processing is done by the cng-datasets CLI tool running inside Kubernetes pods.
- You run `cng-datasets workflow` on your laptop; it generates Kubernetes Job YAML files
- You `kubectl apply` those files; the cluster does all the processing
- Outputs land on S3: GeoParquet, PMTiles, and H3-indexed hex parquet
You never process data locally. Your laptop just generates YAML and talks to kubectl.
# Install the CLI (one-time)
pip install cng-datasets
# Generate a processing pipeline for a dataset
cng-datasets workflow \
--dataset my-dataset \
--source-url https://example.com/data.gdb \
--bucket public-mydata \
--layer MyLayer \
--h3-resolution 10 \
--parent-resolutions "9,8,0" \
--hex-memory 32Gi \
--max-completions 200 \
--max-parallelism 50 \
--output-dir catalog/mydata/k8s/mylayer
# One-time RBAC setup (only needed once per cluster/namespace, likely already done)
kubectl apply -f catalog/mydata/k8s/mylayer/workflow-rbac.yaml
# Apply workflow (per dataset)
kubectl apply -f catalog/mydata/k8s/mylayer/configmap.yaml \
-f catalog/mydata/k8s/mylayer/workflow.yaml
# Monitor
kubectl get jobs | grep my-dataset
kubectl logs job/my-dataset-workflow
That's it. The workflow orchestrates: bucket setup → convert to GeoParquet → PMTiles + H3 hex (parallel) → repartition.
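The monitoring step can also block until the workflow finishes instead of polling. A minimal sketch, assuming the workflow Job is named `my-dataset-workflow` (as in the log command above) and lives in your current namespace. The helper name `wait_for_workflow`, the `KUBECTL` override, and the 6h default timeout are illustrative choices, not part of cng-datasets:

```shell
# Hypothetical helper: block until a Job completes, then print its logs.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

wait_for_workflow() {
  job="$1"
  timeout="${2:-6h}"   # assumed default; tune to the dataset size
  # `kubectl wait` exits non-zero if the condition is not met in time,
  # which also happens when the Job fails instead of completing.
  "$KUBECTL" wait --for=condition=complete "job/${job}" --timeout="${timeout}" &&
    "$KUBECTL" logs "job/${job}"
}

# Usage:
#   wait_for_workflow my-dataset-workflow
```

Because a failed Job never reaches the `complete` condition, a non-zero exit here means "check `kubectl get jobs` and the pod logs", not necessarily that the timeout was too short.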
- AGENTS.md — Complete step-by-step guide for processing datasets (for humans and LLM agents)
- DATASET_DOCUMENTATION_WORKFLOW.md — How to create README and STAC metadata after processing
- todo.md — Tracking status of all datasets
catalog/
  <dataset>/
    k8s/       # Generated Kubernetes job YAML
    stac/      # README.md and stac-collection.json for the dataset
    *.ipynb    # Any exploratory notebooks (optional)
Each dataset gets a directory under catalog/. The k8s YAML is generated by cng-datasets workflow and applied with kubectl. STAC metadata is created after processing completes.
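If you want the directories in place before generating anything, the layout can be created by hand. A small sketch, assuming you run it from the repository root. The helper name `scaffold_dataset` and the dataset name are hypothetical, and `cng-datasets workflow` may well create its `--output-dir` on its own:

```shell
# Hypothetical helper: create the per-dataset layout described above.
scaffold_dataset() {
  dataset="$1"
  # -p creates parent directories and is a no-op if they already exist.
  mkdir -p "catalog/${dataset}/k8s" "catalog/${dataset}/stac"
}

# Usage, from the repository root:
#   scaffold_dataset my-dataset
```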
See the cng-datasets README for full CLI documentation.
Key commands:
| Command | What it does | Where it runs |
|---|---|---|
| `cng-datasets workflow` | Generates k8s job YAML | Your laptop |
| `kubectl apply -f ...` | Submits jobs to the cluster | Your laptop |
| `kubectl get jobs` | Monitors job status | Your laptop |
| Everything else | Processing, S3 uploads, etc. | Kubernetes pods |
NRP S3 is the canonical copy. Data is also synced to two off-NRP destinations by k8s Jobs under catalog/sync/:
| Destination | Purpose | Scope | Mechanism |
|---|---|---|---|
| MinIO (minio.carlboettiger.info) | private backup of every public bucket (and the only off-NRP copy for license-restricted data) | all `public-*` | per-bucket Jobs `catalog/sync/k8s/sync-public-*.yaml` |
| Source Cooperative (source.coop) | public mirror for discoverability | catalogued and license-clear datasets only | weekly `source-sync` CronJob (+ per-bucket `source-sync-*.yaml` for backfill) |
# MinIO backup (private)
kubectl apply -f catalog/sync/k8s/sync-public-census.yaml # one bucket
kubectl apply -f catalog/sync/k8s/ # all
# source.coop public mirror — see the campaign docs first (scope is license-gated)
catalog/sync/source-coop/dry-run-local.sh census # preview (no writes)
catalog/sync/source-coop/run-source-sync.sh census # one repo (manual/backfill)
kubectl -n biodiversity create job --from=cronjob/source-sync source-sync-manual # run the weekly job now
kubectl get jobs | grep -E 'sync-public|source-sync' # monitor
Each job runs rclone sync with bandwidth throttling. When adding a new NRP bucket, add a MinIO sync job (see AGENTS.md Step 7).
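For orientation, the core of such a sync job is a single rclone invocation. The sketch below only illustrates what a pod might run. It assumes the `nrp` and `minio` remotes from the `rclone-config` secret and the same bucket name on both sides. The `10M` bandwidth limit, the helper name, and the `RCLONE` override are placeholders, and the Job manifests under catalog/sync/k8s/ remain the authority on the real flags:

```shell
# RCLONE can be overridden (e.g. RCLONE=echo) to print the command instead.
RCLONE="${RCLONE:-rclone}"

# Hypothetical helper: mirror one NRP bucket to the MinIO backup.
sync_bucket_to_minio() {
  bucket="$1"
  shift
  # --bwlimit caps transfer bandwidth; 10M is an assumed value.
  # Extra arguments (e.g. --dry-run) are passed through to rclone.
  "$RCLONE" sync "nrp:${bucket}" "minio:${bucket}" --bwlimit 10M "$@"
}

# Usage (preview only, no writes):
#   sync_bucket_to_minio public-census --dry-run
```

Note that `rclone sync` makes the destination match the source, including deletions, so a `--dry-run` pass is a cheap safeguard before any real run.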
source.coop is a standing automated mirror, not a one-shot. A weekly source-sync CronJob (Sundays 08:00 UTC, catalog/sync/k8s/source-sync-cron.yaml) re-syncs every in-scope repo and then re-applies the Phase 2 STAC href-rewrite (rewrite-stac-hrefs.py, which repoints mirrored STAC hrefs to data.source.coop/…; a plain sync would clobber it). It does not auto-discover new NRP buckets — scope is the generated source-sync-scope ConfigMap. Adding a bucket means following the add-a-repo loop in the campaign README, including manually creating the cboettig/<repo> product in the source.coop web UI first (the create API is disabled); a repo added to scope before its product exists fails its own sync visibly (continue-on-error) rather than auto-creating anything.
source.coop mirror — read before touching it: catalog/sync/source-coop/README.md is the campaign plan (scope policy, the add-a-repo loop, the weekly CronJob, account-wide-credentials safety, the Phase 2 STAC rewrite). Scope is the REPOS/MODE/EXCLUDES arrays in gen-source-sync.sh; per-collection license verdicts are in license-inventory.md. Some datasets cannot be mirrored (license forbids redistribution — WDPA/IUCN/ICCA/HydroBASINS) and stay MinIO-only. Tracking + status: issue #158.
- Cluster: NRP Nautilus, namespace `biodiversity`
- S3: Ceph object storage (S3-compatible, not AWS)
- Public endpoint: `https://s3-west.nrp-nautilus.io/<bucket>/<path>`
- MinIO backup: `minio.carlboettiger.info` (private backup; synced via `catalog/sync/k8s/sync-public-*.yaml`)
- Source Cooperative: `us-west-2.opendata.source.coop/cboettig/<repo>` (public mirror, license-gated; see `catalog/sync/source-coop/`)
- Secrets: `aws` and `rclone-config` are pre-configured in the namespace (`rclone-config` has `nrp`, `minio`, and `source` remotes)
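The public endpoint pattern above maps a bucket and object path directly to an HTTPS URL. A minimal sketch of building one; the helper name `public_url` and the bucket and object names in the usage lines are placeholders, not real datasets:

```shell
# Base URL of the public endpoint, taken from the list above.
NRP_S3_ENDPOINT="https://s3-west.nrp-nautilus.io"

# Hypothetical helper: print the public HTTPS URL for an object.
public_url() {
  bucket="$1"
  key="$2"
  # ${key#/} drops one leading slash so the URL has no double slash.
  printf '%s/%s/%s\n' "$NRP_S3_ENDPOINT" "$bucket" "${key#/}"
}

# Usage (placeholder bucket and object):
#   public_url public-mydata mylayer.parquet
#   curl -sI "$(public_url public-mydata mylayer.parquet)"   # HEAD request to check the object
```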
See .github/copilot-instructions.md for detailed infrastructure context.