boettiger-lab/data-workflows

Data Workflows

Processing workflows for producing cloud-native geospatial datasets on the NRP Nautilus Kubernetes cluster.

Published Datasets

Browse the full catalog in STAC Browser:

radiantearth.github.io/stac-browser → Boettiger Lab Datasets

Datasets are hosted on NRP Nautilus S3 storage (s3-west.nrp-nautilus.io).

This repo contains no data-processing code, only configuration (k8s YAML), documentation (STAC metadata), a few sync helper scripts, and instructions. All processing is done by the cng-datasets CLI tool running inside Kubernetes pods.

How It Works

  1. You run cng-datasets workflow on your laptop — it generates Kubernetes Job YAML files
  2. You kubectl apply those files — the cluster does all the processing
  3. Outputs land on S3: GeoParquet, PMTiles, and H3-indexed hex parquet

You never process data locally. Your laptop just generates YAML and talks to kubectl.
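As a rough illustration of how little the laptop does, the sketch below only assembles the command line for step 1; it processes no data. The helper name `workflow_argv` is hypothetical and not part of cng-datasets. It simply turns a dict of options into `--flag value` pairs, so it makes no claim about which flags the CLI requires.

```python
def workflow_argv(options):
    """Build the argument list for `cng-datasets workflow`.

    `options` maps flag names (without the leading dashes) to values.
    Hypothetical helper for illustration; it does not validate flags.
    """
    argv = ["cng-datasets", "workflow"]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


if __name__ == "__main__":
    # Values copied from the Quick Start example in this README.
    print(workflow_argv({
        "dataset": "my-dataset",
        "bucket": "public-mydata",
        "h3-resolution": 10,
        "output-dir": "catalog/mydata/k8s/mylayer",
    }))
```

The resulting list could be handed to `subprocess.run(argv, check=True)`; the generated YAML is then submitted with kubectl as in step 2.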

Quick Start

# Install the CLI (one-time)
pip install cng-datasets

# Generate a processing pipeline for a dataset
cng-datasets workflow \
  --dataset my-dataset \
  --source-url https://example.com/data.gdb \
  --bucket public-mydata \
  --layer MyLayer \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --hex-memory 32Gi \
  --max-completions 200 \
  --max-parallelism 50 \
  --output-dir catalog/mydata/k8s/mylayer

# One-time RBAC setup (only needed once per cluster/namespace, likely already done)
kubectl apply -f catalog/mydata/k8s/mylayer/workflow-rbac.yaml

# Apply workflow (per dataset)
kubectl apply -f catalog/mydata/k8s/mylayer/configmap.yaml \
              -f catalog/mydata/k8s/mylayer/workflow.yaml

# Monitor
kubectl get jobs | grep my-dataset
kubectl logs job/my-dataset-workflow

That's it. The workflow orchestrates: bucket setup → convert to GeoParquet → PMTiles + H3 hex (parallel) → repartition.
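The stage ordering above can be pictured as a small dependency graph. The sketch below is only a mental model, not how cng-datasets implements orchestration: the stage names are hypothetical, and it assumes repartition waits for both parallel stages, which is one literal reading of the arrow diagram.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
# Stage names and the repartition dependencies are assumptions.
STAGES = {
    "setup-bucket": set(),
    "convert": {"setup-bucket"},
    "pmtiles": {"convert"},
    "hex": {"convert"},
    "repartition": {"pmtiles", "hex"},
}


def execution_waves(stages):
    """Group stages into waves; stages in one wave can run in parallel."""
    sorter = TopologicalSorter(stages)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for wave in execution_waves(STAGES):
        print(" + ".join(wave))
```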

Detailed Instructions

Repository Structure

catalog/
  <dataset>/
    k8s/           # Generated Kubernetes job YAML
    stac/          # README.md and stac-collection.json for the dataset
    *.ipynb        # Any exploratory notebooks (optional)

Each dataset gets a directory under catalog/. The k8s YAML is generated by cng-datasets workflow and applied with kubectl. STAC metadata is created after processing completes.
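A quick way to check that a dataset directory follows this layout is sketched below. The helper `missing_entries` is hypothetical and not part of this repo; it only tests for the k8s/ directory and the two STAC files named above, and treats notebooks as optional.

```python
from pathlib import Path

# Paths expected inside catalog/<dataset>/, per the layout above.
EXPECTED = ("k8s", "stac/README.md", "stac/stac-collection.json")


def missing_entries(dataset_dir):
    """Return the expected relative paths that do not exist yet."""
    root = Path(dataset_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # "catalog/mydata" is the example dataset directory from Quick Start.
    print(missing_entries("catalog/mydata"))
```

Because STAC metadata is written after processing completes, a freshly generated dataset would be expected to report the two stac/ entries as missing.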

CLI Reference

See the cng-datasets README for full CLI documentation.

Key commands:

| Command | What it does | Where it runs |
| --- | --- | --- |
| cng-datasets workflow | Generates k8s job YAML | Your laptop |
| kubectl apply -f ... | Submits jobs to the cluster | Your laptop |
| kubectl get jobs | Monitors job status | Your laptop |
| Everything else | Processing, S3 uploads, etc. | Kubernetes pods |

Sync, Backup & Public Mirror

NRP S3 is canonical. Data is also copied to two off-NRP destinations by k8s Jobs under catalog/sync/:

| Destination | Purpose | Scope | Mechanism |
| --- | --- | --- | --- |
| MinIO (minio.carlboettiger.info) | private backup of every public bucket (and the only off-NRP copy for license-restricted data) | all public-* | per-bucket Jobs catalog/sync/k8s/sync-public-*.yaml |
| Source Cooperative (source.coop) | public mirror for discoverability | catalogued and license-clear datasets only | weekly source-sync CronJob (+ per-bucket source-sync-*.yaml for backfill) |

# MinIO backup (private)
kubectl apply -f catalog/sync/k8s/sync-public-census.yaml      # one bucket
kubectl apply -f catalog/sync/k8s/                             # all

# source.coop public mirror — see the campaign docs first (scope is license-gated)
catalog/sync/source-coop/dry-run-local.sh census               # preview (no writes)
catalog/sync/source-coop/run-source-sync.sh census             # one repo (manual/backfill)
kubectl -n biodiversity create job --from=cronjob/source-sync source-sync-manual  # run the weekly job now

kubectl get jobs | grep -E 'sync-public|source-sync'           # monitor

Each job runs rclone sync with bandwidth throttling. When adding a new NRP bucket, add a MinIO sync job (see AGENTS.md Step 7).
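For orientation, a throttled rclone sync between two remotes has roughly the shape built below. This is a sketch under stated assumptions, not the contents of the actual Job YAML: the helper name is hypothetical, the destination bucket is assumed to have the same name as the source, and the `10M` limit is a placeholder. The `nrp` and `minio` remote names come from the rclone-config secret listed under Infrastructure, and `--bwlimit` is rclone's bandwidth-limit flag.

```python
def rclone_sync_argv(bucket, bwlimit="10M"):
    """Build an `rclone sync` command line for one public bucket.

    Assumptions: same bucket name on both remotes; bwlimit value is
    a placeholder, not the value used by the real sync Jobs.
    """
    return [
        "rclone", "sync",
        f"nrp:{bucket}",    # source: canonical NRP S3
        f"minio:{bucket}",  # destination: MinIO backup
        "--bwlimit", bwlimit,
    ]


if __name__ == "__main__":
    print(" ".join(rclone_sync_argv("public-census")))
```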

source.coop is a standing automated mirror, not a one-shot. A weekly source-sync CronJob (Sundays 08:00 UTC, catalog/sync/k8s/source-sync-cron.yaml) re-syncs every in-scope repo and then re-applies the Phase 2 STAC href-rewrite (rewrite-stac-hrefs.py, which repoints mirrored STAC hrefs to data.source.coop/…; a plain sync would clobber it). It does not auto-discover new NRP buckets — scope is the generated source-sync-scope ConfigMap. Adding a bucket means following the add-a-repo loop in the campaign README, including manually creating the cboettig/<repo> product in the source.coop web UI first (the create API is disabled); a repo added to scope before its product exists fails its own sync visibly (continue-on-error) rather than auto-creating anything.
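The continue-on-error behaviour described above can be sketched as a loop that records each failure and moves on to the next repo. This is an illustration only: `sync_in_scope` and the `sync_one` callback are hypothetical names, and the real CronJob's script may be structured differently.

```python
def sync_in_scope(repos, sync_one):
    """Sync every repo in scope, continuing past individual failures.

    `sync_one(repo)` is a caller-supplied function that raises on error.
    Returns a dict mapping each failed repo to its error message, so a
    repo whose product does not exist yet fails visibly without blocking
    the others and without anything being auto-created.
    """
    failed = {}
    for repo in repos:
        try:
            sync_one(repo)
        except Exception as err:
            failed[repo] = str(err)
    return failed


if __name__ == "__main__":
    def fake_sync(repo):
        # Stand-in for a real sync; "new-repo" has no product yet.
        if repo == "new-repo":
            raise RuntimeError("product does not exist")

    print(sync_in_scope(["census", "new-repo"], fake_sync))
```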

source.coop mirror — read before touching it: catalog/sync/source-coop/README.md is the campaign plan (scope policy, the add-a-repo loop, the weekly CronJob, account-wide-credentials safety, the Phase 2 STAC rewrite). Scope is the REPOS/MODE/EXCLUDES arrays in gen-source-sync.sh; per-collection license verdicts are in license-inventory.md. Some datasets cannot be mirrored (license forbids redistribution — WDPA/IUCN/ICCA/HydroBASINS) and stay MinIO-only. Tracking + status: issue #158.

Infrastructure

  • Cluster: NRP Nautilus, namespace biodiversity
  • S3: Ceph object storage (S3-compatible, not AWS)
  • Public endpoint: https://s3-west.nrp-nautilus.io/<bucket>/<path>
  • MinIO backup: minio.carlboettiger.info (private backup; synced via catalog/sync/k8s/sync-public-*.yaml)
  • Source Cooperative: us-west-2.opendata.source.coop/cboettig/<repo> (public mirror, license-gated; see catalog/sync/source-coop/)
  • Secrets: aws and rclone-config are pre-configured in the namespace (rclone-config has nrp, minio, and source remotes)

See .github/copilot-instructions.md for detailed infrastructure context.
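Since the public endpoint follows the fixed pattern shown above, an object's URL can be derived from its bucket and key. The helper below is a minimal sketch; `public_url` is a hypothetical name, and the object key in the example is made up for illustration.

```python
from urllib.parse import quote

PUBLIC_ENDPOINT = "https://s3-west.nrp-nautilus.io"


def public_url(bucket, path):
    """Return the public HTTPS URL for an object in an NRP bucket.

    Leading slashes on `path` are dropped and unsafe characters are
    percent-encoded; "/" separators are kept as-is.
    """
    return f"{PUBLIC_ENDPOINT}/{quote(bucket)}/{quote(path.lstrip('/'))}"


if __name__ == "__main__":
    # Hypothetical object key inside the Quick Start example bucket.
    print(public_url("public-mydata", "mylayer/data.parquet"))
```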
