Per-service healthcheck for scheduler + workers (Dockerfile HEALTHCHECK is webserver-shaped) #706

@lbedner

Problem

The template's Dockerfile ships a single HEALTHCHECK that runs curl -f http://localhost:$PORT/health. That probe only succeeds for the webserver container, because only the webserver binds the port. Every other service started from the same image — scheduler, worker-system, any future TaskIQ workers — fails the probe by construction and ends up flagged unhealthy in docker compose ps even when working fine.

Symptoms reported downstream (aegis-pulse):

NAME                          STATUS
aegis-pulse-scheduler-1       Up 30 minutes (unhealthy)
aegis-pulse-worker-system-1   Up 31 minutes (unhealthy)

Container inspect shows the curl probe failing with Could not connect to server — exactly what you'd expect for a non-HTTP service.

The noisy unhealthy flag is the visible problem. The deeper one is that any future deploy gate, restart-on-unhealthy policy, or monitoring built on docker compose ps gets false alarms from these containers and can't tell a real outage from a spurious flag.

Workaround in aegis-pulse

docker-compose.yml overrides for scheduler + worker-system services:

healthcheck:
  disable: true

Silences the false flag but loses container-level liveness signal entirely — scheduler/worker death is only noticed via missed cron output or queue depth.

Proposed fix (in aegis-stack template)

Per-service healthcheck overrides in the template's docker-compose.yml. Three options, ordered from most to least conventional in the community:

  1. Tiny in-process HTTP /health endpoint (most idiomatic) — wrap APScheduler + TaskIQ broker in a small ASGI app on 127.0.0.1. Returns 200 if scheduler.running == True (APScheduler's canonical liveness property; see flask-apscheduler). Catches deadlocked event loops because a stuck loop also stalls the HTTP handler. Marginal cost: ~12 MB if using Starlette+uvicorn, ~1 MB with stdlib asyncio.start_server.
  2. Heartbeat-file approach — service writes /tmp/.alive periodically, healthcheck reads mtime. Works regardless of jobstore. ~10 LOC per service. Less idiomatic than HTTP.
  3. Disable per-service — what aegis-pulse does today. Honest, simple, no monitoring signal.
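For a sense of scale, option 2 could look roughly like the sketch below. It is only an illustration under assumptions: /tmp/.alive is the path named above, while MAX_AGE_SECONDS and the function names touch_heartbeat, is_alive, and probe_exit_code are hypothetical and not part of the template.

```python
import time
from pathlib import Path
from typing import Optional

HEARTBEAT_PATH = Path("/tmp/.alive")  # path named in the proposal
MAX_AGE_SECONDS = 60.0  # hypothetical staleness threshold, tune per service


def touch_heartbeat(path: Path = HEARTBEAT_PATH) -> None:
    """Service side: call periodically (e.g. from a recurring job) to refresh mtime."""
    path.touch()


def is_alive(
    path: Path = HEARTBEAT_PATH,
    max_age: float = MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Probe side: True if the heartbeat file exists and was touched recently."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return False  # the service never wrote a heartbeat
    current = time.time() if now is None else now
    return (current - mtime) <= max_age


def probe_exit_code(path: Path = HEARTBEAT_PATH) -> int:
    """Exit code for a healthcheck command: 0 = healthy, 1 = unhealthy."""
    return 0 if is_alive(path) else 1
```

A per-service healthcheck command would then exit with probe_exit_code(). Because the heartbeat is refreshed from inside the service's own loop, a stalled loop stops updating the file and the probe starts failing once the file is older than the threshold.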

Worker-specific signal: TaskIQ broker reachability (broker.is_started / Redis ping). Same HTTP-endpoint shape.
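The shared HTTP-endpoint shape for option 1 and the worker signal might be sketched with stdlib asyncio.start_server as follows. This is a minimal sketch, not the template's implementation: start_health_server, build_response, and the default port 8081 are hypothetical, and the server answers every path the same way rather than routing /health specifically. The liveness rule is passed in as a callable so the scheduler and the workers can reuse the same endpoint.

```python
import asyncio
from typing import Callable


def build_response(healthy: bool) -> bytes:
    """Render a minimal HTTP/1.1 response: 200 when healthy, 503 otherwise."""
    status = "200 OK" if healthy else "503 Service Unavailable"
    body = b"ok" if healthy else b"unhealthy"
    head = (
        f"HTTP/1.1 {status}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n\r\n"
    )
    return head.encode("ascii") + body


async def start_health_server(
    check: Callable[[], bool],
    host: str = "127.0.0.1",
    port: int = 8081,  # hypothetical port, not taken from the template
) -> asyncio.AbstractServer:
    """Serve the result of check() on the event loop the service already runs."""

    async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        try:
            # Drain the request line and headers; their content is ignored.
            while True:
                line = await reader.readline()
                if line in (b"\r\n", b"\n", b""):
                    break
            try:
                healthy = bool(check())
            except Exception:
                healthy = False  # a failing check counts as unhealthy
            writer.write(build_response(healthy))
            await writer.drain()
        finally:
            writer.close()
            await writer.wait_closed()

    return await asyncio.start_server(handle, host, port)
```

In the scheduler process this would be started as something like await start_health_server(lambda: scheduler.running); a worker would pass a callable wrapping whichever broker check is chosen. Since the handler shares the service's event loop, a stuck loop leaves the probe unanswered until it times out, and curl -f treats the 503 as a failure.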

Acceptance

  • aegis-stack template ships per-service healthcheck overrides for scheduler + worker containers
  • Downstream projects (aegis-pulse, future scaffolds) inherit a sane default — no false unhealthy flag, no manual override needed
  • A deadlocked scheduler or broker-disconnected worker correctly flips to unhealthy within ~90s

Open

  • Which of the three options the template should adopt as default (HTTP endpoint is the cleanest but adds a small surface; heartbeat file is friction-free)
  • Whether to apply consistently across worker-system, worker-load-test, and any future TaskIQ-style worker added via aegis-stack add-service
