Problem
The template's Dockerfile ships a single HEALTHCHECK that runs curl -f http://localhost:$PORT/health. That probe only succeeds for the webserver container, because only the webserver binds the port. Every other service started from the same image — scheduler, worker-system, any future TaskIQ workers — fails the probe by construction and ends up flagged unhealthy in docker compose ps even when working fine.
Symptoms reported downstream (aegis-pulse):
NAME STATUS
aegis-pulse-scheduler-1 Up 30 minutes (unhealthy)
aegis-pulse-worker-system-1 Up 31 minutes (unhealthy)
Container inspect shows the curl probe failing with Could not connect to server — exactly what you'd expect for a non-HTTP service.
The noisy unhealthy flag is the visible problem. The deeper one is that any future deploy gate, restart-on-unhealthy policy, or monitoring built on docker compose ps gets a permanently failing signal from these containers and can't tell a real outage from the false alarm.
Workaround in aegis-pulse
docker-compose.yml overrides for the scheduler and worker-system services:
healthcheck:
  disable: true
Silences the false flag but loses container-level liveness signal entirely — scheduler/worker death is only noticed via missed cron output or queue depth.
Proposed fix (in aegis-stack template)
Per-service healthcheck overrides in the template's docker-compose.yml. Three options, ranked from most to least conventional:
- Tiny in-process HTTP /health endpoint (most idiomatic): wrap APScheduler + TaskIQ broker in a small ASGI app on 127.0.0.1. Returns 200 if scheduler.running == True (APScheduler's canonical liveness property; see flask-apscheduler). Catches deadlocked event loops, provided the handler runs on the same loop as the scheduler, because a stuck loop also stalls the HTTP handler. Marginal cost: ~12 MB if using Starlette+uvicorn, ~1 MB with stdlib asyncio.start_server.
- Heartbeat-file approach: service writes /tmp/.alive periodically, healthcheck reads mtime. Works regardless of jobstore. ~10 LOC per service. Less idiomatic than HTTP.
- Disable per-service — what aegis-pulse does today. Honest, simple, no monitoring signal.
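For reference, the heartbeat-file option could look roughly like the sketch below. This is not code from the template: the function names, the 60-second staleness threshold, and the idea of calling the probe from a compose healthcheck command are all assumptions for illustration. Only the /tmp/.alive path comes from the proposal above.

```python
import time
from pathlib import Path
from typing import Optional

HEARTBEAT_PATH = Path("/tmp/.alive")  # path named in the proposal
MAX_AGE_SECONDS = 60.0  # assumed staleness threshold, tune per service


def touch_heartbeat(path: Path = HEARTBEAT_PATH) -> None:
    """Service side: call periodically (e.g. from a scheduled job) to refresh mtime."""
    path.touch()


def is_alive(
    path: Path = HEARTBEAT_PATH,
    max_age: float = MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Probe side: True if the heartbeat file exists and is fresh enough."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return False
    current = time.time() if now is None else now
    return (current - mtime) <= max_age


def probe_exit_code(path: Path = HEARTBEAT_PATH) -> int:
    """Exit status for a healthcheck command: 0 = healthy, 1 = unhealthy."""
    return 0 if is_alive(path) else 1
```

A healthcheck command would then exit with probe_exit_code(). Note that the heartbeat only proves liveness if it is written from the same loop or thread that does the real work; a separate writer thread would keep ticking through a deadlock.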
Worker-specific signal: TaskIQ broker reachability (broker.is_started / Redis ping). Same HTTP-endpoint shape.
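A minimal sketch of the stdlib variant of the HTTP option, assuming a hypothetical is_healthy callable that the service supplies (for the scheduler this might be lambda: scheduler.running; for a worker, whatever broker check ends up being chosen). The function names and the default port 8081 are placeholders, not anything the template defines today.

```python
import asyncio
from typing import Callable


def make_handler(is_healthy: Callable[[], bool]):
    """Build a stream handler that answers GET /health with 200 or 503."""

    async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        request_line = await reader.readline()
        # Drain the remaining request headers so the client is not reset on close.
        while True:
            line = await reader.readline()
            if line in (b"\r\n", b"\n", b""):
                break
        parts = request_line.split()
        path = parts[1] if len(parts) >= 2 else b""
        if path != b"/health":
            status, body = "404 Not Found", b"not found\n"
        elif is_healthy():
            status, body = "200 OK", b"ok\n"
        else:
            status, body = "503 Service Unavailable", b"unhealthy\n"
        head = (
            f"HTTP/1.1 {status}\r\n"
            "Content-Type: text/plain\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Connection: close\r\n\r\n"
        )
        writer.write(head.encode("ascii") + body)
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    return handle


async def start_health_server(
    is_healthy: Callable[[], bool],
    host: str = "127.0.0.1",
    port: int = 8081,  # assumed port, not defined by the template
) -> asyncio.AbstractServer:
    """Start the endpoint on the running event loop and return the server."""
    return await asyncio.start_server(make_handler(is_healthy), host, port)
```

Because the server is started on the service's own event loop, a stuck loop stops the handler from responding and the probe times out, which is the deadlock-detection property described above. curl -f treats the 503 as a failure, so the existing probe style would carry over with only the port changed.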
Acceptance
- aegis-stack template ships per-service healthcheck overrides for scheduler + worker containers
- Downstream projects (aegis-pulse, future scaffolds) inherit a sane default: no false unhealthy flag, no manual override needed
- A deadlocked scheduler or broker-disconnected worker correctly flips to unhealthy within ~90s
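To sanity-check the ~90s target: Docker marks a container unhealthy after `retries` consecutive failed checks, each check starts `interval` after the previous one completes, and a single check can take up to `timeout`. A rough upper bound on detection time is therefore retries × (interval + timeout). The values below are assumptions chosen to hit the target, not settings the template currently ships.

```python
def worst_case_detection_seconds(interval: float, timeout: float, retries: int) -> float:
    """Rough upper bound on the time from failure to the 'unhealthy' status.

    Assumes the failure happens right after a passing check and that every
    failing check runs for the full timeout.
    """
    return retries * (interval + timeout)


# Assumed settings: interval=25s, timeout=5s, retries=3
print(worst_case_detection_seconds(25, 5, 3))  # -> 90
```

With the heartbeat-file option, the staleness threshold adds to this bound, since the file has to go stale before the first check can fail.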
Open
- Which of the three options the template should adopt as default (HTTP endpoint is the cleanest but adds a small surface; heartbeat file is friction-free)
- Whether to apply consistently across worker-system, worker-load-test, and any future TaskIQ-style worker added via aegis-stack add-service