Per-service healthcheck for scheduler + workers (Dockerfile HEALTHCHECK is webserver-shaped)

## Problem

The template's `Dockerfile` ships a single `HEALTHCHECK` that runs `curl -f http://localhost:$PORT/health`. That probe only succeeds for the webserver container, because only the webserver binds the port. Every other service started from the same image — `scheduler`, `worker-system`, any future TaskIQ workers — fails the probe by construction and ends up flagged `unhealthy` in `docker compose ps` even when working fine.

Symptoms reported downstream (aegis-pulse):

```
NAME                          STATUS
aegis-pulse-scheduler-1       Up 30 minutes (unhealthy)
aegis-pulse-worker-system-1   Up 31 minutes (unhealthy)
```

Container `inspect` shows the curl probe failing with `Could not connect to server` — exactly what you'd expect for a non-HTTP service.

The noisy `unhealthy` flag is the visible problem. The deeper one is that any future deploy gate, restart-on-unhealthy policy, or monitoring built on `docker compose ps` gets a false `unhealthy` reading from these containers and cannot tell a real outage from the false alarm.

## Workaround in aegis-pulse

`docker-compose.yml` overrides for scheduler + worker-system services:

```yaml
healthcheck:
  disable: true
```

This silences the false flag but loses the container-level liveness signal entirely: scheduler or worker death is only noticed via missed cron output or queue depth.

## Proposed fix (in aegis-stack template)

Per-service healthcheck overrides in the template's `docker-compose.yml`. Three options, ranked from most to least conventional:

1. **Tiny in-process HTTP `/health` endpoint** (most idiomatic): wrap APScheduler + TaskIQ broker in a small ASGI app on `127.0.0.1`. Returns 200 if `scheduler.running` is true (APScheduler's canonical liveness property; see [flask-apscheduler](https://github.com/viniciuschiele/flask-apscheduler/blob/master/flask_apscheduler/scheduler.py)). Catches deadlocked event loops as long as the handler runs on the same loop as the scheduler, because a stuck loop then also stalls the HTTP handler. Marginal cost: ~12 MB if using Starlette+uvicorn, ~1 MB with stdlib `asyncio.start_server`.
2. **Heartbeat-file approach**: the service writes `/tmp/.alive` periodically and the healthcheck reads its mtime. Works regardless of jobstore. ~10 LOC per service. Less idiomatic than HTTP.
3. **Disable per-service**: what aegis-pulse does today. Honest, simple, no monitoring signal.
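
Option 2 is small enough to sketch in full. The following is a minimal illustration, not the template's implementation: `/tmp/.alive` comes from the list above, while `MAX_AGE_S`, `beat`, `is_fresh`, and `main` are hypothetical names chosen for this example.

```python
import time
from pathlib import Path
from typing import Optional

HEARTBEAT = Path("/tmp/.alive")  # path named in option 2
MAX_AGE_S = 60.0                 # hypothetical staleness threshold


def beat(path: Path = HEARTBEAT) -> None:
    """Service side: call periodically (e.g. from a scheduled job) to refresh the mtime."""
    path.touch()


def is_fresh(
    path: Path = HEARTBEAT,
    max_age_s: float = MAX_AGE_S,
    now: Optional[float] = None,
) -> bool:
    """Probe side: True if the heartbeat file exists and was touched recently enough."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return False  # service never wrote a heartbeat
    current = time.time() if now is None else now
    return (current - mtime) <= max_age_s


def main() -> int:
    """Exit code for a healthcheck command: 0 = healthy, 1 = unhealthy."""
    return 0 if is_fresh() else 1
```

A healthcheck command would run this module and exit with `main()`'s return value. Note that the heartbeat only proves the code path calling `beat()` is still running, so it should be called from the loop whose liveness matters.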

Worker-specific signal: TaskIQ broker reachability (`broker.is_started` / Redis ping). Same HTTP-endpoint shape.
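
As a rough sketch of the stdlib variant of option 1, the endpoint below takes a pluggable `is_healthy` callable so the same shape can serve the scheduler and the workers. The names `make_handler` and `start_health_server` and the port `8081` are assumptions for illustration. The sketch ignores the request path and answers every request.

```python
import asyncio
from typing import Callable


def make_handler(is_healthy: Callable[[], bool]):
    """Build a connection handler that answers any request with 200 or 503."""

    async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        # Consume the request line and headers up to the blank line (or EOF).
        while (await reader.readline()) not in (b"\r\n", b"\n", b""):
            pass
        ok = is_healthy()
        status = "200 OK" if ok else "503 Service Unavailable"
        body = b"ok\n" if ok else b"unhealthy\n"
        head = (
            f"HTTP/1.1 {status}\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Connection: close\r\n\r\n"
        )
        writer.write(head.encode() + body)
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    return handle


async def start_health_server(
    is_healthy: Callable[[], bool],
    host: str = "127.0.0.1",
    port: int = 8081,  # hypothetical port, not taken from the template
) -> asyncio.AbstractServer:
    """Start the health listener on the currently running event loop."""
    return await asyncio.start_server(make_handler(is_healthy), host, port)
```

In the scheduler process, `is_healthy` could be `lambda: scheduler.running`; in a worker it would wrap whichever broker check is chosen. Because the server is started on the service's own event loop, a stuck loop leaves the probe unanswered and `curl -f` times out.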

## Acceptance

- aegis-stack template ships per-service healthcheck overrides for scheduler + worker containers
- Downstream projects (aegis-pulse, future scaffolds) inherit a sane default — no false `unhealthy` flag, no manual override needed
- A deadlocked scheduler or broker-disconnected worker correctly flips to `unhealthy` within ~90s
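
The ~90s target constrains the healthcheck timing. A back-of-the-envelope bound, assuming Docker schedules each check `interval` after the previous one completes and needs `retries` consecutive failures, and that a probe against a hung service runs for the full `timeout`. The function name and the 25s/5s values are illustrative, not settings from the template:

```python
def worst_case_detection_s(
    interval_s: float,
    timeout_s: float,
    retries: int,
    heartbeat_max_age_s: float = 0.0,
) -> float:
    """Upper bound on seconds between a service hanging and the container going unhealthy.

    Each failing probe costs up to interval + timeout; for the heartbeat-file
    option the file must first go stale, which adds its max age.
    """
    return heartbeat_max_age_s + retries * (interval_s + timeout_s)


# Illustrative settings that fit the ~90s target for an HTTP probe.
tuned = worst_case_detection_s(interval_s=25, timeout_s=5, retries=3)

# Docker's default HEALTHCHECK timing (30s interval, 30s timeout, 3 retries).
defaults = worst_case_detection_s(interval_s=30, timeout_s=30, retries=3)
```

Under these assumptions `tuned` is 90 and `defaults` is 180, so the template would likely need explicit `interval`, `timeout`, and `retries` values rather than Docker's defaults to meet the target.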

## Open

- Which of the three options the template should adopt as default (HTTP endpoint is the cleanest but adds a small surface; heartbeat file is friction-free)
- Whether to apply consistently across `worker-system`, `worker-load-test`, and any future TaskIQ-style worker added via `aegis-stack add-service`
