This document specifies a scalable Node.js/TypeScript service that consumes the Auriga Flight Comparison API to return optimal flight solutions between airports. The system handles 250k-1M daily users through intelligent caching (85%+ hit rate), horizontal scaling, and cost-effective API usage patterns. Core focus: price-based optimization with extensible scoring for future enhancements.
- Provider API limitations: No departure/arrival times or durations exposed. "Fastest" mode deferred to Phase 2 pending schedule data integration.
- API pricing model: Assumed pay-per-call; targeting <15% of requests hitting upstream via aggressive caching.
- Update frequency: Flight data assumed to refresh every 15-30 minutes; reference data (airports/airlines) daily.
- Currency: All prices in EUR (to be confirmed with provider).
- Multi-segment journeys: Provider returns `Flight[][]` where inner arrays represent connected segments forming a complete journey.
- Authentication: Service uses API keys for client auth; provider requires a Bearer token (stored in a secret manager).
- Objective: Build a scalable service that, by consuming the provided Flight Comparison API, returns the best* flight(s) for a user between two airports.
- Tech stack (core logic): Node.js + TypeScript.
- Primary consumers: Web/mobile clients and internal services via a public service API.
- Key constraints:
- Average daily users: 250k. Peak: 1M.
- API requires Bearer authentication.
- “Best*” is configurable per scenario and user preferences.
- Server: `https://aurigaspa.com/flight-engine/api`
- Security: Bearer token (`Authorization: Bearer <token>`)
- Endpoints:
  - `GET /flights/all` → `FlightsCollection` (array of `Flight`)
  - `GET /flights/from/{departureCode}/to/{arrivalCode}` → `TravelSolutions` (array of arrays of `Flight`, ordered by increasing price)
  - `GET /airports/all` → `AirportsCollection`
  - `GET /airlines/all` → `AirlinesCollection`
- Data models (essential fields):
  - `Flight`: `id`, `airlineId`, `departureAirportId`, `arrivalAirportId`, `price`
  - `Airport`: `id`, `codeIata`, `latitude`, `longitude`
  - `Airline`: `id`, `name`, `codeIataPrefix`, `logoFilename`
- Goals
- Return best* flight(s) with low latency (<250 ms p95 after warm cache; <1.5 s cold).
- Keep external API usage cost-effective via caching, prefetching, and rate limiting.
- Handle peak load elastically.
- Provide observability and safe fallbacks.
- Non-Goals
- Building a booking engine or payment flow.
- Long-term historical analytics (beyond operational metrics).
- Cheapest (MVP): Minimize total journey price (sum of all segment prices). Default mode; provider pre-sorts by price.
- Fewest Stops (MVP): Minimize segment count. Useful for convenience over price.
- User Preferences (MVP): Apply filters for airline whitelist/blacklist, max stops, max price before scoring.
- Total Price: Sum of `price` across all segments in a solution.
- Stops Count: `segments.length - 1` (e.g., 2 segments = 1 stop).
- Segment Validation: Verify `segments[i].arrivalAirportId === segments[i+1].departureAirportId` for connectivity.
- Journey ID: Hash of segment IDs for deduplication and caching (see the sketch below).
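A minimal TypeScript sketch of these derived metrics, assuming the `Flight` shape from the provider's data model above (the SHA-1 journey hash is an illustrative choice, not mandated by the provider):

```typescript
import { createHash } from "node:crypto";

interface Flight {
  id: number;
  airlineId: number;
  departureAirportId: number;
  arrivalAirportId: number;
  price: number;
}

// Total journey price: sum of all segment prices.
const totalPrice = (segments: Flight[]): number =>
  segments.reduce((sum, s) => sum + s.price, 0);

// Stops: one fewer than the number of segments (2 segments = 1 stop).
const stopsCount = (segments: Flight[]): number => segments.length - 1;

// Connectivity: each segment must arrive where the next one departs.
const isConnected = (segments: Flight[]): boolean =>
  segments.every(
    (s, i) =>
      i === segments.length - 1 ||
      s.arrivalAirportId === segments[i + 1].departureAirportId,
  );

// Journey ID: stable hash of the ordered segment IDs, for dedup and caching.
const journeyId = (segments: Flight[]): string =>
  createHash("sha1").update(segments.map((s) => s.id).join("-")).digest("hex");
```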
- API Cost
- Cost per call may apply; aim for cache hit rate ≥ 85% during peak via route-based caching and prefetching of top O&D (origin and destination) pairs.
- Update Frequency
- If upstream refreshes hourly: set TTL ≈ 10–20 min for volatile data; daily for reference data (airports/airlines) with ETag checks.
- Support cache versioning to flush on detected upstream changes.
- Availability
- Upstream SLA unknown; implement circuit breaker and graceful degrade to last-known good (LKG) cache for ≤ 30 min.
- Client → API Gateway/Edge → Best Flight Service (Node/TS)
- Integrations:
- Flight Provider Client (HTTP, bearer token)
- Cache: Redis (managed) for route queries, Last Known Good (LKG) objects, and session data
- DB: Postgres for user prefs and audit logs (optional: defer to Phase 2 if Redis persistence sufficient)
- Scheduler: Node-cron or Cloud Scheduler for periodic cache warmup of top routes
- Observability (MVP): Cloud-native tools (AWS CloudWatch/X-Ray or GCP Cloud Monitoring/Cloud Trace) for metrics, logs, and traces
- Observability (Phase 2): OpenTelemetry, Prometheus, Grafana for vendor-neutral, advanced observability
- Secret Manager: AWS Secrets Manager / GCP Secret Manager for provider bearer token
- Deferred to Phase 2: Message queue (Kafka/SQS) for high-volume prefetch, CDN for edge caching
- Authentication:
- Client → Service: API Key authentication via `X-API-Key` header
  - Keys stored in Redis with rate limit metadata
  - Key rotation supported via admin API
  - Optional: JWT tokens for user-specific preferences (Phase 2)
- Service → Provider: Bearer token from secret manager, rotated every 90 days
- Rate Limiting:
- Per API key: 100 requests/minute (burst: 150); a limiter sketch follows this list
- Per route key: 10 requests/minute to prevent cache stampede
- Global: 500 QPS at peak (horizontal scaling trigger)
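As a sketch of how the per-key limit could be enforced, the snippet below uses a fixed-window counter in Redis, a simpler approximation of the token bucket described above (the `ratelimit:` key prefix and the `ioredis` client are assumptions):

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Fixed-window approximation of the per-API-key limit (100 req/min).
// A full token bucket would also track a refill timestamp to allow bursts.
async function isAllowed(apiKey: string, limit = 100): Promise<boolean> {
  const window = Math.floor(Date.now() / 60_000); // current minute
  const key = `ratelimit:${apiKey}:${window}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, 60); // expire with the window
  return count <= limit;
}
```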
- Best Flight API
  - `GET /v1/best-flight?from=AAA&to=BBB&mode=cheapest|fewest-stops&maxStops=&maxPrice=&airlines=&excludeAirlines=`
  - `GET /v1/airports` and `GET /v1/airlines` (cached passthrough with 24h TTL)
  - `GET /v1/health` (liveness/readiness probes)
- Provider Adapter
  - Wraps `GET /flights/from/{from}/to/{to}` with retry logic (sketch below)
  - Retries: 3 attempts, exponential backoff (100ms, 200ms, 400ms) with jitter
  - Circuit breaker: trip after 5 consecutive failures or 50% error rate in 30s window
  - Timeout: 1200ms per request
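A minimal sketch of the retry and timeout behavior using Node's global `fetch` and `AbortSignal.timeout` (Node 18+); the backoff schedule matches the numbers above, and the function name is illustrative:

```typescript
async function fetchWithRetry(url: string, token: string): Promise<Response> {
  const delays = [100, 200, 400]; // ms: base values for 3 retries
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      // Abort if the provider takes longer than the 1200ms budget.
      const res = await fetch(url, {
        headers: { Authorization: `Bearer ${token}` },
        signal: AbortSignal.timeout(1200),
      });
      if (res.ok || res.status < 500) return res; // don't retry client errors
      throw new Error(`Upstream ${res.status}`);
    } catch (err) {
      if (attempt === delays.length) throw err; // retries exhausted
      // Add jitter: sleep the base delay plus a random fraction of it.
      const jitter = Math.random() * delays[attempt];
      await new Promise((r) => setTimeout(r, delays[attempt] + jitter));
    }
  }
  throw new Error("unreachable");
}
```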
- Caching Layer
  - Key format: `route:{from}:{to}:v1:{hash(mode,filters)}` (key builder sketched below)
  - TTL: 15 min for flight data; 24h for reference data (airports, airlines)
  - Negative caching: 404 errors cached for 5 min to prevent repeated invalid lookups
  - Cache versioning: increment `v1` when scoring logic changes
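A sketch of the key builder, assuming filters are serialized deterministically before hashing so that logically identical queries share a cache entry (field names are illustrative):

```typescript
import { createHash } from "node:crypto";

interface Filters {
  maxStops?: number;
  maxPrice?: number;
  airlines?: string;
  excludeAirlines?: string;
}

function routeCacheKey(from: string, to: string, mode: string, filters: Filters): string {
  // Sorted-key serialization keeps the hash stable across property order.
  const canonical = JSON.stringify(filters, Object.keys(filters).sort());
  const hash = createHash("sha1").update(`${mode}|${canonical}`).digest("hex").slice(0, 12);
  return `route:${from}:${to}:v1:${hash}`;
}
```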
- Scoring Engine
- Deterministic scoring based on mode (cheapest, fewest-stops, other)
- Normalization: price and stops normalized using min-max from current result set (sketched below)
- Cold start: use static percentiles (p50 price: €200, p50 stops: 1) until 100+ queries collected
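A sketch of min-max normalization over the current result set (the weights and the combined score formula are assumptions; the cold-start path would substitute the static percentiles above for the observed min/max):

```typescript
interface Candidate {
  price: number;
  stops: number;
}

// Lower price and fewer stops are better; both are normalized to [0, 1]
// and combined into a single score where higher is better.
function scoreSolutions(candidates: Candidate[], wPrice = 0.7, wStops = 0.3): number[] {
  const prices = candidates.map((c) => c.price);
  const stops = candidates.map((c) => c.stops);
  const norm = (v: number, min: number, max: number) =>
    max === min ? 0 : (v - min) / (max - min);
  return candidates.map(
    (c) =>
      1 -
      (wPrice * norm(c.price, Math.min(...prices), Math.max(...prices)) +
        wStops * norm(c.stops, Math.min(...stops), Math.max(...stops))),
  );
}
```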
- Prefetcher
  - Cron job every 30 min to warm top 1000 routes (based on previous 7-day query logs)
  - Stores results in Redis with extended TTL (60 min) and `prefetched=true` flag
- RouteResult: `fromIata`, `toIata`, `solutions: Flight[][]`, `providerVersion`, `fetchedAt`
- ScoredSolution: `solutionId`, `score`, `price`, `stops`, `segments: Flight[]`, `explanations[]`
- Reference: `Airport` (id, codeIata, lat, lon), `Airline` (id, name, codeIataPrefix, logoFilename)
- UserPrefs: `maxStops`, `maxPrice`, `preferredAirlines[]`, `blockedAirlines[]`, `weights{price,duration,stops}` (typed sketch below)
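A typed sketch of these DTOs (exact field types are assumptions where the list above leaves them open; `Flight` reuses the provider model):

```typescript
interface RouteResult {
  fromIata: string;
  toIata: string;
  solutions: Flight[][]; // each inner array is one connected journey
  providerVersion: string;
  fetchedAt: string; // ISO-8601 timestamp
}

interface ScoredSolution {
  solutionId: string;
  score: number;
  price: number;
  stops: number;
  segments: Flight[];
  explanations: string[];
}

interface UserPrefs {
  maxStops?: number;
  maxPrice?: number;
  preferredAirlines: string[];
  blockedAirlines: string[];
  weights: { price: number; duration: number; stops: number };
}
```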
1. Resolve `from`/`to` IATA codes; validate against cached airports.
2. Check Redis for `route:{from}:{to}:criteria:{hash}`.
3. On cache hit → return top N with metadata.
4. On miss → call provider `GET /flights/from/{from}/to/{to}`.
5. Filter by prefs; compute score according to mode.
6. Persist result in Redis (and optionally Postgres for audit/LKG).
7. Return top 1 (or top K for client-side choice); a condensed handler sketch follows.
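The same flow as a condensed handler sketch; `getCached`, `setCached`, `fetchSolutions`, and `scoreAndFilter` are hypothetical helpers standing in for the caching layer, provider adapter, and scoring engine described above:

```typescript
async function bestFlight(from: string, to: string, mode: string, prefs: UserPrefs) {
  const key = routeCacheKey(from, to, mode, prefs); // see the caching layer sketch
  const cached = await getCached(key);
  if (cached) return { ...cached, source: "cache" }; // hit: top N plus metadata

  const solutions = await fetchSolutions(from, to); // miss: call the provider
  const scored = scoreAndFilter(solutions, mode, prefs);
  await setCached(key, scored, 15 * 60); // 15 min TTL for flight data
  return { ...scored, source: "live" };
}
```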
All errors return JSON with:
```json
{
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable description",
    "details": {},
    "requestId": "uuid"
  }
}
```

| HTTP | Code | Description | Retry? |
|---|---|---|---|
| 400 | INVALID_IATA_CODE | Airport code must be 3 uppercase letters | No |
| 400 | INVALID_MODE | Mode must be: cheapest, fewest-stops, balanced | No |
| 400 | INVALID_FILTERS | maxStops/maxPrice must be positive numbers | No |
| 401 | INVALID_API_KEY | Missing or invalid X-API-Key header | No |
| 404 | AIRPORT_NOT_FOUND | Departure or arrival airport not found | No |
| 404 | NO_FLIGHTS_FOUND | No flights available for this route | Yes (5m) |
| 429 | RATE_LIMIT_EXCEEDED | Too many requests, retry after N seconds | Yes |
| 500 | INTERNAL_ERROR | Unexpected server error | Yes |
| 502 | UPSTREAM_ERROR | Provider API unavailable | Yes |
| 503 | SERVICE_UNAVAILABLE | Circuit breaker open, using cached data | Yes |
| 504 | UPSTREAM_TIMEOUT | Provider API timeout | Yes |
- Rate limiting: Token bucket per API key (100/min) and per route (10/min)
- Retries: 3 attempts with exponential backoff (100ms, 200ms, 400ms) + jitter
- Circuit breaker: Open after 5 failures or 50% error rate; half-open after 30s; serve LKG during open state
- Timeouts: Provider 1200ms, total request 2000ms
- Input validation (zod schema sketched below):
  - IATA: `^[A-Z]{3}$`
  - maxStops: integer 0-5
  - maxPrice: float > 0
  - airlines/excludeAirlines: comma-separated IATA prefixes `^\d{3}(,\d{3})*$`
- Graceful degradation: Return LKG cache with `source: "cache-fallback"` and warning header
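A sketch of these rules as a zod schema (the monorepo layout later in this document names zod for validation; the field names mirror the query parameters of `/v1/best-flight`):

```typescript
import { z } from "zod";

const iata = z.string().regex(/^[A-Z]{3}$/, "Airport code must be 3 uppercase letters");
const airlineList = z.string().regex(/^\d{3}(,\d{3})*$/); // comma-separated prefixes

const bestFlightQuery = z.object({
  from: iata,
  to: iata,
  mode: z.enum(["cheapest", "fewest-stops"]).default("cheapest"),
  maxStops: z.coerce.number().int().min(0).max(5).optional(),
  maxPrice: z.coerce.number().positive().optional(),
  airlines: airlineList.optional(),
  excludeAirlines: airlineList.optional(),
});
```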
- Flights: TTL 10–30 min; invalidate by schedule (cron) and on anomaly detection (sudden price shift > X%).
- Airports & Airlines: TTL 24 h with background refresh.
- Stale-while-revalidate (SWR): Serve cached result and refresh in background for hot routes (sketched below).
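A sketch of the SWR path: serve the cached value immediately and refresh in the background once it crosses a soft age threshold (`getCached` is a hypothetical Redis wrapper returning the stored value plus its fetch time):

```typescript
type CacheEntry = { value: unknown; fetchedAt: string };

async function getWithSWR(
  key: string,
  softTtlSec: number,
  refresh: () => Promise<unknown>,
) {
  const entry: CacheEntry | null = await getCached(key);
  if (!entry) return refresh(); // hard miss: fetch synchronously
  const ageSec = (Date.now() - Date.parse(entry.fetchedAt)) / 1000;
  if (ageSec > softTtlSec) {
    // Stale but servable: kick off a background refresh, don't block the request.
    refresh().catch(() => {
      /* log and keep serving the cached value */
    });
  }
  return entry.value;
}
```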
LKG is a resilience mechanism that serves stale but valid data when the provider API is unavailable.
Storage:
- Key format: `lkg:route:{from}:{to}:v1`
- Stored in Redis with extended TTL (60 min) or via Redis persistence (AOF)
- Retains last 1–3 successful snapshots per route
Trigger Conditions:
- Provider API returns 5xx errors or times out (>1200ms)
- Circuit breaker is open (after 5 consecutive failures or 50% error rate)
- Rate limit exceeded on provider (429)
Behavior:
- Serve LKG data for ≤30 min during outages
- Response includes `source: "cache-fallback"` and warning message
- After 30 min of staleness, return error to client (data too old); see the fallback sketch below
Benefits: Maintains service availability during provider outages, helps achieve 99.9% SLA target.
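A sketch of the fallback path, assuming the provider adapter throws once retries and the circuit breaker are exhausted; the 30-minute staleness cap matches the behavior above, and the helpers are hypothetical:

```typescript
const MAX_LKG_AGE_SEC = 30 * 60;

async function getSolutionsWithFallback(from: string, to: string) {
  const lkgKey = `lkg:route:${from}:${to}:v1`;
  try {
    const fresh = await fetchSolutions(from, to);
    await setCached(lkgKey, fresh, 60 * 60); // refresh the LKG snapshot
    return { data: fresh, source: "live" };
  } catch (err) {
    const lkg = await getCached(lkgKey); // { value, fetchedAt } or null
    const ageSec = lkg ? (Date.now() - Date.parse(lkg.fetchedAt)) / 1000 : Infinity;
    if (lkg && ageSec <= MAX_LKG_AGE_SEC) {
      return { data: lkg.value, source: "cache-fallback" }; // stale but within budget
    }
    throw err; // too stale: surface 502/503/504 to the client
  }
}
```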
- Traffic assumptions
- Avg 250k DAU → ~3 QPS sustained (at 1 search/user/day), but peaks and bursts matter. At peak 1M DAU (~12 QPS sustained) with multiple searches per user and a 10x burst factor: plan for 300–500 QPS.
- With CDN/Redis hits ≥ 85%, upstream calls ≤ 75 QPS during peak.
- Platform: Google Cloud Run / Azure Container Apps / AWS App Runner
- Configuration:
- Container: 1 vCPU, 2GB RAM per instance
- Min instances: 2 (avoid cold starts)
- Max instances: 50 (auto-scale based on concurrency)
- Concurrency: 80 requests per instance
- Auto-scaling: Platform-managed based on request load
- Benefits: Zero cluster management, pay-per-use, built-in load balancing
- When to migrate: Multi-cloud needs, advanced networking, cost optimization at scale
- Setup: GKE/AKS/EKS with HPA on CPU+latency
- Scaling: Start 3–5 pods (1 vCPU/2GB), scale to 20–30 at peak
- Redis: Managed service (ElastiCache/Memorystore/Azure Cache) with 3 shards, ~5–10 GB memory
- Postgres: Managed service (RDS/Cloud SQL/Azure Database) with read replicas if needed (Phase 2)
- CDN caching of `GET /v1/best-flight` for 60–120 s if query hash identical and privacy allows
- Reduce provider calls via:
  - High Redis hit rate (target 85%+), SWR, prefetching top 1000 routes
  - Request deduplication: coalesce concurrent identical requests (single-flight pattern)
  - Negative caching: cache 404s for 5 min to avoid repeated invalid lookups
  - Tune TTLs per seasonality and route volatility (15-30 min dynamic)
- Cost tracking:
  - Metric: `provider_api_calls_total` with labels `route`, `cache_hit`
  - Dashboard: cost/day, cost/request, cache hit rate impact
  - Alert: if daily cost >$1,500 (anomaly detection)
- Request coalescing: Gateway-level dedup for identical concurrent requests
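A minimal sketch of the single-flight pattern: concurrent requests for the same key await one shared provider call. This map is per-process, so a multi-instance deployment still relies on the per-route Redis limit above:

```typescript
const inflight = new Map<string, Promise<unknown>>();

function singleFlight<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inflight.get(key);
  if (existing) return existing as Promise<T>; // join the in-flight call
  const p = fn().finally(() => inflight.delete(key)); // clean up once settled
  inflight.set(key, p);
  return p;
}

// Usage: singleFlight("route:JFK:LAX", () => fetchSolutions("JFK", "LAX"));
```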
- `apps/api` (Fastify/Express + zod validation)
- `packages/provider-client` (OpenAPI-generated TS client)
- `packages/scoring` (pure functions + tests)
- `packages/types` (shared DTOs)
- Unit: Scoring logic, filters, validation (start on Node.js test runner, consider Jest or Vitest later)
- Contract: Provider API mocks (MSW, Mock Service Worker)
- E2E: API flows with real Redis (Testcontainers)
- Load: k6 or Artillery for 500 QPS target
MVP (Serverless Containers):
- Pipeline: GitHub Actions / GitLab CI / Azure DevOps
- Steps: Lint → Typecheck → Test → Build container → Push to registry → Deploy to Cloud Run/Container Apps
- Deployment: Direct deployment via CLI/Terraform
- Rollback: Platform-managed traffic splitting (blue/green)
Phase 2 (Kubernetes):
- GitOps with ArgoCD/Flux
- Helm charts for declarative deployments
- Use LaunchDarkly, Flagsmith, or cloud provider feature flags
- Toggle scoring strategies (cheapest vs. fewest-stops)
- Gradual rollout of new features
- Store provider bearer token in secret manager (cloud provider KMS-backed), rotate regularly.
- TLS everywhere, HSTS at edge (Strict-Transport-Security: max-age=31536000).
- PII minimization; treat airports/airlines as non-PII.
- Input sanitization; outbound allowlist to provider domain.
Request:

```http
GET /v1/best-flight?from=JFK&to=LAX&mode=cheapest&maxStops=1&maxPrice=500
X-API-Key: your-api-key-here
```
Response (200 OK):

```json
{
"best": {
"journeyId": "hash-abc123",
"totalPrice": 299.50,
"currency": "EUR",
"stops": 0,
"score": 0.85,
"segments": [
{
"id": 12345,
"airlineId": 1,
"airlineName": "Example Airlines",
"departureAirportId": 101,
"departureCode": "JFK",
"arrivalAirportId": 202,
"arrivalCode": "LAX",
"price": 299.50
}
],
"explanation": "Cheapest direct flight"
},
"alternatives": [
{
"journeyId": "hash-def456",
"totalPrice": 325.00,
"currency": "EUR",
"stops": 1,
"score": 0.78,
"segments": [...],
"explanation": "One stop via ORD"
}
],
"metadata": {
"source": "cache",
"fetchedAt": "2025-10-29T08:30:00Z",
"cacheAge": 120,
"requestId": "req-uuid-789"
}
}
```

Response (404 Not Found):

```json
{
"error": {
"code": "NO_FLIGHTS_FOUND",
"message": "No flights available from JFK to LAX",
"details": {
"from": "JFK",
"to": "LAX"
},
"requestId": "req-uuid-790"
}
}
```

- `GET /v1/airports`: cached passthrough of provider `/airports/all` with 24h TTL
- `GET /v1/airlines`: cached passthrough of provider `/airlines/all` with 24h TTL
MVP (Serverless Containers):
- Environments: dev, staging, prod
- Platform: Cloud Run / Azure Container Apps / AWS App Runner
- IaC: Terraform or cloud provider CLI (gcloud, az, aws)
- Rollout: Platform-managed traffic splitting (0% → 10% → 50% → 100%)
- Rollback: Automatic on health check failures or manual via CLI
- Config: Environment variables injected from Secret Manager
Phase 2 (Kubernetes):
- IaC: Terraform + Helm charts
- GitOps: ArgoCD or Flux for declarative deployments
- Rollout: Canary with Flagger (gradual traffic shift based on metrics)
- Rollback: Automated on SLO violation (latency, error rate)
| Metric | Target | Alert Threshold |
|---|---|---|
| p95 Latency (warm) | <250ms | >300ms for 5 min |
| p95 Latency (cold) | <1500ms | >2000ms for 5 min |
| Cache Hit Rate | ≥85% | <80% for 10 min |
| Error Rate | <0.5% | >1% for 5 min |
| Provider API Success | >99% | <95% for 5 min |
| Circuit Breaker Trips | 0/hour | >3/hour |
| QPS | 3-500 | >600 (scale trigger) |
| Redis Memory | <8GB | >9GB |
| API Key Rate Limits | <50 hits/min | >100 hits/min per key |
- Service Health: Latency percentiles, error rates, throughput
- Cache Performance: Hit/miss rates, eviction rates, memory usage
- Provider Integration: Success rate, latency, circuit breaker state
- Business Metrics: Top routes, mode distribution, average price
- Cost Tracking: Provider API calls/hour, estimated cost/day
- P0 (Immediate): Service down, error rate >5%, provider circuit breaker open >10 min
- P1 (15 min): Latency >2s, cache hit rate <70%, Redis connection failures
- P2 (1 hour): Cache hit rate <80%, unusual traffic patterns, cost anomalies
- Structured JSON logs with fields: `requestId`, `apiKey`, `from`, `to`, `mode`, `latency`, `cacheHit`, `error`
- Log levels: ERROR (5xx, circuit breaker), WARN (cache miss, slow query), INFO (requests), DEBUG (scoring details)
- Retention: 7 days hot, 30 days cold storage
MVP (Cloud-Native):
- AWS X-Ray or GCP Cloud Trace for distributed tracing
- Automatic instrumentation for HTTP requests, Redis calls
- Sample rate: 100% for errors, 10% for success, 100% for requests >1s
Phase 2 (OpenTelemetry):
- Vendor-neutral spans for: API request, cache lookup, provider call, scoring, response serialization
- Exporters to Jaeger/Tempo for advanced trace analysis
- Same sampling strategy with custom span attributes
- p95 latency: <250 ms (warm cache), <1.5 s (cold cache)
- p99 latency: <500 ms (warm), <2.5 s (cold)
- Availability: 99.9% monthly (43 min downtime/month)
- Cache hit rate: ≥ 85% during peak hours
- Error budget: 0.1% (720 failed requests per 720k requests)
- Provider API usage: <15% of total requests (cost optimization)
- Time to recovery: <5 min for service restarts, <15 min for provider outages (via LKG cache)
1. Client calls `/v1/best-flight?from=AAA&to=BBB&mode=cheapest`.
2. Service validates and checks Redis.
3. Cache miss triggers provider call; response scored and cached.
4. Service returns best solution with `source=live` and sets SWR refresh.
```mermaid
flowchart TB
  User[End User]
  ClientApp[Web/Mobile App]
  BestFlightService[[Best Flight Service]]
  FlightProvider[(Auriga Flight Provider API)]
  Redis[(Redis Cache)]
  SecretManager[(Secret Manager)]

  User -->|searches flights| ClientApp
  ClientApp -->|HTTPS + API Key| BestFlightService
  BestFlightService -->|query/cache| Redis
  BestFlightService -->|HTTPS + Bearer Token| FlightProvider
  BestFlightService -->|fetch secrets| SecretManager
```
```mermaid
flowchart TB
  ClientApp[Client Apps]
  subgraph Edge[Edge Layer]
    APIGW[API Gateway + Rate Limiter]
  end
  subgraph Service[Best Flight Service - Serverless Container]
    Controllers[API Controllers]
    ProviderAdapter[Provider Adapter + Circuit Breaker]
    Scoring[Scoring Engine]
    CacheLayer[Caching Layer]
    Prefetcher[Prefetcher Job]
  end
  Redis[(Redis Cache + LKG)]
  SecretManager[(Secret Manager)]
  FlightProvider[(Auriga Flight Provider API)]
  Scheduler[Cloud Scheduler / Node-cron]
  Observability[(CloudWatch / Stackdriver)]

  ClientApp -->|HTTPS + X-API-Key| APIGW
  APIGW -->|validate key| Redis
  APIGW -->|route request| Controllers
  Controllers -->|check cache| CacheLayer
  Controllers -->|validate airports| CacheLayer
  Controllers -->|score solutions| Scoring
  CacheLayer <-->|get/set| Redis
  Controllers -->|on cache miss| ProviderAdapter
  ProviderAdapter -->|fetch token| SecretManager
  ProviderAdapter -->|HTTPS + Bearer| FlightProvider
  Scheduler -->|trigger every 30min| Prefetcher
  Prefetcher -->|warm cache| ProviderAdapter
  Prefetcher -->|store results| Redis
  Controllers -->|logs/metrics/traces| Observability
  ProviderAdapter -->|logs/metrics| Observability
  APIGW -->|access logs| Observability
```
Notes:
- Postgres and message queue deferred to Phase 2
- Redis persistence (AOF) enabled for LKG storage
- Observability uses cloud-native tools (CloudWatch/Stackdriver) for MVP
- Service runs as serverless containers (Cloud Run/Container Apps/App Runner)
- Redis: AOF persistence enabled; daily snapshots to S3; RTO <10 min via snapshot restore
- Secrets: Versioned in secret manager; previous version retained for 30 days
- Logs: 7 days in hot storage (Elasticsearch), 30 days in cold (S3), then deleted
- Metrics: 90 days retention in Prometheus, 1 year aggregated in long-term storage
- Backups: Redis snapshots retained for 7 days
- Kubernetes migration: Move from serverless containers to GKE/AKS/EKS for multi-cloud portability and fine-grained control
- OpenTelemetry observability: Vendor-neutral tracing and metrics with Prometheus/Grafana/Jaeger
- Fastest mode: Integrate schedule API (e.g., FlightAware, OAG) for departure/arrival times
- User accounts: JWT authentication, saved preferences, search history
- Price alerts: Notify users when prices drop below threshold
- Multi-city: Support complex itineraries (A→B→C)
- Seat availability: Integrate real-time seat maps
- Carbon footprint: Display CO2 emissions per flight
- Message queue: Kafka/SQS for high-volume prefetch and async analytics
- CDN: Edge caching for global latency optimization
- ML-based scoring: Predict user preferences based on historical behavior
- A/B testing: Framework for scoring algorithm experiments
- Data minimization: No PII stored unless user accounts added in Phase 2
- Right to deletion: API for user data deletion (if user accounts exist)
- Data retention: Audit logs anonymized after 30 days (remove API key associations)
- Consent: Cookie banner for analytics (if web UI added)
- Data processing agreement: Required with provider if handling EU user data
- Use `TravelSolutions.data: Flight[][]` for multi-segment journeys
- Each inner `Flight[]` array represents a complete journey with connected segments
- Map `Airline.id` to `Flight.airlineId` for display and filtering (lookup sketch below)
- Validate IATA with regex `^[A-Z]{3}$` and cross-check against cached `AirportsCollection`
- Segment connectivity validation: `segments[i].arrivalAirportId === segments[i+1].departureAirportId`
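A small sketch of the airline lookup, assuming the reference collections are already cached locally (types follow the provider models listed earlier):

```typescript
interface Airline {
  id: number;
  name: string;
  codeIataPrefix: string;
  logoFilename: string;
}

// Build an id → Airline map once per reference-data refresh (24h TTL).
function airlineIndex(airlines: Airline[]): Map<number, Airline> {
  return new Map(airlines.map((a) => [a.id, a]));
}

// Enrich a segment for display; unknown ids fall back to a placeholder.
function airlineName(index: Map<number, Airline>, airlineId: number): string {
  return index.get(airlineId)?.name ?? "Unknown airline";
}
```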
Per Container Instance (1 vCPU, 2GB RAM, 80 concurrency):
- Sustained throughput: 50-100 QPS per instance
- Cold start: <2s (mitigated by min instances = 2)
- Redis lookup: <5ms p95
- Provider API call: 200-800ms p95 (network dependent)
- Scoring computation: <10ms for 50 solutions
- Total warm path: <50ms (cache hit) + serialization
- Total cold path: <1200ms (provider) + 50ms (scoring/caching)
Scaling Behavior:
- At 300 QPS peak: ~6-8 active instances
- At 500 QPS burst: ~10-12 active instances
- Platform auto-scales based on request concurrency
Assumptions: 500k DAU average, 1.5 searches/user/day = 750k requests/day = 22.5M/month
| Component | Cost | Notes |
|---|---|---|
| Compute (Cloud Run/Container Apps) | $150-250 | ~5M vCPU-seconds, 10GB-seconds RAM |
| Redis (10GB managed) | $150 | ElastiCache/Memorystore/Azure Cache |
| Provider API (22.5M × 15% ≈ 3.4M calls @ $0.01/call) | $34,000 | 85% cache hit rate |
| Observability (CloudWatch/Stackdriver) | $50-100 | Logs, metrics, traces |
| Secret Manager | $10 | Token storage and rotation |
| Total | ~$34,400/month | |
Cost optimization:
- With 85% cache hit rate, provider calls = 3.4M/month
- Each 1% cache hit improvement saves ~$2,250/month
- Serverless compute: pay only for actual request time (vs. always-on pods)
Compute cost may decrease to $100-150/month with reserved instances at sustained load, but adds operational overhead.