AIS System Overview (2026-05-27)

This document describes in detail the AIS system as it currently exists in the repo, covering the storage, compute, and dataflow layers and the operational status.

1) Overview and goals

AIS is an end-to-end atmospheric data pipeline:

  • Collect data from multiple sources (Weather, OpenAQ, Sentinel-5P, MAIAC, ERA5).
  • Publish it to Kafka and process it with Spark (streaming + batch).
  • Store history in Iceberg on HDFS and sync a serving subset to Cassandra.
  • Orchestrate and monitor with Airflow + Monitoring UI.
  • Extend with PM2.5 training/inference and a PM2.5 API.

General reference documents: README.md, PROJECT_FULL_NOTES.md.

2) Runtime architecture (storage vs compute)

2.1 Storage / Data infrastructure (outside K8s)

Runs under Docker Compose (or as external endpoints in production):

  • Kafka / Zookeeper
  • HDFS (namenode/datanode)
  • Iceberg warehouse
  • Cassandra (serving)
  • Airflow metadata DB
  • Model artifact storage

Files/configuration:

2.2 Compute layer (K8s)

The compute layer runs on Kubernetes:

  • Spark driver/executor pods (Spark-on-K8s)
  • ML training/inference Jobs/CronJobs
  • PM2.5 API Deployment
  • Check jobs

Related tools/manifests:

Note: Spark Compose (spark-master/worker) is only a dev fallback; the target runtime is Spark-on-K8s.

3) Overall dataflow

Main data flow:

  1. Data sources -> Python ingest adapters
  2. Ingest -> Kafka topics
  3. Spark streaming -> Iceberg bronze tables
  4. Spark batch -> silver/gold feature tables
  5. Cassandra serving (weather/openaq)
  6. ML training -> model registry
  7. ML inference -> prediction table
  8. PM2.5 API -> reads the prediction table (no Spark runs inside a request)

4) Data sources and ingest

Data sources:

  • WeatherAPI history (or local JSON)
  • OpenAQ hourly
  • Sentinel-5P metadata API
  • MAIAC metadata (NASA CMR)
  • ERA5 surface and pressure-level

Ingest layer:

5) Kafka topics

Main topics:

  • weather_history
  • openaq-hourly
  • sentinel5p-summary
  • maiac-summary
  • era5-files
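
As an illustration of how an ingest adapter might hand records to these topics, here is a minimal sketch. The topic names come from the list above; the source labels, the envelope fields (`source`, `ingested_at`, `payload`), and the function name are assumptions made for the example, not the repo's actual adapter code.

```python
import json
from datetime import datetime, timezone
from typing import Tuple

# Topic names are taken from the list above; the dictionary keys are
# hypothetical source labels chosen for this sketch.
SOURCE_TOPICS = {
    "weather": "weather_history",
    "openaq": "openaq-hourly",
    "sentinel5p": "sentinel5p-summary",
    "maiac": "maiac-summary",
    "era5": "era5-files",
}


def to_kafka_message(source: str, key: str, payload: dict) -> Tuple[str, bytes, bytes]:
    """Return (topic, key_bytes, value_bytes) for one ingested record."""
    topic = SOURCE_TOPICS[source]  # an unknown source raises KeyError
    envelope = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    value = json.dumps(envelope, sort_keys=True).encode("utf-8")
    return topic, key.encode("utf-8"), value
```

The returned triple can be passed to whichever Kafka producer client the adapters use; keeping serialization separate from the producer call makes the routing easy to unit-test without a broker.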

Topic creation script: scripts/create_topics.sh
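
The script's contents are not reproduced in this document. As a rough sketch of what creating the five topics involves, the snippet below builds standard `kafka-topics.sh --create` invocations. The bootstrap address, partition count, and replication factor are assumptions for a local single-broker setup, not values taken from the repo.

```python
from typing import List

TOPICS = [
    "weather_history",
    "openaq-hourly",
    "sentinel5p-summary",
    "maiac-summary",
    "era5-files",
]


def create_topic_command(
    topic: str,
    bootstrap: str = "localhost:9092",  # assumed local broker address
    partitions: int = 3,                # assumed value
    replication: int = 1,               # assumed single-broker dev setup
) -> List[str]:
    """Build the argv for an idempotent kafka-topics.sh --create call."""
    return [
        "kafka-topics.sh", "--create", "--if-not-exists",
        "--bootstrap-server", bootstrap,
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication),
    ]


if __name__ == "__main__":
    for name in TOPICS:
        print(" ".join(create_topic_command(name)))
```

`--if-not-exists` makes the command safe to re-run, which is useful when topic creation is part of a bootstrap step.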

6) Spark jobs (streaming + batch)

6.1 Streaming -> Iceberg bronze

6.2 Silver/Gold (Hanoi PM2.5)

6.3 Tier-2 trajectory (HYSPLIT)

7) Iceberg tables and namespaces

Tables are bootstrapped by:

Main namespaces:

  • ais.weather, ais.air_quality, ais.satellite, ais.features, ais.models, ais.trajectory, ais.predictions
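
A minimal sketch of bootstrapping these namespaces through Spark SQL follows. It assumes `ais` is the name of the Iceberg catalog configured in Spark and that each listed entry is a single-level namespace under it; the repo's actual bootstrap job may differ.

```python
from typing import List

NAMESPACES = [
    "weather",
    "air_quality",
    "satellite",
    "features",
    "models",
    "trajectory",
    "predictions",
]


def namespace_ddl(catalog: str = "ais") -> List[str]:
    """Return idempotent CREATE NAMESPACE statements, one per namespace."""
    return [f"CREATE NAMESPACE IF NOT EXISTS {catalog}.{ns}" for ns in NAMESPACES]


if __name__ == "__main__":
    # With a live SparkSession named `spark`, each statement would be
    # executed via spark.sql(stmt); here they are only printed.
    for stmt in namespace_ddl():
        print(stmt)
```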

8) Cassandra serving

Cassandra currently serves 2 serving tables:

  • ais_serving.weather_hourly_by_province_day
  • ais_serving.openaq_hourly_by_city_parameter_day
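
The table names suggest that rows are partitioned by an entity plus a calendar day (province and day for weather; city, parameter, and day for OpenAQ). The sketch below shows how such a day bucket could be derived from an observation timestamp. Bucketing by UTC day and the function name are assumptions; the real schema and its column names are not shown in this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def weather_partition_key(province: str, observed_at: datetime) -> Tuple[str, str]:
    """Return a (province, day) pair, with day as an ISO date in UTC (assumed)."""
    if observed_at.tzinfo is None:
        # Naive timestamps are ambiguous, so refuse them rather than guess.
        raise ValueError("observed_at must be timezone-aware")
    day = observed_at.astimezone(timezone.utc).date().isoformat()
    return province, day


if __name__ == "__main__":
    # 02:00 on 28 May in UTC+7 is still 27 May in UTC.
    ict = timezone(timedelta(hours=7))
    print(weather_partition_key("Hanoi", datetime(2026, 5, 28, 2, 0, tzinfo=ict)))
```

Bucketing by day keeps each partition bounded to at most 24 hourly rows per entity, which is the usual motivation for this kind of table layout.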

Sync job:

9) Airflow orchestration

Main DAGs:

Helper: airflow/dags/ais_dag_utils.py

10) ML training, inference, and PM2.5 API

Training/Inference scripts:

PM2.5 API (read-only serving):

K8s manifest:

Prediction table:

  • ais.predictions.hanoi_pm25_forecast_gold

Model registry table:

  • ais.models.hanoi_pm25_model_registry_gold
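
The PM2.5 API is described as read-only serving over the prediction table, with no Spark work inside a request. One way to honor that constraint is to keep an in-memory snapshot that is refreshed outside the request path, so a request is only a dictionary lookup. The sketch below illustrates that idea; the class, the row fields (`target_time`, `issued_at`, `pm25`), and the refresh mechanism are all hypothetical and are not taken from the repo.

```python
from typing import Dict, Iterable, Optional


class ForecastCache:
    """Snapshot of forecast rows, keeping the newest forecast per target time."""

    def __init__(self) -> None:
        self._latest: Dict[str, dict] = {}

    def refresh(self, rows: Iterable[dict]) -> None:
        """Rebuild the snapshot; meant to run on a schedule, not per request."""
        latest: Dict[str, dict] = {}
        for row in rows:
            current = latest.get(row["target_time"])
            # ISO-8601 strings in one fixed format compare chronologically.
            if current is None or row["issued_at"] > current["issued_at"]:
                latest[row["target_time"]] = row
        self._latest = latest  # swap in the new snapshot in one assignment

    def get(self, target_time: str) -> Optional[dict]:
        """Request-path lookup: a plain dict read, no query engine involved."""
        return self._latest.get(target_time)
```

Building the new dictionary first and assigning it at the end means a lookup never sees a half-refreshed snapshot.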

11) K8s runtime details (where it runs)

  • K8s workloads run in the ais namespace.
  • Spark submit creates a Kubernetes Job that runs spark-submit in the ais-spark-runtime image; that Job in turn creates the driver/executor pods.
  • The ais-runtime-config ConfigMap holds the Kafka/HDFS/Iceberg/Cassandra endpoints and runtime parameters.
  • On local Docker Desktop, K8s pods reach the Compose services through host.docker.internal.
  • The HDFS datanode must publish port 9866 and advertise a suitable host so that pods can read blocks successfully.
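
To make the points above concrete, here is a sketch of the kind of spark-submit argument list such a Job might run. The namespace and image name come from the list above; the API server URL, executor count, application path, and the choice to set `dfs.client.use.datanode.hostname` on the client side are assumptions for illustration, not the repo's actual configuration.

```python
from typing import List


def spark_submit_args(
    app: str,
    k8s_api: str = "https://kubernetes.default.svc",  # assumed in-cluster API URL
    executors: int = 2,                               # assumed value
) -> List[str]:
    """Build a spark-submit argv for cluster mode on Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{k8s_api}",
        "--deploy-mode", "cluster",
        "--conf", "spark.kubernetes.namespace=ais",
        "--conf", "spark.kubernetes.container.image=ais-spark-runtime",
        "--conf", f"spark.executor.instances={executors}",
        # Assumption: have the HDFS client connect to datanodes by their
        # advertised hostname, which complements the port 9866 note above.
        "--conf", "spark.hadoop.dfs.client.use.datanode.hostname=true",
        app,
    ]


if __name__ == "__main__":
    # The application path below is a hypothetical placeholder.
    print(" ".join(spark_submit_args("local:///opt/ais/jobs/example_job.py")))
```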

Related documents:

12) Monitoring and UI

Monitoring UI:

UI frontend:

  • Demo with mock data in ui

13) Quick start (reference)

14) Current status (summary)

  • Core pipeline (ingest -> Kafka -> Spark -> Iceberg -> Cassandra): implemented.
  • TODO1 Hanoi PM2.5 silver/gold: implemented.
  • TODO2 Tier-2 trajectory: implemented.
  • TODO3 K8s compute layer: implemented, but needs hardening for production (monitoring, quotas, secrets, stable endpoints).
  • UI: demo on mock data, not yet connected to production data.