A complete local data engineering platform running on Minikube with Apache Airflow, Apache Spark, MinIO, PostgreSQL, and Apache Iceberg.
Supported Platforms: Linux | macOS | Windows
LDP is a local development and testing environment for data engineering workflows. It brings enterprise-grade data tools to your laptop, allowing you to:
- Learn data engineering concepts without cloud costs
- Develop and test data pipelines locally before cloud deployment
- Prototype new data architectures and workflows
- Experiment with modern data tools (Airflow, Spark, Iceberg)
- Run CI/CD tests for data pipelines
LDP is designed to run on your local machine using Minikube. It is NOT intended for production use or cloud deployment. For production workloads, consider:
- Managed services (AWS EMR, Google Cloud Dataproc, Azure Synapse)
- Kubernetes clusters on cloud providers (EKS, GKE, AKS)
- Purpose-built data platforms (Databricks, Snowflake)
LDP gives you a realistic local environment to develop and test before deploying to these production platforms.
- ✅ Data Engineering Students - Learn industry-standard tools without AWS/GCP bills
- ✅ Pipeline Development - Build and debug Airflow DAGs locally before cloud deployment
- ✅ Testing & CI/CD - Run integration tests for data pipelines in GitHub Actions
- ✅ Proof of Concepts - Validate data architecture decisions quickly
- ✅ Tool Evaluation - Try Iceberg, Spark, or Airflow features risk-free
- ❌ Production data workloads (use cloud services instead)
- ❌ Large-scale data processing (limited by laptop resources)
- ❌ Multi-user production environments
- ❌ Long-running production jobs
- ❌ Enterprise SLA requirements
- Apache Airflow - Workflow orchestration and scheduling
- Apache Spark - Distributed data processing (batch and streaming)
- MinIO - S3-compatible object storage
- PostgreSQL - Metadata store for Airflow and Hive
- Apache Iceberg - Modern table format with ACID transactions
- Jupyter - Interactive development environment
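To make the architecture concrete, the sketch below shows roughly how a Spark session ties these pieces together: an Iceberg catalog (named local, matching the examples later in this README) whose warehouse lives in MinIO, reached through the S3A connector. LDP wires this up for you via Terraform and Helm, so the endpoint, credentials, warehouse path, and required jars shown here are illustrative placeholders, not the platform's actual configuration.

```python
# Illustration only: how Spark, Iceberg, and MinIO connect. LDP configures
# this for you; every endpoint, credential, and path below is a placeholder,
# and the iceberg-spark-runtime and hadoop-aws jars must be on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ldp-illustration")
    # Iceberg catalog named "local", backed by a Hadoop catalog in object storage
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://warehouse/")
    # MinIO exposed to Spark through the S3A connector
    .config("spark.hadoop.fs.s3a.endpoint", "http://<minio-endpoint>:9000")
    .config("spark.hadoop.fs.s3a.access.key", "admin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

spark.sql("SHOW NAMESPACES IN local").show()
```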
New to LDP? Start with our comprehensive tutorial:
📚 Getting Started Tutorial - Complete hands-on guide with tested examples
The tutorial covers:
- ✅ Platform setup for Windows, Linux, and macOS
- ✅ Working with MinIO (S3-compatible storage)
- ✅ Processing data with Spark
- ✅ Managing Iceberg tables (ACID transactions, time travel)
- ✅ Orchestrating workflows with Airflow
- ✅ Building your own data pipelines
- ✅ Production-ready examples and best practices
All tutorial code is tested and ready to use!
LDP works on macOS, Windows, and Linux. Choose your platform:
- Windows - Use PowerShell scripts and Chocolatey/winget
- macOS - Use Homebrew and native tools
- Linux - Standard package managers
Install the required tools (see the Setup Guide for detailed installation instructions), then run the quick-start commands for your platform:
Linux/macOS:
# 1. Initial setup (starts Minikube)
make setup
# 2. Deploy the platform
make start
# 3. Check service health
make health
Windows PowerShell:
# 1. Initial setup
.\scripts\windows\setup.ps1
# 2. Deploy the platform
.\scripts\windows\start.ps1
# 3. Check service health
.\scripts\windows\check-health.ps1
After deployment, get your Minikube IP:
make minikube-ip
Access the services at:
- Airflow UI: http://<minikube-ip>:30080
  - Username: admin
  - Password: admin
- MinIO Console: http://<minikube-ip>:30901
  - Username: admin
  - Password: minioadmin
- Spark Master UI: http://<minikube-ip>:30707
- Jupyter: http://<minikube-ip>:30888
  - Get token: kubectl logs -n ldp deployment/jupyter
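Beyond the consoles, you can script against MinIO from your host with any S3 client, for example to create buckets or stage input files before a run. Below is a minimal sketch using boto3 and the MinIO credentials above; the S3 API endpoint (which differs from the console port) and the bucket name are placeholders, so run make services to find the real NodePort.

```python
# Minimal sketch: talk to MinIO's S3 API with boto3 (pip install boto3).
# The endpoint port below is a placeholder - the S3 API listens on a different
# NodePort than the console (30901); check `make services` for the real one.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://<minikube-ip>:<s3-api-nodeport>",
    aws_access_key_id="admin",
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="raw-data")  # hypothetical bucket name
s3.upload_file("data/raw/my_dataset.csv", "raw-data", "my_dataset.csv")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```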
If NodePort access doesn't work, use port forwarding:
make airflow-forward # http://localhost:8080
make minio-forward # http://localhost:9001
make spark-forward # http://localhost:8080
make jupyter-forward # http://localhost:8888
ldp/
├── terraform/           # Infrastructure as Code
│   ├── modules/         # Terraform modules (airflow, spark, minio, postgresql)
│   ├── environments/    # Environment-specific configs
│   └── helm-values/     # Custom Helm values
├── kubernetes/          # Additional K8s manifests
├── airflow/             # Airflow DAGs and plugins
├── spark/               # Spark jobs and libraries
├── docker/              # Custom Docker images
├── scripts/             # Utility scripts
├── data/                # Local data storage
├── config/              # Configuration files
├── tests/               # Integration and E2E tests
└── examples/            # Example code
LDP is tested across multiple platforms using GitHub Actions:
- Windows - PowerShell scripts, Terraform, Python
- macOS - Bash scripts, Terraform, Python
- Linux - Full E2E tests with Minikube
See CI/CD Testing Documentation for details.
# Start the platform
make start
# Stop the platform
make stop
# Complete cleanup
make cleanup
# Check health
make health
# View pods
make pods
# View services
make services
# Initialize MinIO
make init-minio
# Run all tests
make test
# Run unit tests only
make test-unit
# Run integration tests
make test-int
LDP provides an empty project structure - a blank canvas for your data pipelines. The main directories (airflow/dags/, spark/jobs/, spark/lib/) are intentionally empty, giving you complete freedom to build your own solutions.
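As you start adding DAGs to those empty directories, a cheap first unit test is an import check: it fails fast on syntax errors and missing dependencies in any file under airflow/dags/. A minimal pytest sketch you could drop into tests/ (the file name is a suggestion, it is not part of the shipped test suite, and it assumes apache-airflow is importable in your test environment):

```python
# tests/test_dag_imports.py (suggested name, not part of the shipped suite)
from airflow.models import DagBag

def test_dags_import_without_errors():
    # Load every DAG file under airflow/dags/ and collect import errors
    dag_bag = DagBag(dag_folder="airflow/dags", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"
```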
Want to see working examples first? Load the example code:
make load-examples
This copies example DAGs, Spark jobs, libraries, and tests into your project directories. Great for:
- Learning how to structure your code
- Understanding integration patterns
- Quick demos and testing
- Starting point for customization
- Running and exploring the test suite
Ready to build your own? Just create files in the right places:
# Create your first DAG
vim airflow/dags/my_pipeline.py
# Create your first Spark job
vim spark/jobs/process_data.py
Where to write your code:
- airflow/dags/ - Your workflow orchestration (DAGs)
- spark/jobs/ - Your data processing logic (a minimal job sketch follows below)
- spark/lib/ - Reusable utilities and functions
- data/raw/ - Your input datasets
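A job in spark/jobs/ is simply a PySpark script that builds its own SparkSession. Here is a minimal sketch of the process_data.py mentioned above; the input and output paths are placeholders you would point at data/raw/ or your MinIO buckets:

```python
# spark/jobs/process_data.py - minimal batch job sketch (paths are placeholders)
import sys

from pyspark.sql import SparkSession, functions as F

def main(input_path: str, output_path: str) -> None:
    spark = SparkSession.builder.appName("process_data").getOrCreate()

    # Read raw CSV input, add a processing timestamp, write Parquet output
    df = spark.read.option("header", "true").csv(input_path)
    result = df.withColumn("processed_at", F.current_timestamp())
    result.write.mode("overwrite").parquet(output_path)

    spark.stop()

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```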
📖 See Writing Code Guide for detailed instructions and best practices
# airflow/dags/my_etl.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
with DAG('my_etl', start_date=datetime(2024, 1, 1)) as dag:
    task = PythonOperator(
        task_id='process',
        python_callable=lambda: print("Processing data!")
    )
Copy your input data into place:
cp ~/my_dataset.csv data/raw/
# Restart to load new code
make stop && make start
# Access Airflow UI and trigger your DAG
open http://$(minikube ip):30080
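If you would rather trigger the DAG from a script than the UI, Airflow's stable REST API accepts an authenticated POST. The sketch below assumes the API is reachable on the same NodePort and that basic authentication is enabled for it, which depends on how the Airflow deployment is configured, so treat it as a starting point:

```python
# Sketch: trigger the my_etl DAG via Airflow's REST API (pip install requests).
# Assumes basic auth is enabled for the API; replace <minikube-ip> accordingly.
# Note: newly deployed DAGs start paused by default - unpause my_etl first.
import requests

response = requests.post(
    "http://<minikube-ip>:30080/api/v1/dags/my_etl/dagRuns",
    json={"conf": {}},          # optional run configuration
    auth=("admin", "admin"),    # credentials from the Quick Start section
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```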
See examples/iceberg_crud.py for complete examples:
# Create Iceberg table
spark.sql("""
CREATE TABLE local.db.my_table (
id BIGINT,
data STRING
) USING iceberg
""")
Copy and customize the environment file:
cp config/env/.env.example .env
Customize deployment in terraform/environments/:
- local.tfvars - Local development (default)
Apply with local configuration:
cd terraform
terraform apply -var-file=environments/local.tfvars
# Check pod status
kubectl get pods -n ldp
# Describe problematic pod
kubectl describe pod <pod-name> -n ldp
# Check logs
kubectl logs <pod-name> -n ldp
Increase Minikube resources:
minikube delete
minikube start --cpus=6 --memory=12288 --disk-size=60g
# Check PVCs
kubectl get pvc -n ldp
# Delete and recreate
kubectl delete pvc <pvc-name> -n ldp
make start
The examples/ directory contains reference implementations:
examples/
├── simple_dag.py            # Basic Airflow DAG
├── spark_job.py             # Simple Spark job
├── iceberg_crud.py          # Iceberg table operations
├── minio_operations.py      # MinIO/S3 operations
├── dags/                    # Complete DAG examples
│   ├── example_spark_job.py
│   ├── data_ingestion/
│   └── data_transformation/
├── spark-jobs/              # Complete Spark job examples
│   ├── batch_processing.py
│   ├── streaming_job.py
│   └── iceberg_maintenance.py
└── spark-lib/               # Reusable library examples
    ├── transformations.py
    ├── data_quality.py
    └── utils.py
Load examples into your project:
make load-examples
This copies all examples to their respective directories for testing and learning.
- 📚 Getting Started Tutorial - START HERE! Complete hands-on guide
- Setup Guide - Detailed installation instructions
- Writing Code Guide - Best practices for developing pipelines
- Platform Guides - Windows, macOS, Linux specific guides
- Project Structure - Directory layout and organization
- Hive vs Iceberg - Why we use Iceberg
- Iceberg Catalog - HadoopCatalog explained
- Production Guide - Deploying to production
- CI/CD Testing - Automated testing documentation
- Troubleshooting - Common issues and solutions
Each major directory has its own README explaining its purpose:
- airflow/ - Airflow DAG development
- spark/ - Spark job development
- examples/ - Example code library
- docker/ - Custom Docker images
- config/ - Configuration files
- terraform/ - Infrastructure as Code
- scripts/ - Utility scripts
- tests/ - Testing strategies
See the Documentation Index for the complete list.
MIT License
For issues and questions, please open an issue in the repository.