
Local Data Platform (LDP)


A complete local data engineering platform running on Minikube with Apache Airflow, Apache Spark, MinIO, PostgreSQL, and Apache Iceberg.

Supported Platforms: Linux | macOS | Windows

What is LDP?

LDP is a local development and testing environment for data engineering workflows. It brings enterprise-grade data tools to your laptop, allowing you to:

  • Learn data engineering concepts without cloud costs
  • Develop and test data pipelines locally before cloud deployment
  • Prototype new data architectures and workflows
  • Experiment with modern data tools (Airflow, Spark, Iceberg)
  • Run CI/CD tests for data pipelines

Important: Local Development Only

LDP is designed to run on your local machine using Minikube. It is NOT intended for production use or cloud deployment. For production workloads, consider:

  • Managed services (AWS EMR, Google Cloud Dataproc, Azure Synapse)
  • Kubernetes clusters on cloud providers (EKS, GKE, AKS)
  • Purpose-built data platforms (Databricks, Snowflake)

LDP gives you a realistic local environment to develop and test before deploying to these production platforms.

Why Use LDP?

Perfect For

  • ✅ Data Engineering Students - Learn industry-standard tools without AWS/GCP bills
  • ✅ Pipeline Development - Build and debug Airflow DAGs locally before cloud deployment
  • ✅ Testing & CI/CD - Run integration tests for data pipelines in GitHub Actions
  • ✅ Proof of Concepts - Validate data architecture decisions quickly
  • ✅ Tool Evaluation - Try Iceberg, Spark, or Airflow features risk-free

Not Suitable For

  • ❌ Production data workloads (use cloud services instead)
  • ❌ Large-scale data processing (limited by laptop resources)
  • ❌ Multi-user production environments
  • ❌ Long-running production jobs
  • ❌ Enterprise SLA requirements

Features

  • Apache Airflow - Workflow orchestration and scheduling
  • Apache Spark - Distributed data processing (batch and streaming)
  • MinIO - S3-compatible object storage
  • PostgreSQL - Metadata store for Airflow and Hive
  • Apache Iceberg - Modern table format with ACID transactions
  • Jupyter - Interactive development environment

📚 Getting Started Tutorial

New to LDP? Start with our comprehensive tutorial:

👉 Getting Started Tutorial - Complete hands-on guide with tested examples

The tutorial covers:

  • ✅ Platform setup for Windows, Linux, and macOS
  • ✅ Working with MinIO (S3-compatible storage)
  • ✅ Processing data with Spark
  • ✅ Managing Iceberg tables (ACID transactions, time travel)
  • ✅ Orchestrating workflows with Airflow
  • ✅ Building your own data pipelines
  • ✅ Production-ready examples and best practices

All tutorial code is tested and ready to use!

Quick Start

LDP works on macOS, Windows, and Linux. Choose your platform:

  • Windows - Use PowerShell scripts and Chocolatey/winget
  • macOS - Use Homebrew and native tools
  • Linux - Standard package managers

Prerequisites

Install the required tools: at a minimum you will need Docker (or another Minikube driver), Minikube, kubectl, Terraform, and GNU Make. Windows users can rely on the PowerShell scripts in scripts/windows/ instead of Make. See the Getting Started Tutorial for platform-specific install steps.

Setup and Deploy

Linux/macOS:

# 1. Initial setup (starts Minikube)
make setup

# 2. Deploy the platform
make start

# 3. Check service health
make health

Windows PowerShell:

# 1. Initial setup
.\scripts\windows\setup.ps1

# 2. Deploy the platform
.\scripts\windows\start.ps1

# 3. Check service health
.\scripts\windows\check-health.ps1

Access Services

After deployment, get your Minikube IP:

make minikube-ip

Access the services at:

  • Airflow UI: http://<minikube-ip>:30080
    • Username: admin
    • Password: admin
  • MinIO Console: http://<minikube-ip>:30901
    • Username: admin
    • Password: minioadmin
  • Spark Master UI: http://<minikube-ip>:30707
  • Jupyter: http://<minikube-ip>:30888
    • Get token: kubectl logs -n ldp deployment/jupyter

Alternative: Port Forwarding

If NodePort access doesn't work, use port forwarding:

make airflow-forward   # http://localhost:8080
make minio-forward     # http://localhost:9001
make spark-forward     # http://localhost:8080
make jupyter-forward   # http://localhost:8888

Project Structure

ldp/
├── terraform/          # Infrastructure as Code
│   ├── modules/        # Terraform modules (airflow, spark, minio, postgresql)
│   ├── environments/   # Environment-specific configs
│   └── helm-values/    # Custom Helm values
├── kubernetes/         # Additional K8s manifests
├── airflow/            # Airflow DAGs and plugins
├── spark/              # Spark jobs and libraries
├── docker/             # Custom Docker images
├── scripts/            # Utility scripts
├── data/               # Local data storage
├── config/             # Configuration files
├── tests/              # Integration and E2E tests
└── examples/           # Example code

Testing

LDP is tested across multiple platforms using GitHub Actions:

  • Windows - PowerShell scripts, Terraform, Python
  • macOS - Bash scripts, Terraform, Python
  • Linux - Full E2E tests with Minikube

See CI/CD Testing Documentation for details.

Common Operations

Managing the Platform

# Start the platform
make start

# Stop the platform
make stop

# Complete cleanup
make cleanup

# Check health
make health

# View pods
make pods

# View services
make services

Initialize MinIO Buckets

make init-minio
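
To verify the buckets from Python (for example inside Jupyter), point any S3 client at MinIO. Below is a minimal sketch using boto3; it assumes the S3 API is exposed on NodePort 30900 (the console runs on 30901) and that the console credentials listed above also work for the API. Adjust the endpoint and credentials if your deployment differs.

# verify_minio.py - list buckets and upload a test object
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://<minikube-ip>:30900",  # replace <minikube-ip> (see `make minikube-ip`)
    aws_access_key_id="admin",
    aws_secret_access_key="minioadmin",
)

# List the buckets created by `make init-minio`
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Upload a small test object ("raw" is an example bucket name)
s3.put_object(Bucket="raw", Key="hello.txt", Body=b"hello from LDP")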

Running Tests

# Run all tests
make test

# Run unit tests only
make test-unit

# Run integration tests
make test-int

Getting Started with Your Code

Start with a Clean Slate

LDP provides an empty project structure - a blank canvas for your data pipelines. The main directories (airflow/dags/, spark/jobs/, spark/lib/) are intentionally empty, giving you complete freedom to build your own solutions.

Option 1: Load Examples (Recommended for Learning)

Want to see working examples first? Load the example code:

make load-examples

This copies example DAGs, Spark jobs, libraries, and tests into your project directories. Great for:

  • Learning how to structure your code
  • Understanding integration patterns
  • Quick demos and testing
  • Starting point for customization
  • Running and exploring the test suite

Option 2: Start from Scratch

Ready to build your own? Just create files in the right places:

# Create your first DAG
vim airflow/dags/my_pipeline.py

# Create your first Spark job
vim spark/jobs/process_data.py
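
To give you a sense of what such a file contains, here is a minimal sketch of a hypothetical spark/jobs/process_data.py; the input path, column name, and aggregation are illustrative assumptions about your data, not files that ship with LDP.

# spark/jobs/process_data.py - minimal batch job sketch
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("process_data").getOrCreate()

# Read an input dataset (a hypothetical CSV dropped into data/raw/)
df = spark.read.csv("data/raw/my_dataset.csv", header=True, inferSchema=True)

# A trivial transformation: count rows per value of a "category" column
# (the column name is an assumption about your data)
summary = df.groupBy("category").agg(F.count("*").alias("rows"))

summary.show()
spark.stop()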

Where to write your code:

  • airflow/dags/ - Your workflow orchestration (DAGs)
  • spark/jobs/ - Your data processing logic
  • spark/lib/ - Reusable utilities and functions
  • data/raw/ - Your input datasets

📖 See Writing Code Guide for detailed instructions and best practices

Development Workflow

1. Write Your Code

# airflow/dags/my_etl.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG(
    'my_etl',
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # run only when triggered from the UI
    catchup=False,           # don't backfill runs since the start date
) as dag:
    task = PythonOperator(
        task_id='process',
        python_callable=lambda: print("Processing data!"),
    )

2. Add Your Data

cp ~/my_dataset.csv data/raw/

3. Deploy and Test

# Restart to load new code
make stop && make start

# Access Airflow UI and trigger your DAG
open http://$(minikube ip):30080
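
You can also start a run without the browser through Airflow's stable REST API, using the admin credentials listed earlier. This is a rough sketch and assumes the basic-auth API backend is enabled in this deployment:

# trigger_dag.py - start a run of the my_etl DAG via the Airflow REST API
import requests

AIRFLOW_URL = "http://<minikube-ip>:30080"  # replace <minikube-ip> (see `make minikube-ip`)

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/my_etl/dagRuns",
    auth=("admin", "admin"),  # default credentials from this README
    json={"conf": {}},
)
response.raise_for_status()
print("Started run:", response.json()["dag_run_id"])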

Working with Iceberg

See examples/iceberg_crud.py for complete examples:

# Create Iceberg table
spark.sql("""
    CREATE TABLE local.db.my_table (
        id BIGINT,
        data STRING
    ) USING iceberg
""")

Configuration

Environment Variables

Copy and customize the environment file:

cp config/env/.env.example .env

Terraform Variables

Customize deployment in terraform/environments/:

  • local.tfvars - Local development (default)

Apply with local configuration:

cd terraform
terraform apply -var-file=environments/local.tfvars

Troubleshooting

Pods Not Starting

# Check pod status
kubectl get pods -n ldp

# Describe problematic pod
kubectl describe pod <pod-name> -n ldp

# Check logs
kubectl logs <pod-name> -n ldp

Out of Resources

Increase Minikube resources:

minikube delete
minikube start --cpus=6 --memory=12288 --disk-size=60g

Persistent Volume Issues

# Check PVCs
kubectl get pvc -n ldp

# Delete and recreate
kubectl delete pvc <pvc-name> -n ldp
make start

Examples

The examples/ directory contains reference implementations:

examples/
├── simple_dag.py           # Basic Airflow DAG
├── spark_job.py            # Simple Spark job
├── iceberg_crud.py         # Iceberg table operations
├── minio_operations.py     # MinIO/S3 operations
├── dags/                   # Complete DAG examples
│   ├── example_spark_job.py
│   ├── data_ingestion/
│   └── data_transformation/
├── spark-jobs/             # Complete Spark job examples
│   ├── batch_processing.py
│   ├── streaming_job.py
│   └── iceberg_maintenance.py
└── spark-lib/              # Reusable library examples
    ├── transformations.py
    ├── data_quality.py
    └── utils.py

Load examples into your project:

make load-examples

This copies all examples to their respective directories for testing and learning.

Documentation

Getting Started

Understanding LDP

Operations & Deployment

Directory READMEs

Each major directory has its own README explaining its purpose:

See the Documentation Index for the complete list.

License

MIT License

Support

For issues and questions, please open an issue in the repository.
