A high-performance, generalized process pooler and session manager for external language integrations in Elixir
{:snakepit, "~> 0.6"}
mix deps.get
./deps/snakepit/scripts/setup_python.sh
That's it! The script auto-detects uv (fast) or pip (fallback) and installs everything. Alternatively, install the Python dependencies manually:
cd deps/snakepit/priv/python
pip install -r requirements.txt
Then run mix test to verify everything works.
Snakepit is a battle-tested Elixir library that provides a robust pooling system for managing external processes (Python, Node.js, Ruby, R, etc.). Born from the need for reliable ML/AI integrations, it offers:
- Lightning-fast concurrent initialization - 1000x faster than sequential approaches
 - Session-based execution with automatic worker affinity
 - gRPC-based communication - Modern HTTP/2 protocol with streaming support
 - Native streaming support - Real-time progress updates and progressive results (gRPC)
 - Adapter pattern for any external language/runtime
 - Built on OTP primitives - DynamicSupervisor, Registry, GenServer
 - Production-ready with telemetry, health checks, and graceful shutdowns
 
- What's New in v0.6.7
 - What's New in v0.6.6
 - What's New in v0.6.5
 - What's New in v0.6.4
 - What's New in v0.6.3
 - What's New in v0.6.2
 - What's New in v0.6.1
 - What's New in v0.6.0
 - Breaking Changes (v0.5.0)
 - What's New in v0.5.1
 - What's New in v0.5
 - Quick Start
 - Installation
 - Core Concepts
 - Configuration
 - Usage Examples
 - gRPC Communication
 - Python Bridges
 - Built-in Adapters
 - Creating Custom Adapters
 - Session Management
 - Monitoring & Telemetry
 - Architecture Deep Dive
 - Performance
 - Troubleshooting
 - Contributing
 
The DSPy-specific integration (snakepit_bridge.dspy_integration) has been removed in v0.5.0 (deprecated in v0.4.3).
Why? Following clean architecture principles:
- Snakepit is a generic Python bridge (like JDBC for databases)
 - DSPy is a domain-specific library for prompt programming
 - Domain logic belongs in applications (DSPex), not infrastructure (Snakepit)
 
Affected Code
If you're importing these classes from Snakepit:
from snakepit_bridge.dspy_integration import (
    VariableAwarePredict,
    VariableAwareChainOfThought,
    VariableAwareReAct,
    VariableAwareProgramOfThought,
)
Migration Path
For DSPex users, update your imports to:
from dspex_adapters.dspy_variable_integration import (
    VariableAwarePredict,
    VariableAwareChainOfThought,
    VariableAwareReAct,
    VariableAwareProgramOfThought,
)
No API changes - it's a drop-in replacement.
For non-DSPex users, if you're using these classes directly:
- Option A: Switch to DSPex for DSPy integration
 - Option B: Copy the code to your project before v0.5.0
- Option C: Pin Snakepit to ~> 0.4.3 (not recommended)
Timeline
- v0.4.3 (Oct 2025): Deprecation warnings added, code still works
 - v0.5.0 (Oct 2025): DSPy integration removed from Snakepit
 
Documentation
Note: VariableAwareMixin (the base mixin) remains in Snakepit as it's generic and useful for any Python integration, not just DSPy.
Type system + performance + distributed telemetry - v0.6.7 delivers two major enhancements: Phase 1 of the type system improvements with a 6x performance boost, and a complete distributed telemetry system for full observability across Elixir clusters and Python workers.
- 6x JSON performance - Python bridge now uses orjson for serialization, delivering a 4-6x speedup for raw JSON operations and a 1.5x improvement for large payloads, with full backward compatibility (priv/python/tests/test_orjson_integration.py).
- Structured error types - New Snakepit.Error struct provides detailed context for debugging, with fields like category, message, details, python_traceback, and grpc_status (lib/snakepit/error.ex, test/unit/error_test.exs).
- Complete type specifications - All public API functions in the Snakepit module now have @spec annotations with structured error return types for better IDE support and Dialyzer analysis.
- Bidirectional telemetry streaming - Python workers emit events via gRPC that are re-emitted as Elixir :telemetry events for unified observability across your entire stack.
- 43 telemetry events - Complete event catalog across 3 layers (Infrastructure, Python Execution, gRPC Bridge) with atom-safe event names.
- Python telemetry API - High-level API with telemetry.emit() and telemetry.span() for automatic timing, plus correlation ID propagation.
- Runtime control - Adjust sampling rates, enable/disable telemetry, and filter events for individual workers without restarts.
- Integration ready - Works seamlessly with Prometheus, StatsD, OpenTelemetry, and other monitoring tools via :telemetry.attach().
- High performance - <10μs overhead per event, <1% CPU impact at 100% sampling, with bounded queues and graceful degradation.
 
See TELEMETRY.md for the complete telemetry guide with usage examples and integration patterns.
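As a quick sketch of consuming these events with the standard :telemetry API; the handler id and log line here are illustrative, while the event name and payload follow the request-tracking event listed under the v0.6.0 notes below:
:telemetry.attach(
  "myapp-snakepit-requests",          # illustrative handler id
  [:snakepit, :request, :executed],   # request-tracking event (see v0.6.0 notes)
  fn _event, %{duration_us: duration_us}, metadata, _config ->
    # Metadata carries %{pool, worker_id, command, success}.
    IO.puts("#{metadata.command} on #{metadata.worker_id}: #{duration_us}us")
  end,
  nil
)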
Zero breaking changes - All 235+ existing tests pass; full backward compatibility maintained while adding new functionality.
Bridge resilience + defensive defaults - v0.6.6 closes the last gaps from the critical bug sweep and documents the new reliability posture across the stack.
- Persistent worker ports & channel reuse - gRPC workers now cache the OS-assigned port and BridgeServer reuses the worker-owned channel before dialing a fallback, eliminating connection churn (test/unit/grpc/grpc_worker_ephemeral_port_test.exs, test/snakepit/grpc/bridge_server_test.exs).
- Hardened registries & quotas - ETS tables ship with :protected visibility and DETS handles stay private, while SessionStore enforces session/program quotas (test/unit/pool/process_registry_security_test.exs, test/unit/bridge/session_store_test.exs).
- Strict parameter validation - Tool invocations fail fast with descriptive errors when protobuf payloads contain malformed JSON or when parameters cannot be JSON encoded, keeping both client and server paths crash-free (test/snakepit/grpc/bridge_server_test.exs, test/unit/grpc/client_impl_test.exs).
- Actionable streaming fallback - When streaming support is disabled, BridgeServer.execute_streaming_tool/2 now returns an UNIMPLEMENTED RPC error with remediation hints so callers can downgrade gracefully (test/snakepit/grpc/bridge_server_test.exs).
- Metadata-driven pool routing - Worker registry entries publish pool identifiers so the pool manager resolves ownership without brittle string parsing; fallbacks log once for malformed IDs (test/unit/pool/pool_registry_lookup_test.exs).
- Streaming chunk contract - The streaming callback now receives consistent chunk_id/data/is_final payloads with metadata fan-out, documented alongside regression coverage (test/snakepit/streaming_regression_test.exs).
- Redacted diagnostics - The logger redaction helper now summarises sensitive payloads instead of dumping secrets or large blobs into logs (test/unit/logger/redaction_test.exs).
Release safety + lifecycle hardening - v0.6.5 fixes production boot regressions and closes gaps in worker shutdown so pools behave predictably during restarts.
- Release-friendly application start - Snakepit.Application no longer calls Mix.env/0, letting OTP releases boot without bundling Mix.
- Accurate worker teardown - Snakepit.Pool.WorkerSupervisor.stop_worker/1 now targets the worker starter supervisor and accepts either worker ids or pids, preventing leaked processes.
- Profile parity - Process and threaded worker profiles resolve worker ids through the registry, so lifecycle manager shutdowns succeed regardless of handle type.
- Regression coverage - Added unit suites covering supervisor stop/restart behaviour and profile-level shutdown helpers.
- Config-friendly thread limits - Partial overrides of :python_thread_limits merge with defaults, keeping startup resilient while allowing fine-grained tuning.
Streaming stability + tooling - v0.6.4 polishes the gRPC streaming path and supporting tooling so real-time updates flow as expected.
- Chunk-by-chunk pacing - Python bridge servers now yield streaming results incrementally, decoding payloads on the Elixir side with is_final, metadata, and callback guardrails.
- Showcase improvements - stream_progress supports configurable pacing and elapsed timings; examples/stream_progress_demo.exs prints rich updates.
- Regression guard - Added test/snakepit/streaming_regression_test.exs plus Python coverage executed via the new helper script.
- Instant pytest runs - ./test_python.sh regenerates protobuf stubs, activates .venv, wires PYTHONPATH, and forwards args to pytest.
Flexible Heartbeat Failure Handling - v0.6.3 introduces dependent/independent heartbeat modes, allowing workers to optionally continue running when Elixir heartbeats fail. Perfect for debugging scenarios or when you want Python workers to remain alive despite connectivity issues.
- Heartbeat Independence Mode - New dependent: false configuration option allows workers to survive heartbeat failures
- Environment-based Configuration - Heartbeat settings now passed via the SNAKEPIT_HEARTBEAT_CONFIG environment variable
- Python Test Coverage - Added comprehensive unit tests for dependent heartbeat termination behavior
- Default Heartbeat Enabled - Heartbeat monitoring now enabled by default for better production reliability
 
See the CHANGELOG for complete details.
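As a configuration sketch only: dependent: false and the SNAKEPIT_HEARTBEAT_CONFIG variable are documented above, but these notes don't spell out the config nesting, so the :heartbeat key below is an assumption:
# Hypothetical shape - the :heartbeat key is assumed; only the
# `dependent: false` flag itself is documented in these notes.
config :snakepit,
  heartbeat: %{
    dependent: false  # workers keep running when Elixir heartbeats fail
  }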
- Workers now shut down automatically when their heartbeat monitor crashes, ensuring unhealthy Python processes never get reused
 - Added end-to-end regression coverage that exercises missed heartbeat scenarios, validates registry cleanup, and confirms OS-level process termination
 - Extended heartbeat monitor regression guards to watch for drift across sustained ping/pong cycles
 
- Python bridge regression now verifies outbound metadata preserves correlation identifiers when proxying requests back to Elixir
- Expanded telemetry fixtures and test harnesses surface misconfigurations by defaulting SNAKEPIT_OTEL_CONSOLE to disabled during tests
- make test honors your project virtualenv, exports PYTHONPATH, and runs mix test --color for consistent local feedback loops
- Added heartbeat & observability deep-dive notes plus a consolidated testing command crib sheet under docs/20251019/
Snakepit v0.6.1 introduces fine-grained control over internal logging for cleaner output in production and demo environments.
- Centralized Log Control: New Snakepit.Logger module provides consistent logging across all internal modules
- Application-Level Configuration: Simple :log_level setting controls all Snakepit logs
- Five Log Levels: :debug, :info, :warning, :error, :none
- No Breaking Changes: Defaults to :info level for backward compatibility
Clean Output (Recommended for Production/Demos):
# config/config.exs
config :snakepit,
  log_level: :warning,  # Only warnings and errors
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_config: %{pool_size: 8}
# Also suppress gRPC logs
config :logger,
  level: :warning,
  compile_time_purge_matching: [
    [application: :grpc, level_lower_than: :error]
  ]
Verbose Logging (Development/Debugging):
# config/dev.exs
config :snakepit,
  log_level: :debug  # See everything
config :logger, level: :debug
Complete Silence:
config :snakepit,
  log_level: :none  # No Snakepit logs at all
With log_level: :warning:
- Worker initialization messages (hidden)
- Pool startup progress (hidden)
- Session creation logs (hidden)
- gRPC connection details (hidden)
- Tool registration confirmations (hidden)
- Warnings and errors (still shown)
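Since :log_level is ordinary application configuration, it can also be flipped at runtime, e.g. from IEx (a sketch; these notes don't state whether Snakepit.Logger picks the change up immediately):
# Assumes Snakepit.Logger reads :log_level per call.
Application.put_env(:snakepit, :log_level, :debug)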
 
Updated 25+ internal modules to use Snakepit.Logger:
- Snakepit.Config - Configuration validation
- Snakepit.Pool.* - Pool management, worker lifecycle
- Snakepit.Bridge.* - Session and tool management
- Snakepit.GRPC.* - gRPC communication
- Snakepit.Adapters.* - Adapter implementations
- Snakepit.Worker.* - Worker lifecycle
- Snakepit.Telemetry - Monitoring and metrics
- Cleaner Demos: Show only your application output, not infrastructure logs
 - Production Ready: Reduce log volume in production environments
 - Flexible Debugging: Turn on verbose logs when troubleshooting
 - Selective Visibility: Keep important warnings/errors while hiding noise
 
- New Snakepit.Telemetry.OpenTelemetry boots OTLP exporters when SNAKEPIT_ENABLE_OTLP=true
- Prometheus metrics server via Snakepit.TelemetryMetrics, covering heartbeat and worker execution stats
- Configurable exporters, ports, and resource attributes from config/config.exs
- Expanded docs set in ARCHITECTURE.md and new design blueprints for v0.7/v0.8 planning
- Snakepit.HeartbeatMonitor tracks per-worker liveness with configurable ping cadence and tolerances
- gRPC worker now emits heartbeat and execution telemetry, including tracing spans and correlation IDs
 - Python bridge ships heartbeat helpers and refactored threaded server instrumentation
 - New end-to-end tests exercise heartbeat failure detection and recovery paths
 
- Added snakepit_bridge.telemetry with OTLP-ready metrics and structured logging
- gRPC servers expose detailed request accounting, streaming stats, and thread usage insights
 - Telemetry unit tests guard the Python adapters and ensure compatibility across execution modes
 
- config/config.exs now ships safe defaults for OTLP, Prometheus, and heartbeat envelopes
- Sample scripts updated with new monitoring stories, plus fresh dual-mode demos and telemetry walkthroughs
- Additional docs under docs/2025101x/ capture upgrade strategies, design prompts, and heartbeat rollout guides
Snakepit v0.6.0 introduces a transformative dual-mode architecture enabling you to choose between multi-process workers (proven stability) and multi-threaded workers (Python 3.13+ free-threading). This positions Snakepit as the definitive Elixir/Python bridge for the next decade of ML/AI workloads.
- Many single-threaded Python processes
 - Process isolation and GIL compatibility
- Best for: I/O-bound workloads, high concurrency, legacy Python (≤ 3.12), thread-unsafe libraries
 - Proven: Battle-tested in v0.5.x with 250+ worker pools
 
- Few multi-threaded Python processes with shared memory
 - True CPU parallelism via free-threading (GIL-free)
 - Best for: CPU-bound workloads, Python 3.13+, large shared data (models, tensors)
- Performance: Up to 9.4x memory savings, 4x CPU throughput
 
Automatic worker recycling prevents memory leaks and ensures long-running pool health:
- TTL-based recycling: Workers automatically restart after configurable time (e.g., 2 hours)
 - Request-count recycling: Refresh workers after N requests (e.g., 5000 requests)
 - Memory threshold recycling: Recycle if worker memory exceeds limit (optional)
 - Graceful replacement: Zero-downtime worker rotation
 - Health monitoring: Periodic checks with automatic failure detection
 
config :snakepit,
  pools: [
    %{
      name: :default,
      worker_profile: :process,
      pool_size: 100,
      worker_ttl: {3600, :seconds},      # Recycle after 1 hour
      worker_max_requests: 5000          # Or after 5000 requests
    }
  ]
Production-grade observability for your worker pools:
# Interactive pool inspection
mix snakepit.profile_inspector
# Get optimization recommendations
mix snakepit.profile_inspector --recommendations
# Detailed worker stats
mix snakepit.profile_inspector --detailed
# JSON output for automation
mix snakepit.profile_inspector --format json
# System-wide scaling analysis with profile comparison
mix diagnose.scaling
Worker Lifecycle:
[:snakepit, :worker, :recycled]
# Measurements: none
# Metadata: %{worker_id, pool, reason, uptime, request_count}
[:snakepit, :worker, :health_check_failed]
# Measurements: none
# Metadata: %{worker_id, pool, error}
Pool Monitoring:
[:snakepit, :pool, :saturated]
# Measurements: %{queue_size, max_queue_size}
# Metadata: %{pool, available_workers, busy_workers}
[:snakepit, :pool, :capacity_reached]
# Measurements: %{capacity, load}
# Metadata: %{worker_pid, profile, rejected}
Request Tracking:
[:snakepit, :request, :executed]
# Measurements: %{duration_us}
# Metadata: %{pool, worker_id, command, success}
[:snakepit, :worker, :initialized]
# Measurements: %{initialization_time}
# Metadata: %{worker_id, pool}
See docs/telemetry_events.md for complete reference with usage examples.
Full support for Python's GIL removal (PEP 703):
- Automatic detection: Snakepit detects Python 3.13+ free-threading support
- Thread-safe adapters: Built-in ThreadSafeAdapter base class with locking primitives
- Safety validation: Runtime ThreadSafetyChecker detects concurrent access issues
- Library compatibility: Documented compatibility for 20+ popular libraries
 - Three proven patterns: Shared read-only, thread-local storage, locked mutable state
 
Thread-safe: NumPy, PyTorch, TensorFlow, Scikit-learn, XGBoost, Transformers, Requests, Polars
Use with care: Pandas, Matplotlib, SQLite3 (use with locking or the process profile)
Powerful multi-pool configuration with profile selection:
# Legacy single-pool config (still works!)
config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_size: 100
# New multi-pool config with different profiles
config :snakepit,
  pools: [
    # API workloads: Process profile for high concurrency
    %{
      name: :api_pool,
      worker_profile: :process,
      pool_size: 100,
      adapter_module: Snakepit.Adapters.GRPCPython,
      worker_ttl: {7200, :seconds}
    },
    # CPU workloads: Thread profile for Python 3.13+
    %{
      name: :compute_pool,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 16,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--max-workers", "16"],
      worker_ttl: {3600, :seconds},
      worker_max_requests: 1000
    }
  ]
Elixir:
- Snakepit.WorkerProfile - Behavior for pluggable parallelism strategies
- Snakepit.WorkerProfile.Process - Multi-process profile
- Snakepit.WorkerProfile.Thread - Multi-threaded profile
- Snakepit.Worker.LifecycleManager - Automatic worker recycling
- Snakepit.Diagnostics.ProfileInspector - Pool inspection
- Snakepit.Config - Multi-pool configuration
- Snakepit.Compatibility - Thread-safety database
- Snakepit.PythonVersion - Python 3.13+ detection
- mix snakepit.profile_inspector - Pool inspection Mix task
- Enhanced mix diagnose.scaling - Profile-aware scaling analysis
Python:
- grpc_server_threaded.py - Multi-threaded gRPC server
- base_adapter_threaded.py - Thread-safe adapter base
- thread_safety_checker.py - Runtime validation toolkit
- threaded_showcase.py - Thread-safe patterns showcase
Documentation:
- README_THREADING.md - Comprehensive threading guide
- docs/migration_v0.5_to_v0.6.md - Migration guide
- docs/performance_benchmarks.md - Quantified improvements
- docs/guides/writing_thread_safe_adapters.md - Complete tutorial
- docs/telemetry_events.md - Telemetry reference
100 concurrent operations:
  Process Profile: 15.0 GB (100 processes)
  Thread Profile:   1.6 GB (4 processes x 16 threads)
  Savings: 9.4x reduction!
Data processing jobs:
  Process Profile:   600 jobs/hour
  Thread Profile:  2,400 jobs/hour
  Improvement: 4x faster!
Pool initialization:
  Process Profile: 60s (100 workers, batched)
  Thread Profile:  24s (4 workers, fast threads)
  Improvement: 2.5x faster
100% backward compatible with v0.5.x - your existing code works unchanged:
# All v0.5.x configurations continue to work exactly as before
config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_size: 100
# API calls unchanged
{:ok, result} = Snakepit.execute("ping", %{})Extensive new documentation covering all features:
- Migration Guide - Zero-friction upgrade path
 - Performance Benchmarks - Quantified improvements
 - Thread Safety Guide - Complete tutorial
 - Telemetry Reference - Monitoring integration
 - Python Threading Guide - Python developer tutorial
 
- Python ≤ 3.12 (GIL present)
- I/O-bound workloads (APIs, web scraping, database queries)
- High concurrency needs (100-250 workers)
- Thread-unsafe libraries (Pandas, Matplotlib, SQLite3)
- Maximum process isolation
 
- Python 3.13+ with free-threading
- CPU-bound workloads (ML inference, data processing, numerical computation)
- Large shared data (models, configurations, lookup tables)
- Memory constraints (shared interpreter saves RAM)
- Thread-safe libraries (NumPy, PyTorch, Scikit-learn)
 
Run different workload types in separate pools with appropriate profiles!
# 1. Update dependency
    {:snakepit, "~> 0.6.7"}
# 2. No config changes required! But consider adding:
config :snakepit,
  pooling_enabled: true,
  pool_config: %{
    worker_ttl: {3600, :seconds},      # Prevent memory leaks
    worker_max_requests: 5000          # Automatic worker refresh
  }
# 3. Your code works unchanged
{:ok, result} = Snakepit.execute("command", %{})# Adopt thread profile for CPU workloads
config :snakepit,
  pools: [
    %{
      name: :default,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 16,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--max-workers", "16"]
    }
  ]
Dual-Mode (3 examples):
- examples/dual_mode/process_vs_thread_comparison.exs - Side-by-side performance comparison
- examples/dual_mode/hybrid_pools.exs - Multiple pools with different profiles
- examples/dual_mode/gil_aware_selection.exs - Automatic Python version detection
Lifecycle (1 example):
- examples/lifecycle/ttl_recycling_demo.exs - TTL-based worker recycling demonstration
Monitoring (1 example):
- examples/monitoring/telemetry_integration.exs - Telemetry setup and integration examples
Threading (1 example):
- examples/threaded_profile_demo.exs - Thread profile configuration patterns
Utility:
- examples/run_examples.exs - Automated example runner with status reporting
- New Modules: 14 Elixir files, 5 Python files
 - Test Coverage: 43 unit tests (93% pass rate) + 9 new test files
 - Example Scripts: 7 new working demos
 - Breaking Changes: ZERO
 - Backward Compatibility: 100%
 
- Phase 1: Complete - Foundation modules and behaviors defined
- Phase 2: Complete - Multi-threaded Python worker implementation
- Phase 3: Complete - Elixir thread profile integration
- Phase 4: Complete - Worker lifecycle management and recycling
- Phase 5: Complete - Enhanced diagnostics and monitoring
- Phase 6: In Progress - Additional documentation and examples
 
- 43 unit tests with 93% pass rate
 - 9 new test files for v0.6.0 features:
  - test/snakepit/compatibility_test.exs - Library compatibility matrix
  - test/snakepit/config_test.exs - Multi-pool configuration
  - test/snakepit/integration_test.exs - End-to-end integration
  - test/snakepit/multi_pool_execution_test.exs - Multi-pool execution
  - test/snakepit/pool_multipool_integration_test.exs - Pool integration
  - test/snakepit/python_version_test.exs - Python detection
  - test/snakepit/thread_profile_python313_test.exs - Python 3.13 threading
  - test/snakepit/worker_profile/process_test.exs - Process profile
  - test/snakepit/worker_profile/thread_test.exs - Thread profile
 - Comprehensive integration tests for multi-pool execution
 - Python 3.13 free-threading compatibility tests
 - Thread profile capacity management tests
 
- Fixed worker pool scaling limits - Pool now reliably scales to 250+ workers (previously limited to ~105)
 - Resolved thread explosion during concurrent startup - Fixed "fork bomb" caused by Python scientific libraries spawning excessive threads
 - Dynamic port allocation - Workers now use OS-assigned ports (port=0) eliminating port collision races
 - Batched worker startup - Configurable batch size and delay prevents system resource exhaustion
 - Enhanced resource limits - Added max_workers safeguard (1000) with comprehensive warnings
- New diagnostic tools - Added mix diagnose.scaling task for bottleneck analysis
- Aggressive thread limiting - Set OPENBLAS_NUM_THREADS=1, OMP_NUM_THREADS=1, MKL_NUM_THREADS=1 for optimal pool-level parallelism
- Batched startup configuration - startup_batch_size: 8, startup_batch_delay_ms: 750
- Increased resource limits - Extended port_range: 1000, GRPC backlog: 512, worker timeout: 30s
- Explicit port range constraints - Added configuration documentation and validation
 
- Successfully tested with 250 workers - Validated reliable operation at 2.5x previous limit
 - Eliminated port collision races - Dynamic port allocation prevents startup failures
 - Improved error diagnostics - Better logging and resource tracking during pool initialization
 - Enhanced GRPC server - Better port binding error handling and connection management
 
- Startup time increases with large pools (~60s for 250 workers vs ~10s for 100 workers)
 - Thread limiting optimizes for high concurrency; CPU-intensive tasks per worker may need adjustment
 - See commit dc67572 for detailed technical analysis and future considerations
 
- DSPy Integration Removed - As announced in v0.4.3
- Removed deprecated dspy_integration.py module
- Removed deprecated types.py with the VariableType enum
- Users must migrate to DSPex for DSPy functionality
 - See migration guide in deprecation notice above
 
- Comprehensive test improvements
- Added Supertester refactoring plan and Phase 1 foundation
- New assert_eventually helper for deterministic async testing
- Increased test coverage from 27 to 51 tests (+89%)
 - 37 Elixir tests + 15 Python tests passing
 
 
- Removed dead code and obsolete modules
- Streamlined Python SessionContext
 - Deleted obsolete backup files and unused modules
 - Cleaned up test infrastructure
- Created Python test infrastructure with test_python.sh
 
- Phase 1 completion report with detailed test results
 - Python cleanup and testing infrastructure summary
 - Enhanced test planning documentation
 
- Removed dead code - Deleted unused modules and aspirational APIs
 - Fixed adapter defaults - ShowcaseAdapter now default (fully functional)
 - DETS cleanup optimization - Prevents indefinite growth, fast startup
 - Atomic session creation - Eliminates race condition error logs
 - Python venv auto-detection - Automatically finds .venv for development
 - Issue #2 addressed - Simplified OTP patterns, removed redundant checks
 
- Complete installation guide - Platform-specific (Ubuntu, macOS, WSL, Docker)
 - ADR-001 - Architecture Decision Record for Worker.Starter pattern
 - External process supervision design - Multi-mode architecture (coupled, supervised, independent, distributed)
 - Issue #2 critical review - Comprehensive response to community feedback
 - Adapter selection guide - Clear explanation of TemplateAdapter vs ShowcaseAdapter
 - Example status clarity - Working vs WIP examples clearly marked
 
- Fixed ProcessRegistry DETS accumulation (1994+ stale entries)
 - Fixed race condition in concurrent session initialization
 - Fixed resource cleanup race (wait_for_worker_cleanup checked dead PID instead of actual resources)
 - Fixed example parameter mismatches
 - Fixed all ExDoc documentation warnings
 - Removed catch-all rescue clause (follows "let it crash")
 
- 100 workers: ~3 seconds initialization
 - 1400-1500 operations/second sustained
 - DETS cleanup: O(1) vs O(n) process checks
 
- New process_text tool - Text processing with upper, lower, reverse, length operations
- New get_stats tool - Real-time adapter and system monitoring with memory/CPU usage
- Fixed gRPC tool registration - Resolved async/sync issues with UnaryUnaryCall objects
 - Automatic session initialization - Sessions created automatically when Python tools register
 
- Remote tool dispatch - Complete bidirectional communication between Elixir and Python
 - Missing tool recovery - Added adapter_info, echo, process_text, get_stats to ShowcaseAdapter
 - Async/sync compatibility - Fixed gRPC stub handling with proper response processing
 - Enhanced error handling - Better diagnostics for tool registration failures
 
- Persistent process tracking with DETS storage survives BEAM crashes
 - Automatic orphan cleanup - no more zombie Python processes
 - Pre-registration pattern - Prevents orphans even during startup crashes
 - Immediate DETS persistence - No data loss on abrupt termination
 - Zero-configuration reliability - works out of the box
 - Production-ready - handles VM crashes, OOM kills, and power failures
 - See Process Management Documentation for details
 
- Real-time progress updates for long-running operations
 - HTTP/2 multiplexing for concurrent requests
 - Cancellable operations with graceful stream termination
 - Built-in health checks and rich error handling
 
- Automatic binary encoding for tensors and embeddings > 10KB
 - 5-10x faster than JSON for large numerical arrays
 - Zero configuration - works automatically
 - Backward compatible - smaller data still uses JSON
 - Modern architecture with protocol buffers
 
- Efficient binary transfers with protocol buffers
 - HTTP/2 multiplexing for concurrent operations
 - Native binary data handling perfect for ML models and images
 - 18-36% smaller message sizes for improved performance
 
- Complete example app at examples/snakepit_showcase
- Demonstrates all features including binary serialization
 - Performance benchmarks showing 5-10x speedup
 - Ready-to-run demos for all Snakepit capabilities
 
- Production-ready packaging with pip install support
 - Enhanced error handling and robust shutdown management
 - Console script integration for deployment flexibility
 - Type checking support with proper py.typed markers
 
- Deprecated V1 Python bridge in favor of V2 architecture
 - Updated demo implementations using latest best practices
 - Comprehensive documentation for all bridge implementations
 - Backward compatibility maintained for existing integrations
 
- Cross-language function execution - Call Python from Elixir and vice versa
 - Transparent tool proxying - Remote functions appear as local functions
 - Session-scoped isolation - Tools are isolated by session for multi-tenancy
 - Dynamic discovery - Automatic tool discovery and registration
 - See Bidirectional Tool Bridge Documentation for details
 
# In your mix.exs
def deps do
  [
    {:snakepit, "~> 0.5.1"}
  ]
end
# Configure with gRPC adapter
Application.put_env(:snakepit, :pooling_enabled, true)
Application.put_env(:snakepit, :adapter_module, Snakepit.Adapters.GRPCPython)
Application.put_env(:snakepit, :grpc_config, %{
  base_port: 50051,
  port_range: 100
})
Application.put_env(:snakepit, :pool_config, %{pool_size: 4})
{:ok, _} = Application.ensure_all_started(:snakepit)
# Execute commands with gRPC
{:ok, result} = Snakepit.execute("ping", %{test: true})
{:ok, result} = Snakepit.execute("compute", %{operation: "add", a: 5, b: 3})
# Session-based execution (maintains state)
{:ok, result} = Snakepit.execute_in_session("user_123", "echo", %{message: "hello"})
# Streaming operations for real-time updates
Snakepit.execute_stream("batch_process", %{items: [1, 2, 3]}, fn chunk ->
  IO.puts("Progress: #{chunk["progress"]}%")
end)
def deps do
  [
    {:snakepit, "~> 0.5.1"}
  ]
end
Or install the latest from GitHub:
def deps do
  [
    {:snakepit, github: "nshkrdotcom/snakepit"}
  ]
end
- Elixir 1.18+
 - Erlang/OTP 27+
 - External runtime (Python 3.8+, Node.js 16+, etc.) depending on adapter
 
Note: For detailed installation instructions (including platform-specific guides for Ubuntu, macOS, Windows/WSL, Docker, virtual environments, and troubleshooting), see the Complete Installation Guide.
For Python/gRPC integration (recommended):
# Using uv (recommended - faster and more reliable)
uv pip install grpcio grpcio-tools protobuf numpy
# Or use pip as fallback
pip install grpcio grpcio-tools protobuf numpy
# Using requirements file with uv
cd deps/snakepit/priv/python
uv pip install -r requirements.txt
# Or with pip
pip install -r requirements.txt
Automated Setup (Recommended):
# Use the setup script (detects uv/pip automatically)
./scripts/setup_python.sh
Manual Setup:
# Create venv and install with uv (fastest)
python3 -m venv .venv
source .venv/bin/activate
uv pip install -r deps/snakepit/priv/python/requirements.txt
# Or with pip
pip install -r deps/snakepit/priv/python/requirements.txt
# Generate Python gRPC code
make proto-python
# This creates the necessary gRPC stubs in priv/python/
Add to your config/config.exs:
config :snakepit,
  # Enable pooling (recommended for production)
  pooling_enabled: true,
  
  # Choose your adapter
  adapter_module: Snakepit.Adapters.GRPCPython,
  
  # Pool configuration
  pool_config: %{
    pool_size: System.schedulers_online() * 2,
    startup_timeout: 10_000,
    max_queue_size: 1000
  },
  
  # gRPC configuration
  grpc_config: %{
    base_port: 50051,
    port_range: 100,
    connect_timeout: 5_000
  },
  
  # Session configuration
  session_config: %{
    ttl: 3600,  # 1 hour default
    cleanup_interval: 60_000  # 1 minute
  }
In your application supervisor:
defmodule MyApp.Application do
  use Application
  def start(_type, _args) do
    children = [
      # Other children...
      {Snakepit.Application, []}
    ]
    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
Or start manually:
{:ok, _} = Application.ensure_all_started(:snakepit)
# Verify Python dependencies
python3 -c "import grpc; print('gRPC installed:', grpc.__version__)"
# Run tests
mix test
# Try an example
elixir examples/grpc_basic.exs
Expected output: you should see gRPC connections and successful command execution.
Troubleshooting: If you see ModuleNotFoundError: No module named 'grpc', the Python dependencies aren't installed. See the Installation Guide for help.
For custom Python functionality:
# priv/python/my_adapter.py
from snakepit_bridge.adapters.base import BaseAdapter
class MyAdapter(BaseAdapter):
    def __init__(self):
        super().__init__()
        # Initialize your libraries here
        
    async def execute_my_command(self, args):
        # Your custom logic
        result = do_something(args)
        return {"status": "success", "result": result}
Configure it:
# config/config.exs
config :snakepit,
  adapter_module: Snakepit.Adapters.GRPCPython,
  python_adapter: "my_adapter:MyAdapter"
# In IEx
iex> Snakepit.execute("ping", %{})
{:ok, %{"status" => "pong", "timestamp" => 1234567890}}
Adapters define how Snakepit communicates with external processes. They specify:
- The runtime executable (python3, node, ruby, etc.)
 - The bridge script to execute
 - Supported commands and validation
 - Request/response transformations
 
Each worker is a GenServer that:
- Owns one external process via Erlang Port
 - Handles request/response communication
 - Manages health checks and metrics
 - Auto-restarts on crashes
 
The pool manager:
- Starts workers concurrently on initialization
 - Routes requests to available workers
 - Handles queueing when all workers are busy
 - Supports session affinity for stateful operations
 
Sessions provide:
- State persistence across requests
 - Worker affinity (same session prefers same worker)
 - TTL-based expiration
 - Centralized storage in ETS
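Concretely, worker affinity falls out of reusing the same session ID with execute_in_session/3 (echo is a built-in adapter tool):
session_id = "user_123"
# Both calls prefer the same worker because they share a session.
{:ok, _} = Snakepit.execute_in_session(session_id, "echo", %{message: "first"})
{:ok, _} = Snakepit.execute_in_session(session_id, "echo", %{message: "second"})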
 
# config/config.exs
config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,  # gRPC-based communication
  # Control Snakepit's internal logging
  # Options: :debug, :info, :warning, :error, :none
  # Set to :warning or :none for clean output in production/demos
  log_level: :info,  # Default (balanced verbosity)
  grpc_config: %{
    base_port: 50051,    # Starting port for gRPC servers
    port_range: 100      # Port range for worker allocation
  },
  pool_config: %{
    pool_size: 8  # Default: System.schedulers_online() * 2
  }
# Optional: Also suppress gRPC library logs
config :logger,
  level: :warning,
  compile_time_purge_matching: [
    [application: :grpc, level_lower_than: :error]
  ]
# gRPC-specific configuration
config :snakepit,
  grpc_config: %{
    base_port: 50051,       # Starting port for gRPC servers
    port_range: 100,        # Port range for worker allocation
    connect_timeout: 5000,  # Connection timeout in ms
    request_timeout: 30000  # Default request timeout in ms
  }
The gRPC adapter automatically assigns unique ports to each worker within the specified range, ensuring isolation and parallel operation.
config :snakepit,
  # Pool settings
  pooling_enabled: true,
  pool_config: %{
    pool_size: 16
  },
  
  # Adapter
  adapter_module: MyApp.CustomAdapter,
  
  # Timeouts (milliseconds)
  pool_startup_timeout: 10_000,      # Max time for worker initialization
  pool_queue_timeout: 5_000,         # Max time in request queue
  worker_init_timeout: 20_000,       # Max time for worker to respond to init
  worker_health_check_interval: 30_000,  # Health check frequency
  worker_shutdown_grace_period: 2_000,   # Grace period for shutdown
  
  # Cleanup settings
  cleanup_retry_interval: 100,       # Retry interval for cleanup
  cleanup_max_retries: 10,          # Max cleanup retries
  
  # Queue management
  pool_max_queue_size: 1000         # Max queued requests before rejection
# Override configuration at runtime
Application.put_env(:snakepit, :adapter_module, Snakepit.Adapters.GenericJavaScript)
Application.stop(:snakepit)
Application.start(:snakepit)
Most examples use elixir directly (with Mix.install), but some v0.6.0 demos require the compiled project and use mix run:
# Basic gRPC examples (use elixir)
elixir examples/grpc_basic.exs                 # Simple ping, echo, add operations
elixir examples/grpc_sessions.exs              # Session management patterns
elixir examples/grpc_streaming.exs             # Streaming data operations
elixir examples/grpc_concurrent.exs            # Concurrent execution (default: 4 workers)
elixir examples/grpc_advanced.exs              # Advanced error handling
elixir examples/grpc_streaming_demo.exs        # Real-time streaming demo
# Bidirectional tool bridge (use elixir)
elixir examples/bidirectional_tools_demo.exs       # Interactive demo
elixir examples/bidirectional_tools_demo_auto.exs  # Auto-run server version
# v0.6.0 demos using compiled modules (use mix run)
mix run examples/threaded_profile_demo.exs                      # Thread profile config
mix run examples/dual_mode/process_vs_thread_comparison.exs    # Profile comparison
mix run examples/dual_mode/hybrid_pools.exs                    # Multiple pool profiles
mix run examples/dual_mode/gil_aware_selection.exs             # Auto Python version detection
mix run examples/lifecycle/ttl_recycling_demo.exs              # TTL worker recycling
mix run examples/monitoring/telemetry_integration.exs          # Telemetry setup
Status: 159/159 tests passing (100%) with default Python! All examples are production-ready.
Note: v0.6.0 feature demos access compiled Snakepit modules (Snakepit.PythonVersion, Snakepit.Compatibility, etc.) and require mix run to work properly.
These examples work out-of-the-box with the default ShowcaseAdapter:
# Basic gRPC operations (ping, echo, add)
elixir examples/grpc_basic.exs
# Concurrent execution and pool utilization (default: 4 workers)
elixir examples/grpc_concurrent.exs
# High-concurrency stress test (100 workers)
elixir examples/grpc_concurrent.exs 100
# Bidirectional tool bridge (Elixir ↔ Python tools)
elixir examples/bidirectional_tools_demo.exs
Performance: 1400-1500 ops/sec, 100 workers in ~3 seconds
All v0.6.0 examples showcase configuration patterns and best practices:
# Dual-mode architecture
elixir examples/dual_mode/process_vs_thread_comparison.exs    # Side-by-side comparison
elixir examples/dual_mode/hybrid_pools.exs                    # Multiple pools with different profiles
elixir examples/dual_mode/gil_aware_selection.exs             # Automatic Python 3.13+ detection
# Worker lifecycle management
elixir examples/lifecycle/ttl_recycling_demo.exs              # TTL-based automatic recycling
# Monitoring & telemetry
elixir examples/monitoring/telemetry_integration.exs          # Telemetry events setup
# Thread profile (Python 3.13+ free-threading)
elixir examples/threaded_profile_demo.exs                     # Thread profile configuration patterns
These examples demonstrate advanced features requiring additional tool implementations:
# Session management patterns
elixir examples/grpc_sessions.exs
# Streaming operations
elixir examples/grpc_streaming.exs
elixir examples/grpc_streaming_demo.exs
# Advanced error handling
elixir examples/grpc_advanced.exs
Note: Some advanced examples may require custom adapter tools. See Creating Custom Adapters for implementation details.
Prerequisites: Python dependencies installed (see Installation Guide)
# Basic ping/pong
{:ok, result} = Snakepit.execute("ping", %{})
# => %{"status" => "pong", "timestamp" => 1234567890}
# Computation
{:ok, result} = Snakepit.execute("compute", %{
  operation: "multiply",
  a: 7,
  b: 6
})
# => %{"result" => 42}
# With error handling
case Snakepit.execute("risky_operation", %{threshold: 0.5}) do
  {:ok, result} -> 
    IO.puts("Success: #{inspect(result)}")
  {:error, :worker_timeout} -> 
    IO.puts("Operation timed out")
  {:error, {:worker_error, msg}} -> 
    IO.puts("Worker error: #{msg}")
  {:error, reason} -> 
    IO.puts("Failed: #{inspect(reason)}")
end
For short-lived scripts, Mix tasks, or demos that need to execute and exit cleanly, use run_as_script/2:
# In a Mix task or script
Snakepit.run_as_script(fn ->
  # Your code here - all workers will be properly cleaned up on exit
  {:ok, result} = Snakepit.execute("process_data", %{data: large_dataset})
  IO.inspect(result)
end)
# With custom timeout for pool initialization
Snakepit.run_as_script(fn ->
  results = Enum.map(1..100, fn i ->
    {:ok, result} = Snakepit.execute("compute", %{value: i})
    result
  end)
  IO.puts("Processed #{length(results)} items")
end, timeout: 30_000)
This ensures:
- The pool waits for all workers to be ready before executing
 - All Python/external processes are properly terminated on exit
 - No orphaned processes remain after your script completes
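For instance, a Mix task wrapper might look like this sketch (the task name and body are illustrative):
defmodule Mix.Tasks.Myapp.Process do
  use Mix.Task

  @impl Mix.Task
  def run(_args) do
    # Start the app (and Snakepit with it), then run and clean up on exit.
    Mix.Task.run("app.start")

    Snakepit.run_as_script(fn ->
      {:ok, result} = Snakepit.execute("ping", %{})
      IO.inspect(result)
    end)
  end
end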
 
# Create a session with variables
session_id = "analysis_#{UUID.generate()}"
# Initialize session with variables
{:ok, _} = Snakepit.Bridge.SessionStore.create_session(session_id)
{:ok, _} = Snakepit.Bridge.SessionStore.register_variable(
  session_id, 
  "temperature", 
  :float, 
  0.7,
  constraints: %{min: 0.0, max: 1.0}
)
# Execute commands that use session variables
{:ok, result} = Snakepit.execute_in_session(session_id, "generate_text", %{
  prompt: "Tell me about Elixir"
})
# Update variables
:ok = Snakepit.Bridge.SessionStore.update_variable(session_id, "temperature", 0.9)
# List all variables
{:ok, vars} = Snakepit.Bridge.SessionStore.list_variables(session_id)
# Cleanup when done
:ok = Snakepit.Bridge.SessionStore.delete_session(session_id)
# Using SessionHelpers for ML program management
alias Snakepit.SessionHelpers
# Create an ML program/model
{:ok, response} = SessionHelpers.execute_program_command(
  "ml_session_123",
  "create_program",
  %{
    signature: "question -> answer",
    model: "gpt-3.5-turbo",
    temperature: 0.7
  }
)
program_id = response["program_id"]
# Execute the program multiple times
{:ok, result} = SessionHelpers.execute_program_command(
  "ml_session_123", 
  "execute_program",
  %{
    program_id: program_id,
    input: %{question: "What is the capital of France?"}
  }
)
# Configure gRPC adapter for streaming workloads
Application.put_env(:snakepit, :adapter_module, Snakepit.Adapters.GRPCPython)
Application.put_env(:snakepit, :grpc_config, %{
  base_port: 50051,
  port_range: 100
})
# Process large datasets with streaming
Snakepit.execute_stream("process_dataset", %{
  file_path: "/data/large_dataset.csv",
  chunk_size: 1000
}, fn chunk ->
  if chunk["is_final"] do
    IO.puts("Processing complete: #{chunk["total_processed"]} records")
  else
    IO.puts("Progress: #{chunk["progress"]}% - #{chunk["records_processed"]}/#{chunk["total_records"]}")
  end
end)
# ML inference with real-time results
Snakepit.execute_stream("batch_inference", %{
  model_path: "/models/resnet50.pkl",
  images: ["img1.jpg", "img2.jpg", "img3.jpg"]
}, fn chunk ->
  IO.puts("Processed #{chunk["image"]}: #{chunk["prediction"]} (#{chunk["confidence"]}%)")
end)
# Process multiple items in parallel across the pool
items = ["item1", "item2", "item3", "item4", "item5"]
tasks = Enum.map(items, fn item ->
  Task.async(fn ->
    Snakepit.execute("process_item", %{item: item})
  end)
end)
results = Task.await_many(tasks, 30_000)
Snakepit supports modern gRPC-based communication for advanced streaming capabilities, real-time progress updates, and superior performance.
# Step 1: Install gRPC dependencies
make install-grpc
# Step 2: Generate protocol buffer code
make proto-python
# Step 3: Test the upgrade
elixir examples/grpc_non_streaming_demo.exs
# Replace your adapter configuration with this:
Application.put_env(:snakepit, :adapter_module, Snakepit.Adapters.GRPCPython)
Application.put_env(:snakepit, :grpc_config, %{
  base_port: 50051,
  port_range: 100
})
# ALL your existing API calls work EXACTLY the same
{:ok, result} = Snakepit.execute("ping", %{})
{:ok, result} = Snakepit.execute("compute", %{operation: "add", a: 5, b: 3})
# PLUS you get new streaming capabilities
Snakepit.execute_stream("batch_inference", %{
  batch_items: ["image1.jpg", "image2.jpg", "image3.jpg"]
}, fn chunk ->
  IO.puts("Processed: #{chunk["item"]} - #{chunk["confidence"]}")
end)
| Feature | gRPC Non-Streaming | gRPC Streaming |
|---|---|---|
| Standard API | Full support | Full support | 
| Streaming | No | Real-time | 
| HTTP/2 Multiplexing | Yes | Yes | 
| Progress Updates | No | Live Updates | 
| Health Checks | Built-in | Built-in | 
| Error Handling | Rich Status | Rich Status | 
- Snakepit.GRPCWorker persists the actual OS-assigned port after handshake, so registry lookups always return a routable endpoint.
- Snakepit.GRPC.BridgeServer asks the worker for its cached GRPC.Stub, only dialing a fresh channel if the worker has not yet published one, eliminating per-call socket churn and cleaning up any fallback channel after use.
- Regression guardrails: test/unit/grpc/grpc_worker_ephemeral_port_test.exs ensures the stored port matches the runtime port, and test/snakepit/grpc/bridge_server_test.exs verifies BridgeServer prefers the worker-owned channel.
- Every callback receives a map with decoded JSON, an "is_final" flag, and optional _metadata fan-out. Binary payloads fall back to Base64 under "raw_data_base64".
- Chunk IDs and metadata come straight from ToolChunk, so you can correlate progress across languages.
- See test/snakepit/streaming_regression_test.exs for ordering guarantees and final chunk assertions.
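Putting the contract together, a callback can branch on the documented fields; handle_binary/1 below is a placeholder for your own decoding, not a Snakepit function:
Snakepit.execute_stream("stream_data", %{count: 5, delay: 0.5}, fn chunk ->
  cond do
    chunk["is_final"] ->
      IO.puts("stream complete")

    raw = chunk["raw_data_base64"] ->
      # Binary payloads arrive Base64-encoded, per the fallback above.
      handle_binary(Base.decode64!(raw))

    true ->
      IO.inspect(chunk["data"], label: "chunk #{chunk["chunk_id"]}")
  end
end)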
Use this for: Standard request-response operations
# Standard API for quick operations
{:ok, result} = Snakepit.execute("ping", %{})
{:ok, result} = Snakepit.execute("compute", %{operation: "multiply", a: 10, b: 5})
{:ok, result} = Snakepit.execute("info", %{})
# Session support works exactly the same
{:ok, result} = Snakepit.execute_in_session("user_123", "echo", %{message: "hello"})
When to use:
- You want better performance without changing your code
 - Your operations complete quickly (< 30 seconds)
 - You don't need progress updates
 - Standard request-response pattern
 
Use this for: Long-running operations with real-time progress updates
# NEW streaming API - get results as they complete
Snakepit.execute_stream("batch_inference", %{
  batch_items: ["img1.jpg", "img2.jpg", "img3.jpg"]
}, fn chunk ->
  if chunk["is_final"] do
    IO.puts("All done!")
  else
    IO.puts("Processed: #{chunk["item"]} - #{chunk["confidence"]}")
  end
end)
# Session-based streaming also available
Snakepit.execute_in_session_stream("session_123", "process_large_dataset", %{
  file_path: "/data/huge_file.csv"
}, fn chunk ->
  IO.puts("Progress: #{chunk["progress_percent"]}%")
end)When to use:
- Long-running operations (ML training, data processing)
 - You want real-time progress updates
 - Processing large datasets or batches
 - Better user experience with live feedback
 
# Install gRPC dependencies
make install-grpc
# Generate protocol buffer code
make proto-python
# Verify with non-streaming demo (same as your existing API)
elixir examples/grpc_non_streaming_demo.exs
# Try new streaming capabilities
elixir examples/grpc_streaming_demo.exs
# Configure gRPC
Application.put_env(:snakepit, :adapter_module, Snakepit.Adapters.GRPCPython)
Application.put_env(:snakepit, :grpc_config, %{base_port: 50051, port_range: 100})
# All your existing code works unchanged
{:ok, result} = Snakepit.execute("ping", %{})
{:ok, result} = Snakepit.execute("compute", %{operation: "add", a: 5, b: 3})
{:ok, result} = Snakepit.execute("info", %{})
# Sessions work exactly the same
{:ok, result} = Snakepit.execute_in_session("session_123", "echo", %{message: "hello"})
# Try it: elixir examples/grpc_non_streaming_demo.exs
ML Batch Inference with Real-time Progress:
# Process multiple items, get results as each completes
Snakepit.execute_stream("batch_inference", %{
  model_path: "/models/resnet50.pkl",
  batch_items: ["img1.jpg", "img2.jpg", "img3.jpg"]
}, fn chunk ->
  if chunk["is_final"] do
    IO.puts("All #{chunk["total_processed"]} items complete!")
  else
    IO.puts("#{chunk["item"]}: #{chunk["prediction"]} (#{chunk["confidence"]})")
  end
end)
Large Dataset Processing with Progress:
# Process huge datasets, see progress in real-time
Snakepit.execute_stream("process_large_dataset", %{
  file_path: "/data/huge_dataset.csv",
  chunk_size: 5000
}, fn chunk ->
  if chunk["is_final"] do
    IO.puts("Processing complete: #{chunk["final_stats"]}")
  else
    progress = chunk["progress_percent"]
    IO.puts("Progress: #{progress}% (#{chunk["processed_rows"]}/#{chunk["total_rows"]})")
  end
end)
Session-based Streaming:
# Streaming with session state
session_id = "ml_training_#{user_id}"
Snakepit.execute_in_session_stream(session_id, "distributed_training", %{
  model_config: training_config,
  dataset_path: "/data/training_set"
}, fn chunk ->
  if chunk["is_final"] do
    model_path = chunk["final_model_path"]
    IO.puts("Training complete! Model saved: #{model_path}")
  else
    epoch = chunk["epoch"]
    loss = chunk["train_loss"]
    acc = chunk["val_acc"]
    IO.puts("Epoch #{epoch}: loss=#{loss}, acc=#{acc}")
  end
end)
# Try it: elixir examples/grpc_streaming_demo.exs
gRPC Non-Streaming:
- Better performance: HTTP/2 multiplexing, protocol buffers
 - Built-in health checks: Automatic worker monitoring
 - Rich error handling: Detailed gRPC status codes
 - Zero code changes: Drop-in replacement
 
gRPC Streaming vs Traditional (All Protocols):
- Progressive results: Get updates as work completes
 - Constant memory: Process unlimited data without memory growth
 - Real-time feedback: Users see progress immediately
 - Cancellable operations: Stop long-running tasks mid-stream
 - Better UX: No more "is it still working?" uncertainty
 
Traditional (blocking):  Submit → Wait 10 minutes → Get all results
gRPC Non-streaming:     Submit → Get result faster (better protocol)
gRPC Streaming:         Submit → Get result 1 → Get result 2 → ...
Memory usage:           Fixed vs Grows with result size vs Constant
User experience:        "Wait..." vs "Wait..." vs Real-time updates
Cancellation:           Kill process vs Kill process vs Graceful stream close
Choose your mode based on your needs:
| Your Situation | Recommended Mode | Why | 
|---|---|---|
| Quick operations (< 30s) | gRPC Non-Streaming | Low latency, simple API | 
| Want better performance, same API | gRPC Non-Streaming | Drop-in upgrade | 
| Need progress updates | gRPC Streaming | Real-time feedback | 
| Long-running ML tasks | gRPC Streaming | See progress, cancel if needed | 
| Large dataset processing | gRPC Streaming | Memory efficient | 
Migration path:
Elixir:
# mix.exs
def deps do
  [
    {:grpc, "~> 0.8"},
    {:protobuf, "~> 0.12"},
    # ... other deps
  ]
end
Python:
# Using uv (recommended)
uv pip install grpcio protobuf grpcio-tools
# Or with pip
pip install 'snakepit-bridge[grpc]'
# Or manually with uv
uv pip install grpcio protobuf grpcio-tools
# Or manually with pip
pip install grpcio protobuf grpcio-tools
| Command | Description | Use Case |
|---|---|---|
| ping_stream | Heartbeat stream | Testing, monitoring |
| batch_inference | ML model inference | Computer vision, NLP |
| process_large_dataset | Data processing | ETL, analytics |
| tail_and_analyze | Log analysis | Real-time monitoring |
| distributed_training | ML training | Neural networks |
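Each of these is driven through the same streaming API; for example (callback body illustrative):
Snakepit.execute_stream("ping_stream", %{count: 3}, fn chunk ->
  IO.inspect(chunk)
end)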
For comprehensive gRPC documentation, see README_GRPC.md.
Snakepit automatically optimizes large data transfers using binary serialization:
# Small tensor (<10KB) - uses JSON automatically
{:ok, result} = Snakepit.execute("create_tensor", %{
  shape: [10, 10],  # 100 elements = 800 bytes
  name: "small_tensor"
})
# Large tensor (>10KB) - uses binary automatically
{:ok, result} = Snakepit.execute("create_tensor", %{
  shape: [100, 100],  # 10,000 elements = 80KB
  name: "large_tensor"
})
# Performance: 5-10x faster for large data!
# Embeddings - automatic binary for large batches
{:ok, embeddings} = Snakepit.execute("generate_embeddings", %{
  texts: ["sentence 1", "sentence 2", ...],  # 100+ sentences
  model: "sentence-transformers/all-MiniLM-L6-v2",
  dimensions: 384
})
# Image processing - binary for pixel data
{:ok, result} = Snakepit.execute("process_images", %{
  images: ["image1.jpg", "image2.jpg"],
  return_tensors: true  # Returns large tensors via binary
})
| Data Size | JSON Time | Binary Time | Speedup |
|---|---|---|---|
| 800B | 12ms | 15ms | 0.8x | 
| 20KB | 45ms | 18ms | 2.5x | 
| 80KB | 156ms | 22ms | 7.1x | 
| 320KB | 642ms | 38ms | 16.9x | 
- Automatic Detection: Data size calculated on serialization
 - Threshold: 10KB (10,240 bytes)
- Formats:
  - Small data: JSON (human-readable, debuggable)
  - Large data: Binary (Pickle on Python, ETF on Elixir)
 
 - Zero Configuration: Works out of the box
 
Explore all Snakepit features with our comprehensive showcase application:
# Navigate to showcase
cd examples/snakepit_showcase
# Install and run
mix setup
mix demo.all
# Or interactive mode
mix demo.interactive
- Basic Operations - Health checks, error handling
 - Session Management - Stateful operations, worker affinity
 - Streaming Operations - Real-time progress, chunked data
 - Concurrent Processing - Parallel execution, pool management
 - Variable Management - Type system, constraints, validation
 - Binary Serialization - Performance benchmarks, large data handling
 - ML Workflows - Complete pipelines with custom adapters
 
mix run -e "SnakepitShowcase.Demos.BinaryDemo.run()"
Shows:
- Automatic JSON vs binary detection
 - Side-by-side performance comparison
 - Real-world ML embedding examples
 - Memory efficiency metrics
 
See examples/snakepit_showcase/README.md for full documentation.
For detailed documentation on all Python bridge implementations (V1, V2, Enhanced, gRPC), see the Python Bridges section below.
Snakepit supports transparent cross-language function execution between Elixir and Python:
# Call Python functions from Elixir
{:ok, result} = ToolRegistry.execute_tool(session_id, "python_ml_function", %{data: input})
# Python can call Elixir functions transparently
# result = ctx.call_elixir_tool("parse_json", json_string='{"test": true}')
For comprehensive documentation on the bidirectional tool bridge, see README_BIDIRECTIONAL_TOOL_BRIDGE.md.
# Configure with gRPC for dedicated streaming and advanced features
Application.put_env(:snakepit, :adapter_module, Snakepit.Adapters.GRPCPython)
Application.put_env(:snakepit, :grpc_config, %{base_port: 50051, port_range: 100})
# Dedicated streaming capabilities
{:ok, _} = Snakepit.execute_stream("batch_inference", %{
  batch_items: ["img1.jpg", "img2.jpg", "img3.jpg"]
}, fn chunk ->
  IO.puts("Processed: #{chunk["item"]} - #{chunk["confidence"]}")
end)
- Native streaming - Progressive results and real-time updates
 - HTTP/2 multiplexing - Multiple concurrent requests per connection
 - Built-in health checks - Automatic worker health monitoring
 - Rich error handling - gRPC status codes with detailed context
 - Protocol buffers - Efficient binary serialization
 - Cancellable operations - Stop long-running tasks gracefully
 - Custom adapter support - Use third-party Python adapters via pool configuration
 
The gRPC adapter now supports custom Python adapters through pool configuration:
# Configure with a custom Python adapter (e.g., DSPy integration)
Application.put_env(:snakepit, :adapter_module, Snakepit.Adapters.GRPCPython)
Application.put_env(:snakepit, :pool_config, %{
  pool_size: 4,
  adapter_args: ["--adapter", "snakepit_bridge.adapters.dspy_grpc.DSPyGRPCHandler"]
})
# The adapter can provide custom commands beyond the standard set
{:ok, result} = Snakepit.Python.call("dspy.Predict", %{signature: "question -> answer"})
{:ok, result} = Snakepit.Python.call("stored.predictor.__call__", %{question: "What is DSPy?"})snakepit_bridge.adapters.dspy_grpc.DSPyGRPCHandler- DSPy integration for declarative language model programming- Supports DSPy modules (Predict, ChainOfThought, ReAct, etc.)
 - Python API with 
call,store,retrievecommands - Automatic signature parsing and field mapping
 - Session management for stateful operations
 
# Install gRPC dependencies
make install-grpc
# Generate protocol buffer code  
make proto-python
# Test with streaming demo
elixir examples/grpc_streaming_demo.exs
# Test with non-streaming demo
elixir examples/grpc_non_streaming_demo.exs
# Configure
Application.put_env(:snakepit, :adapter_module, Snakepit.Adapters.GenericJavaScript)
# Additional commands
{:ok, _} = Snakepit.execute("random", %{type: "uniform", min: 0, max: 100})
{:ok, _} = Snakepit.execute("compute", %{operation: "sqrt", a: 16})The default ShowcaseAdapter provides a comprehensive set of tools demonstrating Snakepit capabilities:
| Tool | Description | Parameters | Example |
|---|---|---|---|
| ping | Health check / heartbeat | None | Snakepit.execute("ping", %{}) |
| echo | Echo back all arguments | Any key-value pairs | Snakepit.execute("echo", %{message: "hello"}) |
| add | Add two numbers | a (number), b (number) | Snakepit.execute("add", %{a: 5, b: 3}) |
| adapter_info | Get adapter capabilities | None | Snakepit.execute("adapter_info", %{}) |
| process_text | Text operations | text (string), operation (upper/lower/reverse/length) | Snakepit.execute("process_text", %{text: "hello", operation: "upper"}) |
| get_stats | System & adapter stats | None | Snakepit.execute("get_stats", %{}) |
| Tool | Description | Parameters | Example |
|---|---|---|---|
| ml_analyze_text | ML-based text analysis | text (string) | Snakepit.execute("ml_analyze_text", %{text: "sample"}) |
| process_binary | Binary data processing | data (bytes), operation (checksum/etc) | Snakepit.execute("process_binary", %{data: binary, operation: "checksum"}) |
| Tool | Description | Parameters | Example |
|---|---|---|---|
| stream_data | Stream data in chunks | count (int), delay (float) | Snakepit.execute_stream("stream_data", %{count: 5, delay: 1.0}, callback) |
| ping_stream | Streaming heartbeat | count (int) | Snakepit.execute_stream("ping_stream", %{count: 10}, callback) |
| Tool | Description | Parameters | Example |
|---|---|---|---|
| concurrent_demo | Concurrent task execution | task_count (int) | Snakepit.execute("concurrent_demo", %{task_count: 3}) |
| call_elixir_demo | Call Elixir tools from Python | tool_name (string), tool params | Snakepit.execute("call_elixir_demo", %{tool_name: "parse_json", ...}) |
# Basic operations
{:ok, %{"status" => "pong"}} = Snakepit.execute("ping", %{})
{:ok, %{"result" => 8}} = Snakepit.execute("add", %{a: 5, b: 3})
# Text processing
{:ok, %{"result" => "HELLO", "success" => true}} =
  Snakepit.execute("process_text", %{text: "hello", operation: "upper"})
# System stats
{:ok, stats} = Snakepit.execute("get_stats", %{})
# Returns: %{"adapter" => %{"name" => "ShowcaseAdapter", ...}, "system" => %{...}}
# Streaming
Snakepit.execute_stream("stream_data", %{count: 5, delay: 0.5}, fn chunk ->
  IO.puts("Received chunk: #{inspect(chunk)}")
end)
For custom tools, see Creating Custom Adapters below.
Here's a real-world example of a data science adapter with session support:
# priv/python/data_science_adapter.py
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from snakepit_bridge.adapters.base import BaseAdapter
from snakepit_bridge.session_context import SessionContext
class DataScienceAdapter(BaseAdapter):
    def __init__(self):
        super().__init__()
        self.models = {}  # Store trained models per session
        
    def set_session_context(self, context: SessionContext):
        """Called when a session context is available."""
        self.session_context = context
        
    async def execute_load_data(self, args):
        """Load data from CSV and store in session."""
        file_path = args.get("file_path")
        if not file_path:
            raise ValueError("file_path is required")
            
        # Load data
        df = pd.read_csv(file_path)
        
        # Store basic info in session variables
        if self.session_context:
            await self.session_context.register_variable(
                "data_shape", "list", list(df.shape)
            )
            await self.session_context.register_variable(
                "columns", "list", df.columns.tolist()
            )
            
        return {
            "rows": len(df),
            "columns": len(df.columns),
            "column_names": df.columns.tolist(),
            "dtypes": df.dtypes.to_dict()
        }
        
    async def execute_preprocess(self, args):
        """Preprocess data with scaling."""
        data = args.get("data")
        target_column = args.get("target")
        
        # Convert to DataFrame
        df = pd.DataFrame(data)
        
        # Separate features and target
        X = df.drop(columns=[target_column])
        y = df[target_column]
        
        # Scale features
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # Store scaler parameters in session
        if self.session_context:
            session_id = self.session_context.session_id
            self.models[f"{session_id}_scaler"] = scaler
            
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X_scaled, y, test_size=0.2, random_state=42
        )
        
        return {
            "train_size": len(X_train),
            "test_size": len(X_test),
            "feature_means": scaler.mean_.tolist(),
            "feature_stds": scaler.scale_.tolist()
        }
        
    async def execute_train_model(self, args):
        """Train a model and store it."""
        model_type = args.get("model_type", "linear_regression")
        hyperparams = args.get("hyperparams", {})
        
        # Import the appropriate model
        if model_type == "linear_regression":
            from sklearn.linear_model import LinearRegression
            model = LinearRegression(**hyperparams)
        elif model_type == "random_forest":
            from sklearn.ensemble import RandomForestRegressor
            model = RandomForestRegressor(**hyperparams)
        else:
            raise ValueError(f"Unknown model type: {model_type}")
            
        # Train model (assume data is passed or stored)
        # ... training logic ...
        
        # Store model in session. Define model_id up front so it is
        # always bound for the return value, even without a session.
        model_id = model_type
        if self.session_context:
            session_id = self.session_context.session_id
            model_id = f"{session_id}_{model_type}"
            self.models[model_id] = model
            
            # Store model metadata as variables
            await self.session_context.register_variable(
                "current_model", "string", model_id
            )
            
        return {
            "model_id": model_id,
            "model_type": model_type,
            "training_complete": True
        }
# Usage in grpc_server.py or your bridge
adapter = DataScienceAdapter()
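On the Elixir side, an adapter like this would be wired in through the pool configuration, following the custom-adapter pattern shown earlier. A minimal sketch; the "--adapter" module path and the mapping of execute_<name> methods to "<name>" commands are assumptions for illustration:

```elixir
# Hypothetical wiring for the adapter above.
Application.put_env(:snakepit, :adapter_module, Snakepit.Adapters.GRPCPython)
Application.put_env(:snakepit, :pool_config, %{
  pool_size: 4,
  adapter_args: ["--adapter", "data_science_adapter.DataScienceAdapter"]
})

# Session affinity keeps these calls on the same worker, so the
# scaler and models stored in self.models remain available.
{:ok, info} = Snakepit.execute_in_session("ds_session", "load_data", %{
  file_path: "data/train.csv"
})
{:ok, result} = Snakepit.execute_in_session("ds_session", "train_model", %{
  model_type: "random_forest",
  hyperparams: %{n_estimators: 100}
})
```

For simpler use cases without session management: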
# my_simple_adapter.py
from snakepit_bridge import BaseCommandHandler, ProtocolHandler
from snakepit_bridge.core import setup_graceful_shutdown, setup_broken_pipe_suppression
class MySimpleHandler(BaseCommandHandler):
    def _register_commands(self):
        self.register_command("uppercase", self.handle_uppercase)
        self.register_command("word_count", self.handle_word_count)
    
    def handle_uppercase(self, args):
        text = args.get("text", "")
        return {"result": text.upper()}
    
    def handle_word_count(self, args):
        text = args.get("text", "")
        words = text.split()
        return {
            "word_count": len(words),
            "char_count": len(text),
            "unique_words": len(set(words))
        }
def main():
    setup_broken_pipe_suppression()
    
    command_handler = MySimpleHandler()
    protocol_handler = ProtocolHandler(command_handler)
    setup_graceful_shutdown(protocol_handler)
    
    protocol_handler.run()
if __name__ == "__main__":
    main()
Benefits of this package-based approach:
- No sys.path manipulation - proper package imports
 - Location independent - works from any directory
 - Production ready - can be packaged and installed
 - Enhanced error handling - robust shutdown and signal management
 - Type checking - full IDE support with proper imports
 
defmodule MyApp.RubyAdapter do
  @behaviour Snakepit.Adapter
  
  @impl true
  def executable_path do
    System.find_executable("ruby")
  end
  
  @impl true
  def script_path do
    Path.join(:code.priv_dir(:my_app), "ruby/bridge.rb")
  end
  
  @impl true
  def script_args do
    ["--mode", "pool-worker"]
  end
  
  @impl true
  def supported_commands do
    ["ping", "process_data", "generate_report"]
  end
  
  @impl true
  def validate_command("process_data", args) do
    if Map.has_key?(args, :data) do
      :ok
    else
      {:error, "Missing required field: data"}
    end
  end
  
  def validate_command("ping", _args), do: :ok
  def validate_command(cmd, _args), do: {:error, "Unsupported command: #{cmd}"}
  
  # Optional callbacks
  @impl true
  def prepare_args("process_data", args) do
    # Transform args before sending
    Map.update(args, :data, "", &String.trim/1)
  end
  
  @impl true
  def process_response("generate_report", %{"report" => report} = response) do
    # Post-process the response
    {:ok, Map.put(response, "processed_at", DateTime.utc_now())}
  end
  
  @impl true
  def command_timeout("generate_report", _args), do: 120_000  # 2 minutes
  def command_timeout(_command, _args), do: 30_000  # Default 30 seconds
end
#!/usr/bin/env ruby
# priv/ruby/bridge.rb
require 'grpc'
require_relative 'snakepit_services_pb'
class BridgeHandler
  def initialize
    @commands = {
      'ping' => method(:handle_ping),
      'process_data' => method(:handle_process_data),
      'generate_report' => method(:handle_generate_report)
    }
  end
  
  def run
    STDERR.puts "Ruby bridge started"
    
    loop do
      # gRPC server handles request/response automatically
    end
  end
  
  private
  
  def process_command(request)
    command = request['command']
    args = request['args'] || {}
    
    handler = @commands[command]
    if handler
      result = handler.call(args)
      {
        'id' => request['id'],
        'success' => true,
        'result' => result,
        'timestamp' => Time.now.iso8601
      }
    else
      {
        'id' => request['id'],
        'success' => false,
        'error' => "Unknown command: #{command}",
        'timestamp' => Time.now.iso8601
      }
    end
  rescue => e
    {
      'id' => request['id'],
      'success' => false,
      'error' => e.message,
      'timestamp' => Time.now.iso8601
    }
  end
  
  def handle_ping(args)
    { 'status' => 'ok', 'message' => 'pong' }
  end
  
  def handle_process_data(args)
    data = args['data'] || ''
    { 'processed' => data.upcase, 'length' => data.length }
  end
  
  def handle_generate_report(args)
    # Simulate report generation
    sleep(1)
    { 
      'report' => {
        'title' => args['title'] || 'Report',
        'generated_at' => Time.now.iso8601,
        'data' => args['data'] || {}
      }
    }
  end
end
# Handle signals gracefully
Signal.trap('TERM') { exit(0) }
Signal.trap('INT') { exit(0) }
# Run the bridge
BridgeHandler.new.run
alias Snakepit.Bridge.SessionStore
# Create a session
{:ok, session} = SessionStore.create_session("session_123", ttl: 7200)
# Store data in session
:ok = SessionStore.store_program("session_123", "prog_1", %{
  model: "gpt-4",
  temperature: 0.8
})
# Retrieve session data
{:ok, session} = SessionStore.get_session("session_123")
{:ok, program} = SessionStore.get_program("session_123", "prog_1")
# Update session
{:ok, updated} = SessionStore.update_session("session_123", fn session ->
  Map.put(session, :last_activity, DateTime.utc_now())
end)
# Check if session exists
true = SessionStore.session_exists?("session_123")
# List all sessions
session_ids = SessionStore.list_sessions()
# Manual cleanup
SessionStore.delete_session("session_123")
# Get session statistics
stats = SessionStore.get_stats()
Session quotas and safety:
- Configure quotas via :snakepit, :session_store (max_sessions, max_programs_per_session, max_global_programs); the defaults guard against unbounded growth while allowing :infinity overrides for trusted deployments (see the config sketch below).
- Attempting to exceed a quota returns tagged errors such as {:error, :session_quota_exceeded} or {:error, {:program_quota_exceeded, session_id}} so callers can surface actionable messages.
- Session state lives in :protected ETS tables owned by the SessionStore process; access it via the public API rather than touching ETS directly.
- Regression coverage lives in test/unit/bridge/session_store_test.exs, which exercises per-session quotas, global quotas, and reuse of existing program slots.
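A minimal config sketch using the keys named above; the numbers are illustrative assumptions, not Snakepit's documented defaults:

```elixir
# config/config.exs
config :snakepit, :session_store,
  max_sessions: 10_000,
  max_programs_per_session: 100,
  max_global_programs: 1_000
```

Callers can then pattern-match on the tagged errors ({:error, :session_quota_exceeded} and friends) to shed load or surface an actionable message instead of crashing.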
# Store programs accessible by any worker
:ok = SessionStore.store_global_program("template_1", %{
  type: "qa_template",
  prompt: "Answer the following question: {question}"
})
# Retrieve from any worker
{:ok, template} = SessionStore.get_global_program("template_1")Snakepit provides a comprehensive distributed telemetry system that enables full observability across your Elixir cluster and Python workers. All events flow through Elixir's standard :telemetry library.
See TELEMETRY.md for complete documentation.
# Monitor Python tool execution
:telemetry.attach(
  "my-app-monitor",
  [:snakepit, :python, :call, :stop],
  fn _event, %{duration: duration}, metadata, _ ->
    duration_ms = duration / 1_000_000
    Logger.info("Python call completed",
      command: metadata.command,
      duration_ms: duration_ms,
      worker_id: metadata.worker_id
    )
  end,
  nil
)
Infrastructure Events:
- [:snakepit, :pool, :worker, :spawned] - Worker ready and connected
- [:snakepit, :pool, :worker, :terminated] - Worker terminated
- [:snakepit, :pool, :status] - Periodic pool status
- [:snakepit, :session, :created|destroyed] - Session lifecycle
Python Execution Events (folded from Python):
- [:snakepit, :python, :call, :start|stop|exception] - Command execution
- [:snakepit, :python, :tool, :execution, :*] - Tool execution
- [:snakepit, :python, :memory, :sampled] - Resource metrics
gRPC Bridge Events:
- [:snakepit, :grpc, :stream, :*] - Streaming operations
- [:snakepit, :grpc, :call, :start|stop|exception] - gRPC calls
- [:snakepit, :grpc, :connection, :*] - Connection health
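To subscribe to several of these events with one handler, the standard :telemetry.attach_many/4 works as usual. A minimal sketch; the handler body is illustrative:

```elixir
require Logger

# Event names are taken from the lists above.
:telemetry.attach_many(
  "snakepit-lifecycle-monitor",
  [
    [:snakepit, :pool, :worker, :spawned],
    [:snakepit, :pool, :worker, :terminated],
    [:snakepit, :grpc, :call, :exception]
  ],
  fn event, _measurements, metadata, _config ->
    Logger.info(
      "snakepit event: #{inspect(event)} worker=#{inspect(Map.get(metadata, :worker_id))}"
    )
  end,
  nil
)
```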
from snakepit_bridge import telemetry
# Automatic timing with span
with telemetry.span("tool.execution", {"tool": "my_tool"}):
    result = expensive_operation()
# Custom metrics
telemetry.emit("tool.result_size", {"bytes": len(result)})Works seamlessly with:
- Prometheus - telemetry_metrics_prometheus
- StatsD - telemetry_metrics_statsd
- OpenTelemetry - opentelemetry_telemetry
- Custom handlers - Your own GenServer aggregators

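For the Prometheus route, a minimal sketch using Telemetry.Metrics with the telemetry_metrics_prometheus reporter; metric names follow the event lists above, and the :duration measurement key matches the handler example earlier (bucket configuration is reporter-specific, so check your version):

```elixir
defmodule MyApp.SnakepitMetrics do
  import Telemetry.Metrics

  def metrics do
    [
      # Count worker lifecycle events
      counter("snakepit.pool.worker.spawned.count"),
      counter("snakepit.pool.worker.terminated.count"),
      # Histogram of Python call durations
      distribution("snakepit.python.call.stop.duration",
        unit: {:native, :millisecond},
        reporter_options: [buckets: [1, 5, 10, 50, 100, 500]]
      )
    ]
  end
end

# In your supervision tree:
# children = [{TelemetryMetricsPrometheus, metrics: MyApp.SnakepitMetrics.metrics()}]
```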
stats = Snakepit.get_stats()
# Returns:
# %{
#   workers: 8,          # Total workers
#   available: 6,        # Available workers
#   busy: 2,            # Busy workers
#   requests: 1534,     # Total requests
#   queued: 0,          # Currently queued
#   errors: 12,         # Total errors
#   queue_timeouts: 3,  # Queue timeout count
#   pool_saturated: 0   # Saturation rejections
# }
┌────────────────────────────────────────────────────────┐
│                  Snakepit Application                  │
├────────────────────────────────────────────────────────┤
│                                                        │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │    Pool     │  │ SessionStore │  │ProcessRegistry│  │
│  │   Manager   │  │    (ETS)     │  │  (ETS + DETS) │  │
│  └──────┬──────┘  └──────────────┘  └───────────────┘  │
│         │                                              │
│  ┌──────▼──────────────────────────────────────────┐   │
│  │           WorkerSupervisor (Dynamic)            │   │
│  └──────┬──────────────────────────────────────────┘   │
│         │                                              │
│  ┌──────▼──────┐  ┌─────────────┐  ┌─────────────┐     │
│  │   Worker    │  │   Worker    │  │   Worker    │     │
│  │   Starter   │  │   Starter   │  │   Starter   │     │
│  │(Supervisor) │  │(Supervisor) │  │(Supervisor) │     │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘     │
│         │                │                │            │
│  ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐     │
│  │   Worker    │  │   Worker    │  │   Worker    │     │
│  │ (GenServer) │  │ (GenServer) │  │ (GenServer) │     │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘     │
│         │                │                │            │
└─────────┼────────────────┼────────────────┼────────────┘
          │                │                │
    ┌─────▼──────┐   ┌─────▼──────┐   ┌─────▼──────┐
    │  External  │   │  External  │   │  External  │
    │  Process   │   │  Process   │   │  Process   │
    │  (Python)  │   │  (Node.js) │   │   (Ruby)   │
    └────────────┘   └────────────┘   └────────────┘
- Concurrent Initialization: Workers start in parallel using Task.async_stream (see the sketch below)
- Permanent Wrapper Pattern: Worker.Starter supervises Workers for auto-restart
- Centralized State: All session data in ETS, workers are stateless
- Registry-Based: O(1) worker lookups and reverse PID lookups
- gRPC Communication: HTTP/2 protocol with streaming support
- Persistent Process Tracking: ProcessRegistry uses DETS for crash-resistant tracking
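A minimal sketch of the concurrent-initialization idea; start_worker is a hypothetical helper here, not Snakepit's internal API:

```elixir
# Illustrative only: start N workers concurrently and keep the ones
# that came up successfully.
start_worker = fn id ->
  # ... spawn the external process and wait for its init ping ...
  {:ok, id}
end

started =
  1..16
  |> Task.async_stream(fn i -> start_worker.("worker_#{i}") end,
    max_concurrency: System.schedulers_online() * 2,
    timeout: 10_000
  )
  |> Enum.flat_map(fn
    {:ok, {:ok, id}} -> [id]
    _failed -> []
  end)
```

Because each start runs in its own task, total startup time approaches the slowest single worker rather than the sum of all workers.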
 
- Startup:
  - Pool manager starts
  - Concurrently spawns N workers via WorkerSupervisor
  - Each worker starts its external process
  - Workers send init ping and register when ready

- Request Flow:
  - Client calls Snakepit.execute/3
  - Pool finds an available worker (with session affinity if applicable)
  - Worker sends request to external process
  - External process responds
  - Worker returns result to client

- Crash Recovery:
  - Worker crashes → Worker.Starter restarts it automatically
  - External process dies → Worker detects and crashes → restart
  - Pool crashes → Supervisor restarts entire pool
  - BEAM crashes → ProcessRegistry cleans orphans on next startup

- Shutdown:
  - Pool manager sends shutdown to all workers
  - Workers close ports gracefully (SIGTERM)
  - ApplicationCleanup ensures no orphaned processes (SIGKILL)
 
Configuration: 16 workers, gRPC Python adapter
Hardware: 8-core CPU, 32GB RAM
gRPC Performance:
Startup Time:
- Sequential: 16 seconds (1s per worker)
- Concurrent: 1.2 seconds (13x faster)
Throughput (gRPC Non-Streaming):
- Simple computation: 75,000 req/s
- ML inference: 12,000 req/s
- Session operations: 68,000 req/s
Latency (p99, gRPC):
- Simple computation: < 1.2ms
- ML inference: < 8ms
- Session operations: < 0.6ms
Streaming Performance:
- Throughput: 250,000 chunks/s
- Memory usage: Constant (streaming)
- First chunk latency: < 5ms
Connection overhead:
- Initial connection: 15ms
- Reconnection: 8ms
- Health check: < 1ms
- Pool Size: Start with System.schedulers_online() * 2 (see the config sketch below)
- Queue Size: Monitor pool_saturated errors and adjust
- Timeouts: Set appropriate timeouts per command type
- Session TTL: Balance memory usage vs cache hits
- Health Checks: Increase interval for stable workloads

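A starting-point configuration reflecting these guidelines; the values are illustrative, and only keys already shown elsewhere in this README (adapter_module, pool_config with pool_size) are used:

```elixir
# Illustrative starting point; tune from measurements, not defaults.
Application.put_env(:snakepit, :adapter_module, Snakepit.Adapters.GRPCPython)
Application.put_env(:snakepit, :pool_config, %{
  pool_size: System.schedulers_online() * 2
})
```

Per-command timeouts belong in the adapter itself; see the command_timeout/2 callback in the Ruby adapter example above for the pattern.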
Snakepit v0.3+ includes automatic binary serialization for large data transfers, providing significant performance improvements for ML/AI workloads that involve tensors, embeddings, and other numerical arrays.
- Automatic Detection: When variable data exceeds 10KB, Snakepit automatically switches from JSON to binary encoding
- Type Support: Currently optimized for tensor and embedding variable types
 - Protocol: Uses Erlang's native binary format (ETF) on Elixir side and Python's pickle on Python side
 
# Example: 1000x1000 tensor (8MB of float data)
# JSON encoding: ~500ms
# Binary encoding: ~50ms (10x faster!)
# Create a large tensor
{:ok, _} = Snakepit.execute_in_session("ml_session", "create_tensor", %{
  shape: [1000, 1000],
  fill_value: 0.5
})
# The tensor is automatically stored using binary serialization
# Retrieval is also optimized
{:ok, tensor} = Snakepit.execute_in_session("ml_session", "get_variable", %{
  name: "large_tensor"
})
The 10KB threshold (10,240 bytes) is optimized for typical workloads:
- Below 10KB: JSON encoding (better for debugging, human-readable)
 - Above 10KB: Binary encoding (better for performance)
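For intuition, a quick back-of-envelope check of which side of the threshold typical payloads land on; sizes assume 8 bytes per float and ignore serialization overhead, so treat them as order-of-magnitude estimates:

```elixir
# Illustrative arithmetic only.
threshold = 10_240

embedding_bytes = 512 * 8       # 512-dim embedding ≈ 4 KB  -> JSON
tensor_bytes = 100 * 100 * 8    # 100x100 tensor   ≈ 80 KB -> binary

IO.inspect(embedding_bytes > threshold)  # false
IO.inspect(tensor_bytes > threshold)     # true
```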
 
# In your Python adapter
import numpy as np
from snakepit_bridge import SessionContext
class MLAdapter:
    def process_embeddings(self, ctx: SessionContext, batch_size: int):
        # Generate large embeddings (e.g., 512-dimensional)
        embeddings = np.random.randn(batch_size, 512).tolist()
        
        # This automatically uses binary serialization if > 10KB
        ctx.register_variable("batch_embeddings", "embedding", embeddings)
        
        # Retrieval also handles binary data transparently
        stored = ctx["batch_embeddings"]
        return {"shape": [len(stored), len(stored[0])]}- 
Tensor Type:
- Metadata (JSON): 
{"shape": [dims...], "dtype": "float32", "binary_format": "pickle/erlang_binary"} - Binary data: Serialized flat array of values
 
 - Metadata (JSON): 
 - 
Embedding Type:
- Metadata (JSON): 
{"shape": [length], "dtype": "float32", "binary_format": "pickle/erlang_binary"} - Binary data: Serialized array of float values
 
 - Metadata (JSON): 
 
The following fields support binary data:
- Variable.binary_value: Stores large variable data
- SetVariableRequest.binary_value: Sets variable with binary data
- RegisterVariableRequest.initial_binary_value: Initial binary value
- BatchSetVariablesRequest.binary_updates: Batch binary updates
- ExecuteToolRequest.binary_parameters: Binary tool parameters
- Variable Types: Always use proper types (tensor, embedding) for large numerical data
- Batch Operations: Use batch updates for multiple large variables to minimize overhead
- Memory Management: Binary data is held in memory - monitor usage for very large datasets
- Compatibility: Binary format is internal - use standard types when sharing data externally
 
- Type Support: Currently only tensor and embedding types use binary serialization
- Format Lock-in: Binary data uses platform-specific formats (ETF/pickle)
- Debugging: Binary data is not human-readable in logs/inspection
 
# Check for orphaned processes
ps aux | grep grpc_server.py
# Verify ProcessRegistry is cleaning up
Snakepit.Pool.ProcessRegistry.get_stats()
# Check DETS file location
ls -la priv/data/process_registry.dets
# See detailed documentation
# README_PROCESS_MANAGEMENT.md
# Check adapter configuration
adapter = Application.get_env(:snakepit, :adapter_module)
adapter.executable_path()  # Should return valid path
File.exists?(adapter.script_path())  # Should return true
# Check logs for errors
Logger.configure(level: :debug)
# Enable port tracing
:erlang.trace(Process.whereis(Snakepit.Pool.Worker), true, [:receive, :send])
# Check external process logs
# Python: Add logging to bridge script
# Node.js: Check stderr output
# Monitor ETS usage
:ets.info(:snakepit_sessions, :memory)
# Check for orphaned processes
Snakepit.Pool.ProcessRegistry.get_stats()
# Force cleanup
Snakepit.Bridge.SessionStore.cleanup_expired_sessions()
# Enable debug logging
Logger.configure(level: :debug)
# Trace specific worker
:sys.trace(Snakepit.Pool.Registry.via_tuple("worker_1"), true)
# Get internal state
:sys.get_state(Snakepit.Pool)
- Telemetry & Observability - Comprehensive telemetry system guide
 - Testing Guide - How to run and write tests
 - Unified gRPC Bridge - Stage 0, 1, and 2 implementation details
 - Bidirectional Tool Bridge - Cross-language function execution between Elixir and Python
 - Process Management - Persistent tracking and orphan cleanup
 - gRPC Communication - Streaming and non-streaming gRPC details
 - Python Bridge Implementations - See sections above for V1, V2, Enhanced, and gRPC bridges
 
We welcome contributions! Please see our Contributing Guide for details.
# Clone the repo
git clone https://github.com/nshkrdotcom/snakepit.git
cd snakepit
# Install dependencies
mix deps.get
# Run tests
mix test
# Run example scripts
elixir examples/v2/session_based_demo.exs
elixir examples/javascript_grpc_demo.exs
# Check code quality
mix format --check-formatted
mix dialyzer
# All tests
mix test
# With coverage
mix test --cover
# Specific test
mix test test/snakepit_test.exs:42
Snakepit is released under the MIT License. See the LICENSE file for details.
- Inspired by the need for reliable ML/AI integrations in Elixir
 - Built on battle-tested OTP principles
 - Special thanks to the Elixir community
 
v0.5.1
- Worker pool scaling fixed - Reliably scales to 250+ workers (previously ~105 limit)
 - Thread explosion resolved - Fixed fork bomb from Python scientific libraries
 - Dynamic port allocation - OS-assigned ports eliminate collision races
 - Batched startup - Configurable batching prevents resource exhaustion
- New diagnostic tools - Added mix diagnose.scaling for bottleneck analysis
- Enhanced configuration - Thread limiting and resource management improvements
 
v0.5.0
- DSPy integration removed - Clean architecture separation achieved
- Test infrastructure enhanced - 89% increase in test coverage (27 → 51 tests)
 - Code cleanup complete - Significant dead code removed
 - Python SessionContext streamlined - Simplified implementation
 - Supertester foundation - Phase 1 complete with deterministic testing
 - gRPC streaming bridge - Full implementation with HTTP/2 multiplexing
 - Comprehensive documentation - All features well-documented
 
Roadmap
- Complete Supertester conformance (Phases 2-4)
 - Enhanced streaming operations and cancellation
 - Additional language adapters (Ruby, R, Go)
 - Advanced telemetry and monitoring features
 - Distributed worker pools
 
- Hex Package
 - API Documentation
 - GitHub Repository
 - Example Projects
 - Telemetry & Observability Guide
 - gRPC Bridge Documentation
 - Python Bridge Documentation - See sections above
 
Made with β€οΈ by NSHkr