Thanks to visit codestin.com
Credit goes to github.com

Skip to content

feat: apache spark integration#46

Merged
vim89 merged 22 commits intomainfrom
feat/spark-integration
Dec 11, 2025
Merged

feat: apache spark integration#46
vim89 merged 22 commits intomainfrom
feat/spark-integration

Conversation

@vim89
Copy link
Member

@vim89 vim89 commented Dec 11, 2025

Added

  • Apache Spark integration module (toon4s-spark) for DataFrame/Dataset TOON encoding
  • Schema alignment analyzer with benchmark-based scoring for production safety
  • Adaptive chunking to optimize prompt tax based on dataset size
  • Delta Lake CDC integration for real-time streaming
  • Iceberg time travel support for historical snapshot analysis
  • Production monitoring with health checks and telemetry
  • LLM client abstraction compatible with llm4s patterns

Changed

  • CI workflows updated for multi-module publishing
  • Unified versioning across toon4s-core and toon4s-spark

Removed

  • Old toon4s-compare module

vim89 added 14 commits December 11, 2025 13:54
- Remove var savingsPercent from ToonMetrics case class, make it a computed def
- Remove debugging println statements from production code
- Remove unused scala.util.Try import from ToonUDFs
- Rename HeadroomPercent.None to NoHeadroom to avoid shadowing Option.None
- Update documentation to clarify unsupported type handling
- Add SparkDatasetOps with Dataset[T] extension methods
- Provide compile-time type safety via Spark Encoders
- Support toToon(), toonMetrics(), fromToon() for typed Datasets
- Automatic key inference from type parameter
- Comprehensive scaladoc with usage examples
- Dataset tests created (WIP: needs Spark implicits refactor)

This enables users to work with type-safe Datasets instead of untyped DataFrames
while maintaining all TOON encoding capabilities.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a comprehensive Apache Spark integration for the TOON format library, enabling efficient DataFrame/Dataset encoding for LLM processing with a focus on production readiness and monitoring.

Summary: The integration adds Spark-specific TOON encoding capabilities with intelligent schema alignment detection, adaptive chunking to optimize token costs, and production monitoring utilities. It includes Delta Lake CDC streaming and Apache Iceberg time travel support for temporal analytics use cases.

Key Changes:

  • Spark DataFrame/Dataset to TOON conversion with chunking support and error handling
  • LLM client abstraction aligned with llm4s patterns for future compatibility
  • Delta Lake Change Data Feed and Apache Iceberg time travel integrations
  • Production monitoring framework with health checks, telemetry, and readiness reports

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated no comments.

Show a summary per file
File Description
spark-integration/src/test/scala/io/toonformat/toon4s/spark/llm/LlmClientTest.scala Test suite for LLM client abstractions including messages, conversations, completions, and error handling
spark-integration/src/test/scala/io/toonformat/toon4s/spark/ToonMetricsTest.scala Tests for token estimation, metrics calculation, and cost savings analysis
spark-integration/src/test/scala/io/toonformat/toon4s/spark/SparkToonOpsTest.scala Tests for DataFrame-TOON conversion, chunking, round-trip encoding/decoding
spark-integration/src/test/scala/io/toonformat/toon4s/spark/SparkJsonInteropTest.scala Tests for Row-JsonValue conversion with various data types
spark-integration/src/main/scala/.../monitoring/ToonMonitoring.scala Production monitoring utilities including health assessment and telemetry collection
spark-integration/src/main/scala/.../llm/package.scala LLM integration type aliases and helpers aligned with llm4s patterns
spark-integration/src/main/scala/.../llm/MockLlmClient.scala Mock LLM client implementation for testing
spark-integration/src/main/scala/.../llm/Message.scala Message types for LLM conversations (System, User, Assistant)
spark-integration/src/main/scala/.../llm/LlmError.scala Error hierarchy for LLM operations with retry logic
spark-integration/src/main/scala/.../llm/LlmClient.scala LLM client trait and configuration types
spark-integration/src/main/scala/.../llm/Completion.scala Completion response types with token usage tracking
spark-integration/src/main/scala/.../integrations/IcebergTimeTravel.scala Apache Iceberg time travel integration for historical snapshots
spark-integration/src/main/scala/.../integrations/DeltaLakeCDC.scala Delta Lake CDC integration for real-time streaming
spark-integration/src/main/scala/.../error/SparkToonError.scala Error types for Spark integration failures
spark-integration/src/main/scala/.../ToonUDFs.scala User-defined functions for Spark SQL
spark-integration/src/main/scala/.../ToonMetrics.scala Token metrics and savings calculations
spark-integration/src/main/scala/.../ToonAlignmentAnalyzer.scala Schema alignment detection based on benchmark data
spark-integration/src/main/scala/.../SparkToonOps.scala Core DataFrame extension methods for TOON operations
spark-integration/src/main/scala/.../SparkJsonInterop.scala Conversion utilities between Spark Row and JsonValue
spark-integration/src/main/scala/.../SparkDatasetOps.scala Type-safe Dataset[T] extension methods
spark-integration/src/main/scala/.../LlmClient.scala.old Legacy LLM client (marked as .old file)
spark-integration/src/main/scala/.../AdaptiveChunking.scala Adaptive chunking strategy to optimize prompt tax
spark-integration/examples/README.md Comprehensive examples documentation with production checklist
spark-integration/examples/ProductionMonitoringExample.scala Example demonstrating pre-deployment validation and monitoring
spark-integration/examples/IcebergTrendAnalysisExample.scala Example for quarterly trend analysis using Iceberg time travel
spark-integration/examples/DatabricksStreamingExample.scala Example for real-time fraud detection with Delta CDC
spark-integration/README.md Main documentation for Spark integration features and usage
build.sbt Build configuration adding spark-integration module

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions
Copy link
Contributor

Benchmark Mode Score Units
decode_list thrpt 390.49769240640103 ops/ms
decode_nested thrpt 280.8416838286611 ops/ms
decode_tabular thrpt 418.533224240633 ops/ms
encode_object thrpt 285.6996814876345 ops/ms

Dataset encoder resolution doesn't work in test framework due to implicits scope.
Production code works fine. Added compile-time verification tests instead.
Added component diagrams, data flow, and design principles.
Includes benchmark-driven decisions and real-world use cases.
Add toon4s-spark to CI checks and release workflow.
Both modules use unified versioning via sbt-dynver.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions
Copy link
Contributor

Benchmark Mode Score Units
decode_list thrpt 395.71916661211014 ops/ms
decode_nested thrpt 278.3191083023976 ops/ms
decode_tabular thrpt 415.63900994662856 ops/ms
encode_object thrpt 285.75445596779355 ops/ms

@vim89 vim89 merged commit d1c2ae1 into main Dec 11, 2025
16 checks passed
@vim89 vim89 deleted the feat/spark-integration branch December 11, 2025 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants