TPC-H benchmark data generation in pure Rust
Auron is an accelerator for distributed computing frameworks (e.g., Spark) that leverages native vectorized execution to speed up query processing.
A library for building efficient set-membership filters and dictionaries based on the Satisfiability problem.
A composable and fully extensible C++ execution engine library for data management systems.
Apache DataFusion Comet Spark Accelerator
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
A configuration framework that enhances Claude Code with specialized commands, cognitive personas, and development methodologies.
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
ingestr is a CLI tool that seamlessly copies data between any databases with a single command.
Brotli4j provides Brotli compression and decompression for Java.
A library that provides useful extensions to Apache Spark and PySpark.
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
PySpark methods to enhance developer productivity 📣 👯 🎉
FHIR Core / OpenSRP 2 is a Kotlin application for delivering offline-capable, mobile-first healthcare project implementations from local community to national and international scale using FHIR and…
A collection of tools for extracting FHIR resources and analytics services on top of that data.
One line of code for data quality profiling and exploratory data analysis of Pandas and Spark DataFrames.
Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.
Metadata-driven Databricks Lakeflow Declarative Pipelines framework for bronze/silver pipelines.
Apache Beam is a unified programming model for batch and streaming data processing.