The TransBeamer library provides utilities for reading and writing data in various formats within Apache Beam pipelines, using Avro-backed PCollections as the interim representation.
The goal of the library is to make it easy for Beam pipelines to read any
supported data format into a PCollection whose elements are described by an
Avro schema, and then, once processing is done, to write those Avro-backed
PCollections back out to a variety of formats.
- Multiple Format Support: Read and write CSV, Avro, Parquet, GCP Pubsub, and more
- Consistent Reading/Writing API: One API for multiple formats
- Extensible Format Support: Write your own formats as needed
- Avro-Centric: Uses Avro as the intermediate data format for strong schema and coder support
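
To make the "one API" point concrete, here is a minimal end-to-end sketch that reads CSV and writes Parquet, assembled from the calls demonstrated later in this README. The StarWarsMovie class is the Avro-generated record defined below; the "input" and "output" paths are placeholders, and the package locations of the format classes are assumptions:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.values.PCollection;

import com.sanuscorp.transbeamer.TransBeamer;
import com.sanuscorp.transbeamer.CsvFormat;     // package location assumed
import com.sanuscorp.transbeamer.ParquetFormat; // package location assumed
import com.sanuscorp.transbeamer.samples.avro.StarWarsMovie;

public final class CsvToParquet {

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();

        // Read CSV files from "input" into an Avro-backed PCollection
        PCollection<StarWarsMovie> movies = pipeline.apply(
            TransBeamer.newReader(CsvFormat.create(), "input", StarWarsMovie.class)
        );

        // Write the same records back out as Parquet under "output"
        movies.apply(
            TransBeamer.newWriter(ParquetFormat.create(), "output", StarWarsMovie.class)
        );

        pipeline.run().waitUntilFinish();
    }
}
```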
Managing data as it flows through a Beam pipeline is a chore. This is especially true when you end up with custom POJOs intermixed with other types. The trade-offs between different DTO designs are not clear. This library exists to make one potential solution easy to implement: use Avro for every DTO.
Describing the objects within your pipeline with Avro Schema has a lot of benefits: broad tool support, strong typing support, builder and immutability pattern support, and many others. It serves as a good common denominator in a larger, stratified data format world.
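
As a small illustration, an Avro-generated record such as the StarWarsMovie class defined later in this README comes with a builder out of the box. This is standard Avro code generation, not a TransBeamer-specific API:

```java
// Construct an immutable record via the builder Avro generates
// from the schema's fields (year, title, rating).
StarWarsMovie movie = StarWarsMovie.newBuilder()
    .setYear(1977)
    .setTitle("A New Hope")
    .setRating(8.6)
    .build();
```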
With Maven:

```xml
<dependency>
    <groupId>com.sanuscorp</groupId>
    <artifactId>transbeamer</artifactId>
    <version>2.1.0</version>
</dependency>
```

Or with Gradle:

```groovy
implementation 'com.sanuscorp:transbeamer:2.1.0'
```

To run live examples, clone this repository ...
```sh
git clone https://github.com/sanuscorp/transbeamer
```

and run the following (with a Java 21+ JDK installed):

```sh
cd transbeamer
./gradlew welcome
```

... to see details on running the included examples.

The com.sanuscorp.transbeamer.TransBeamer class contains static methods for
creating reader (.newReader(...)) and writer (.newWriter(...)) PTransform
instances.
In all cases, the first argument is a FileFormat or DataFormat
implementation that specifies the format to read or write. Use the static
factory methods on the provided implementations to configure the details of
each format:
| Data Format | Format Class | Description |
|---|---|---|
| ✅ CSV | CsvFormat | Comma-separated values |
| ✅ Avro | AvroFormat | Apache Avro binary format |
| ✅ Parquet | ParquetFormat | Columnar storage format |
| ✅ NDJson | NDJsonFormat | Newline-delimited JSON |
| ✅ GCP Pubsub | PubsubFormat | Google Cloud Platform Pubsub Topics/Subscriptions |
When reading or writing FileIO-based formats (e.g., CSV, Avro, Parquet),
the second argument is the location to read or write the files from. When
reading or writing other formats (e.g., Pubsub), the location is specified in
the format itself. The final argument is the Avro-generated Class instance
that will be used in the related PCollection.
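
In code, the two shapes look like this (a sketch using only calls shown elsewhere in this README; StarWarsMovie is the Avro-generated class from the next section, and the paths and topic name are placeholders):

```java
// File-based format: (format, location, Avro class)
PCollection<StarWarsMovie> movies = pipeline.apply(
    TransBeamer.newReader(CsvFormat.create(), "input", StarWarsMovie.class)
);

// Non-file format: the location lives in the format itself,
// so the transform takes only (format, Avro class)
movies.apply(
    TransBeamer.newWriter(
        PubsubFormat.withTopic("/projects/my-project/topics/my-topic-name"),
        StarWarsMovie.class
    )
);
```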
The API is straightforward to demonstrate in a few examples.
Describe your data as an Avro schema. For instance:

```json
{
"namespace": "com.sanuscorp.transbeamer.samples.avro",
"type": "record",
"name": "StarWarsMovie",
"fields": [
{"name": "year", "type": "int"},
{"name": "title", "type": "string"},
{"name": "rating", "type": "double"}
],
"javaAnnotation": [
"org.apache.beam.sdk.schemas.annotations.DefaultSchema(org.apache.beam.sdk.extensions.avro.schemas.AvroRecordSchema.class)",
"org.apache.beam.sdk.coders.DefaultCoder(org.apache.beam.sdk.extensions.avro.coders.AvroCoder.class)"
]
}
```

Generate the corresponding Java class with Avro's code generation tooling,
then use TransBeamer to create a new reader configured to read CSV from the local
"input" directory with a file prefix of "starwars":
```java
Pipeline pipeline = Pipeline.create();
PCollection<StarWarsMovie> movies = pipeline.apply(
    TransBeamer.newReader(
        CsvFormat.create(),      // Choose the format here
        "input",                 // Where to read the CSV files from
        StarWarsMovie.class      // The Avro class to use
    ).withFilePrefix("starwars") // A prefix to filter the files on
);
```

To read data in other formats, use one of the other DataFormat implementations
(e.g., ParquetFormat.create()) when creating the reader.
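
For instance, a Parquet reader mirrors the CSV reader above (a sketch; the path is a placeholder, and withFilePrefix is omitted since this README only shows it with CSV):

```java
// Read Parquet files from the local "input" directory
PCollection<StarWarsMovie> moviesFromParquet = pipeline.apply(
    TransBeamer.newReader(
        ParquetFormat.create(), // Columnar file format
        "input",                // Where to read the Parquet files from
        StarWarsMovie.class     // The Avro class to use
    )
);
```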
Assuming you already have a PCollection backed by Avro objects, writing them
out is a straightforward affair:
```java
PCollection<StarWarsMovie> movies = /* created elsewhere */;
movies.apply(
    TransBeamer.newWriter(
        ParquetFormat.create(), // The format to write
        "build",                // Where to write the files
        StarWarsMovie.class     // The Avro class to write from
    )
);
```

TransBeamer can also transform your data to and from Pubsub. Create a Topic in your GCP project, and then ...
```java
// Read in CSV files
Pipeline pipeline = Pipeline.create();
final PCollection<StarWarsMovie> movies = pipeline.apply(
    TransBeamer.newReader(
        CsvFormat.create(), "input", StarWarsMovie.class
    )
);

// Write each entry as a Pubsub message
final String myTopic = "/projects/my-project/topics/my-topic-name";
movies.apply(
    TransBeamer.newWriter(
        PubsubFormat.withTopic(myTopic), // The topic to write to
        StarWarsMovie.class              // The Avro class of "movies"
    )
);
```
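
Going the other direction, reading from Pubsub should follow the same shape; this is a sketch based on the argument convention described earlier (non-file formats carry their location in the format itself), using the topic-based factory since that is the only one shown in this README:

```java
// Read Pubsub messages into an Avro-backed PCollection
PCollection<StarWarsMovie> received = pipeline.apply(
    TransBeamer.newReader(
        PubsubFormat.withTopic(myTopic), // The topic to read from
        StarWarsMovie.class              // The Avro class to decode into
    )
);
```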
TransBeamer requires:

- Java >= 11
- Apache Beam >= 2.63
- Apache Avro >= 1.11.X
Building requires Java 21+.
```sh
git clone https://github.com/sanuscorp/transbeamer.git
cd transbeamer
./gradlew build
```

Run the checks and tests with:

```sh
./gradlew check
...
./gradlew test
...
```

Contributions are welcome:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License—see the LICENSE file for details.
For questions, issues, or contributions, please:
- Open an issue on GitHub Issues
TransBeamer is developed and maintained by Sanus Software & Services.