The TransBeamer library provides utilities for reading and writing data in various formats within Apache Beam pipelines, using Avro-backed PCollections as the interim representation.
The goal of the library is to make it easy for Beam pipelines to read any
supported data format into a PCollection whose elements are described by an
Avro schema, and then, once processing is done, to write those Avro-backed
PCollections back out to a variety of formats.
- Multiple Format Support: Read and write CSV, Avro, Parquet, GCP Pubsub, and more
- Consistent Reading/Writing API: One API for multiple formats
- Extensible Format Support: Write your own formats as needed
- Avro-Centric: Uses Avro as the intermediate data format for strong schema and coder support
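
To make the "one API" point concrete, here is a minimal end-to-end sketch that reads CSV and writes Parquet, assembled from the calls demonstrated later in this README. The StarWarsMovie class is the Avro-generated record defined below; the "input" and "output" paths are placeholders, and the package locations of the format classes are assumptions:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.values.PCollection;

import com.sanuscorp.transbeamer.TransBeamer;
import com.sanuscorp.transbeamer.CsvFormat;     // package location assumed
import com.sanuscorp.transbeamer.ParquetFormat; // package location assumed
import com.sanuscorp.transbeamer.samples.avro.StarWarsMovie;

public final class CsvToParquet {

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();

        // Read CSV files from "input" into an Avro-backed PCollection
        PCollection<StarWarsMovie> movies = pipeline.apply(
            TransBeamer.newReader(CsvFormat.create(), "input", StarWarsMovie.class)
        );

        // Write the same records back out as Parquet under "output"
        movies.apply(
            TransBeamer.newWriter(ParquetFormat.create(), "output", StarWarsMovie.class)
        );

        pipeline.run().waitUntilFinish();
    }
}
```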
Managing data as it flows through a Beam pipeline is a chore. This is especially true when you end up with custom POJOs intermixed with other types. The trade-offs between different DTO designs are not clear. This library exists to make one potential solution easy to implement: use Avro for every DTO.
Describing the objects within your pipeline with Avro Schema has a lot of benefits: broad tool support, strong typing support, builder and immutability pattern support, and many others. It serves as a good common denominator in a larger, stratified data format world.
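
As a small illustration, an Avro-generated record such as the StarWarsMovie class defined later in this README comes with a builder out of the box. This is standard Avro code generation, not a TransBeamer-specific API:

```java
// Construct an immutable record via the builder Avro generates
// from the schema's fields (year, title, rating).
StarWarsMovie movie = StarWarsMovie.newBuilder()
    .setYear(1977)
    .setTitle("A New Hope")
    .setRating(8.6)
    .build();
```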
With Maven:

```xml
<dependency>
    <groupId>com.sanuscorp</groupId>
    <artifactId>transbeamer</artifactId>
    <version>2.1.0</version>
</dependency>
```

Or with Gradle:

```groovy
implementation 'com.sanuscorp:transbeamer:2.1.0'
```

To run live examples, clone this repository ...
```sh
git clone https://github.com/sanuscorp/transbeamer
```

and run the following (with a Java 21+ JDK installed):

```sh
cd transbeamer
./gradlew welcome
```

... to see details on running the included examples.

The com.sanuscorp.transbeamer.TransBeamer class contains static methods for
creating reader (.newReader(...)) and writer (.newWriter(...)) PTransform
instances.
In all cases, the first argument is a FileFormat or DataFormat
implementation that specifies the format to read or write. Use the static
factory methods on the provided implementations to configure the details of
each format:
| Data Format | Format Class | Description |
|---|---|---|
| ✅ CSV | CsvFormat | Comma-separated values |
| ✅ Avro | AvroFormat | Apache Avro binary format |
| ✅ Parquet | ParquetFormat | Columnar storage format |
| ✅ NDJson | NDJsonFormat | Newline-delimited JSON |
| ✅ GCP Pubsub | PubsubFormat | Google Cloud Platform Pubsub Topics/Subscriptions |
When reading or writing FileIO-based formats (e.g., CSV, Avro, Parquet),
the second argument is the location to read or write the files from. When
reading or writing other formats (e.g., Pubsub), the location is specified in
the format itself. The final argument is the Avro-generated Class instance
that will be used in the related PCollection.
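
In code, the two shapes look like this (a sketch using only calls shown elsewhere in this README; StarWarsMovie is the Avro-generated class from the next section, and the paths and topic name are placeholders):

```java
// File-based format: (format, location, Avro class)
PCollection<StarWarsMovie> movies = pipeline.apply(
    TransBeamer.newReader(CsvFormat.create(), "input", StarWarsMovie.class)
);

// Non-file format: the location lives in the format itself,
// so the transform takes only (format, Avro class)
movies.apply(
    TransBeamer.newWriter(
        PubsubFormat.withTopic("/projects/my-project/topics/my-topic-name"),
        StarWarsMovie.class
    )
);
```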
The API is straightforward to demonstrate in a few examples.
Describe your data as an Avro schema. For instance:

```json
{
"namespace": "com.sanuscorp.transbeamer.samples.avro",
"type": "record",
"name": "StarWarsMovie",
"fields": [
{"name": "year", "type": "int"},
{"name": "title", "type": "string"},
{"name": "rating", "type": "double"}
],
"javaAnnotation": [
"org.apache.beam.sdk.schemas.annotations.DefaultSchema(org.apache.beam.sdk.extensions.avro.schemas.AvroRecordSchema.class)",
"org.apache.beam.sdk.coders.DefaultCoder(org.apache.beam.sdk.extensions.avro.coders.AvroCoder.class)"
]
}
```

Generate the corresponding Java class with Avro's code generation tooling,
then use TransBeamer to create a new reader configured to read CSV from the local
"input" directory with a file prefix of "starwars":
```java
Pipeline pipeline = Pipeline.create();
PCollection<StarWarsMovie> movies = pipeline.apply(
    TransBeamer.newReader(
        CsvFormat.create(),      // Choose the format here
        "input",                 // Where to read the CSV files from
        StarWarsMovie.class      // The Avro class to use
    ).withFilePrefix("starwars") // A prefix to filter the files on
);
```

To read data in other formats, use one of the other DataFormat implementations
(e.g., ParquetFormat.create()) when creating the reader.
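
For instance, a Parquet reader mirrors the CSV reader above (a sketch; the path is a placeholder, and withFilePrefix is omitted since this README only shows it with CSV):

```java
// Read Parquet files from the local "input" directory
PCollection<StarWarsMovie> moviesFromParquet = pipeline.apply(
    TransBeamer.newReader(
        ParquetFormat.create(), // Columnar file format
        "input",                // Where to read the Parquet files from
        StarWarsMovie.class     // The Avro class to use
    )
);
```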
Assuming you already have a PCollection backed by Avro objects, writing them
out is a straightforward affair:
```java
PCollection<StarWarsMovie> movies = /* created elsewhere */;
movies.apply(
    TransBeamer.newWriter(
        ParquetFormat.create(), // The format to write
        "build",                // Where to write the files
        StarWarsMovie.class     // The Avro class to write from
    )
);
```

TransBeamer can also transform your data to and from Pubsub. Create a Topic in your GCP project, and then ...
```java
// Read in CSV files
Pipeline pipeline = Pipeline.create();
final PCollection<StarWarsMovie> movies = pipeline.apply(
    TransBeamer.newReader(
        CsvFormat.create(), "input", StarWarsMovie.class
    )
);

// Write each entry as a Pubsub message
final String myTopic = "/projects/my-project/topics/my-topic-name";
movies.apply(
    TransBeamer.newWriter(
        PubsubFormat.withTopic(myTopic), // The topic to write to
        StarWarsMovie.class              // The Avro class of "movies"
    )
);
```
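
Going the other direction, reading from Pubsub should follow the same shape; this is a sketch based on the argument convention described earlier (non-file formats carry their location in the format itself), using the topic-based factory since that is the only one shown in this README:

```java
// Read Pubsub messages into an Avro-backed PCollection
PCollection<StarWarsMovie> received = pipeline.apply(
    TransBeamer.newReader(
        PubsubFormat.withTopic(myTopic), // The topic to read from
        StarWarsMovie.class              // The Avro class to decode into
    )
);
```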
TransBeamer requires:

- Java >= 11
- Apache Beam >= 2.63
- Apache Avro >= 1.11.X
Building requires Java 21+.
```sh
git clone https://github.com/sanuscorp/transbeamer.git
cd transbeamer
./gradlew build
```

Run the checks and tests with:

```sh
./gradlew check
...
./gradlew test
...
```

Contributions are welcome:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License—see the LICENSE file for details.
For questions, issues, or contributions, please:
- Open an issue on GitHub Issues
TransBeamer is developed and maintained by Sanus Software & Services.