Jaeger Spark dependencies

This is a Spark job that collects spans from storage, analyzes links between services, and stores them for later presentation in the UI. This job is needed for production deployments; the all-in-one distribution does not need it.

This job parses all traces on a given day, based on UTC. By default, it processes the current day, but other days can be explicitly specified.

Quick-start

The Spark job can be run as a Docker container or as a Java executable:

Container Image Variants

Starting with version 0.6.x, Docker images are published with variant-specific tags. Each variant automatically uses the appropriate storage backend, so the STORAGE environment variable is no longer needed.

The images are named ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:{VERSION}-{VARIANT}:

VERSION-cassandra: For Cassandra storage (uses CassandraDependenciesJob directly)
VERSION-elasticsearch7: For Elasticsearch 7.12-7.16 (uses ElasticsearchDependenciesJob with ES connector 7.17.29)
VERSION-elasticsearch8: For Elasticsearch 7.17+ and 8.x (uses ElasticsearchDependenciesJob with ES connector 8.13.4)
VERSION-elasticsearch9: For Elasticsearch 9.x (uses ElasticsearchDependenciesJob with ES connector 9.1.3) - also tagged as :latest
VERSION-opensearch: For OpenSearch 2.x and 3.x (uses OpenSearchDependenciesJob with OpenSearch Java client)
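
The naming scheme above can be captured in a small helper. This is only a sketch: image_ref is a hypothetical function name, not something shipped with the project, and the accepted variants are copied from the list above.

```shell
# Hypothetical helper (not part of the project): builds an image
# reference from a version and one of the variants listed above.
image_ref() {
  version=$1
  variant=$2
  case "$variant" in
    cassandra|elasticsearch7|elasticsearch8|elasticsearch9|opensearch) ;;
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac
  echo "ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:${version}-${variant}"
}
```

Rejecting unknown variants up front gives a clearer error than a failed image pull.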

Example for Cassandra:

$ docker run \
  --env CASSANDRA_CONTACT_POINTS=host1,host2 \
  ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:v0.5.3-cassandra

Example for Elasticsearch 8.x:

$ docker run \
  --env ES_NODES=http://elasticsearch:9200 \
  ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:v0.5.3-elasticsearch8

Example for OpenSearch:

$ docker run \
  --env OS_NODES=http://opensearch:9200 \
  ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:v0.5.3-opensearch

Advanced Configuration

Use --env JAVA_OPTS to pass additional Java options such as memory settings, SSL trust store, or other JVM properties:

# Example: Configure SSL trust store
$ docker run \
  --env ES_NODES=https://elasticsearch:9200 \
  --env JAVA_OPTS="-Djavax.net.ssl.trustStore=/path/to/truststore -Djavax.net.ssl.trustStorePassword=changeit" \
  ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:v0.5.3-elasticsearch8

# Example: Increase JVM heap size
$ docker run \
  --env OS_NODES=http://opensearch:9200 \
  --env JAVA_OPTS="-Xmx2g -Xms1g" \
  ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:v0.5.3-opensearch

Use --env LOG4J_STATUS_LOGGER_LEVEL to control Log4j2 internal status messages (defaults to OFF):

# Example: Enable Log4j2 debug logging for troubleshooting
$ docker run \
  --env OS_NODES=http://opensearch:9200 \
  --env LOG4J_STATUS_LOGGER_LEVEL=DEBUG \
  ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:v0.5.3-opensearch

Note: the latest versions are hosted on ghcr.io, not on Docker Hub.

As jar file:

STORAGE=cassandra java -jar jaeger-spark-dependencies.jar

Usage

By default, this job parses all traces since midnight UTC. To parse traces for a different day, pass the date as an argument in YYYY-mm-dd format (for example, 2016-07-16) or set the DATE environment variable.

# Example: process yesterday's traces on macOS
$ STORAGE=cassandra java -jar jaeger-spark-dependencies.jar `date -uv-1d +%F`
# or on Linux
$ STORAGE=cassandra java -jar jaeger-spark-dependencies.jar `date -u -d '1 day ago' +%F`
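
The two date invocations above can be wrapped in one function so the same command line works on both systems, and the result can be passed through the DATE environment variable instead of an argument. This is a sketch under assumptions: yesterday_utc is a hypothetical name, and trying the BSD syntax before the GNU syntax is just one possible fallback order.

```shell
# Hypothetical helper: prints yesterday's date (UTC) as YYYY-mm-dd.
# Tries the BSD/macOS syntax first and falls back to the GNU syntax.
yesterday_utc() {
  date -u -v-1d +%F 2>/dev/null || date -u -d '1 day ago' +%F
}

# Usage (not executed here):
#   DATE=$(yesterday_utc) STORAGE=cassandra java -jar jaeger-spark-dependencies.jar
```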

Configuration

jaeger-spark-dependencies applies configuration parameters through environment variables.

The following variables are common to all storage layers:

SPARK_MASTER: Spark master to submit the job to. Defaults to local[*]
DATE: Date in YYYY-mm-dd format. Denotes a day for which dependency links will be created.
PEER_SERVICE_TAG: Tag name used to identify peer service in spans. Defaults to peer.service
JAVA_OPTS: Additional Java options to pass to the JVM. Use this to configure memory, SSL properties, or other JVM settings. Example: JAVA_OPTS="-Xmx2g -Djavax.net.ssl.trustStore=/path/to/truststore". Note: The required --add-opens flags for Spark on Java 21+ are already included in the container image.
LOG4J_STATUS_LOGGER_LEVEL: Log4j2 StatusLogger level. Defaults to OFF to suppress internal Log4j2 status messages. Set to TRACE, DEBUG, INFO, WARN, ERROR, or FATAL if you need to debug logging configuration issues.

Cassandra

Cassandra is used when STORAGE=cassandra.

CASSANDRA_KEYSPACE: The keyspace to use. Defaults to "jaeger_v1_dc1".
CASSANDRA_CONTACT_POINTS: Comma-separated list of hosts or IP addresses that are part of the Cassandra cluster. Defaults to localhost.
CASSANDRA_LOCAL_DC: The local DC to connect to (other nodes will be ignored).
CASSANDRA_USERNAME and CASSANDRA_PASSWORD: Cassandra authentication. The job throws an exception on startup if authentication fails.
CASSANDRA_USE_SSL: Set to true to use SSL. Requires javax.net.ssl.trustStore and javax.net.ssl.trustStorePassword. Defaults to false.
CASSANDRA_CLIENT_AUTH_ENABLED: Set to true to enable client authentication on SSL connections. Requires javax.net.ssl.keyStore and javax.net.ssl.keyStorePassword. Defaults to false.

Example usage:

$ STORAGE=cassandra CASSANDRA_CONTACT_POINTS=localhost:9042 java -jar jaeger-spark-dependencies.jar

Elasticsearch

Elasticsearch is used when STORAGE=elasticsearch.

Important: Use the appropriate Docker image variant for your Elasticsearch version:

ES 7.12-7.16: Use :VERSION-elasticsearch7 tag
ES 7.17-8.x: Use :VERSION-elasticsearch8 tag
ES 9.x: Use :VERSION-elasticsearch9 tag (or :latest)
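
The version-to-variant mapping above can be written as a small lookup. This is a sketch, not project code: es_variant is a hypothetical helper name, and treating every version outside the listed ranges as unsupported is an assumption.

```shell
# Hypothetical helper: maps an Elasticsearch version (e.g. 7.14.2)
# to the image variant suffix listed above.
es_variant() {
  major=${1%%.*}        # text before the first dot
  rest=${1#*.}          # text after the first dot
  minor=${rest%%.*}
  case "$major" in
    7)
      if [ "$minor" -ge 17 ]; then
        echo elasticsearch8
      elif [ "$minor" -ge 12 ]; then
        echo elasticsearch7
      else
        echo "unsupported Elasticsearch version: $1" >&2
        return 1
      fi ;;
    8) echo elasticsearch8 ;;
    9) echo elasticsearch9 ;;
    *) echo "unsupported Elasticsearch version: $1" >&2; return 1 ;;
  esac
}
```

Note that 7.17 maps to the elasticsearch8 variant, which is the easiest case to get wrong when choosing a tag by major version alone.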

Configuration

ES_NODES: A comma separated list of elasticsearch hosts advertising http. Defaults to 127.0.0.1. Add port section if not listening on port 9200. Only one of these hosts needs to be available to fetch the remaining nodes in the cluster. It is recommended to set this to all the master nodes of the cluster. Use url format for SSL. For example, "https://yourhost:8888"
ES_NODES_WAN_ONLY: Set to true to only use the values set in ES_NODES, for example if your Elasticsearch cluster is in Docker. If you're using a cloud provider such as AWS Elasticsearch, set this to true. Defaults to false.
ES_USERNAME and ES_PASSWORD: Elasticsearch basic authentication. Use when X-Pack security (formerly Shield) is in place. By default no username or password is provided to elasticsearch.
ES_CLIENT_NODE_ONLY: Set to true to disable elasticsearch cluster nodes.discovery and enable nodes.client.only. If your elasticsearch cluster's data nodes only listen on loopback ip, set this to true. Defaults to false.
ES_INDEX_PREFIX: Index prefix of Jaeger indices. Unset by default.
ES_INDEX_DATE_SEPARATOR: Date separator used in Jaeger index names. Defaults to -. For example, setting it to . matches the index "jaeger-span-2020.11.25".
ES_TIME_RANGE: How far back the job should look for spans. The default and maximum is 24h. Any value accepted by date-math can be used, but the anchor is always now.
ES_USE_ALIASES: Set to true to use index alias names to read from and write to. Usually required when using rollover indices.
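
To illustrate ES_INDEX_DATE_SEPARATOR, the sketch below builds the daily span index name for a given day. span_index is a hypothetical helper, and the jaeger-span- base name is taken from the example above. ES_INDEX_PREFIX is deliberately left out because this document does not describe how the prefix is joined to the name.

```shell
# Hypothetical helper: prints the span index name for a day given as
# YYYY-mm-dd, using the separator passed as the second argument
# (defaults to "-", like ES_INDEX_DATE_SEPARATOR).
span_index() {
  day=$1
  sep=${2:--}
  y=${day%%-*}          # year
  rest=${day#*-}
  m=${rest%%-*}         # month
  d=${rest#*-}          # day of month
  echo "jaeger-span-${y}${sep}${m}${sep}${d}"
}
```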

Example usage:

$ STORAGE=elasticsearch ES_NODES=http://localhost:9200 java -jar jaeger-spark-dependencies.jar

OpenSearch

OpenSearch is used when STORAGE=opensearch.

Important: Use the :VERSION-opensearch Docker image variant.

Configuration

OS_NODES: A comma separated list of OpenSearch hosts advertising http. Defaults to 127.0.0.1. Add port section if not listening on port 9200. Only one of these hosts needs to be available to fetch the remaining nodes in the cluster. It is recommended to set this to all the master nodes of the cluster. Use url format for SSL. For example, "https://yourhost:8888"
OS_NODES_WAN_ONLY: Set to true to only use the values set in OS_NODES, for example if your OpenSearch cluster is in Docker. If you're using a cloud provider such as AWS OpenSearch, set this to true. Defaults to false.
OS_USERNAME and OS_PASSWORD: OpenSearch basic authentication. By default no username or password is provided.
OS_INDEX_PREFIX: Index prefix of Jaeger indices. Unset by default.
OS_INDEX_DATE_SEPARATOR: Date separator used in Jaeger index names. Defaults to -. For example, setting it to . matches the index "jaeger-span-2020.11.25".
OS_TIME_RANGE: How far back the job should look for spans. The default and maximum is 24h. Any value accepted by date-math can be used, but the anchor is always now.

Example usage:

$ docker run \
  --env OS_NODES=http://opensearch:9200 \
  ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:v0.5.3-opensearch

Design

At a high-level, this job does the following:

read lots of spans from a time period
group them by traceId
construct a graph using parent-child relationships expressed in span references
for each edge (parent span, child span) output (parent service, child service, count)
write the results to the database (e.g. dependencies_v2 table in Cassandra)
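
The steps above can be sketched on a toy scale with awk. This is not how the job is implemented (the real job runs on Spark); it only illustrates the aggregation. The input format is hypothetical: one span per line as traceId, spanId, parentSpanId and service name, with "-" as the parent of a root span. The sketch also assumes each span has at most one parent.

```shell
# Hypothetical input: one span per line, whitespace-separated:
#   traceId spanId parentSpanId serviceName
# Root spans use "-" as parentSpanId.
# Output: one line per edge: parentService childService count
dependency_links() {
  awk '
    {
      span = $1 "/" $2                      # span IDs are scoped to their trace
      service[span] = $4
      if ($3 != "-") parent[span] = $1 "/" $3
    }
    END {
      for (child in parent) {
        p = parent[child]
        if (!(p in service)) continue       # parent span not in the input
        count[service[p] " " service[child]]++
      }
      for (edge in count) print edge, count[edge]
    }
  ' | sort
}
```

Keying every span by traceId and spanId together plays the role of the group-by-trace step: a child can only be matched with a parent from the same trace. Child spans whose parent is missing from the input are skipped.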

Building locally

To build the job locally and run tests:

./mvnw clean install # if this fails, prefix the command with SPARK_LOCAL_IP=127.0.0.1

To run the unified jar (includes all storage backends):

STORAGE=cassandra java -jar jaeger-spark-dependencies/target/jaeger-spark-dependencies-0.0.1-SNAPSHOT.jar
# or
STORAGE=elasticsearch ES_NODES=http://localhost:9200 java -jar jaeger-spark-dependencies/target/jaeger-spark-dependencies-0.0.1-SNAPSHOT.jar
# or
STORAGE=opensearch OS_NODES=http://localhost:9200 java -jar jaeger-spark-dependencies/target/jaeger-spark-dependencies-0.0.1-SNAPSHOT.jar

To run storage-specific jars directly (without STORAGE variable):

# Cassandra
java -jar jaeger-spark-dependencies-cassandra/target/jaeger-spark-dependencies-cassandra-0.0.1-SNAPSHOT.jar
# Elasticsearch
ES_NODES=http://localhost:9200 java -jar jaeger-spark-dependencies-elasticsearch/target/jaeger-spark-dependencies-elasticsearch-0.0.1-SNAPSHOT.jar
# OpenSearch
OS_NODES=http://localhost:9200 java -jar jaeger-spark-dependencies-opensearch/target/jaeger-spark-dependencies-opensearch-0.0.1-SNAPSHOT.jar

To build Docker image:

Note: The Dockerfile now requires a pre-built JAR. First build the JAR using Maven, then build the Docker image.

For Cassandra:

./mvnw clean package --batch-mode -Dlicense.skip=true -DskipTests -pl jaeger-spark-dependencies-cassandra -am
mkdir -p artifact-target
cp jaeger-spark-dependencies-cassandra/target/jaeger-spark-dependencies-cassandra-0.0.1-SNAPSHOT.jar artifact-target/
docker build --build-arg VARIANT=cassandra -t jaegertracing/spark-dependencies:cassandra .

For Elasticsearch 9:

./mvnw clean package --batch-mode -Dlicense.skip=true -DskipTests -Dversion.elasticsearch.spark=9.1.3 -pl jaeger-spark-dependencies-elasticsearch -am
mkdir -p artifact-target
cp jaeger-spark-dependencies-elasticsearch/target/jaeger-spark-dependencies-elasticsearch-0.0.1-SNAPSHOT.jar artifact-target/
docker build --build-arg VARIANT=elasticsearch9 -t jaegertracing/spark-dependencies:elasticsearch9 .

In tests, the version of the Jaeger images can be set with the JAEGER_VERSION environment variable or the jaeger.version system property. By default, tests use the latest images.

Running Integration Tests

The integration tests validate the Spark dependencies job against different storage backends:

Cassandra 4.x
Elasticsearch 7
Elasticsearch 8
Elasticsearch 9

Prerequisites

Before running integration tests, ensure you have the following installed:

Java 21 (Temurin distribution recommended)
Docker (for building images and running testcontainers)
Maven (included via ./mvnw wrapper)

Quick Start

Use the following make targets to run integration tests:

make e2e-cassandra  # Run Cassandra integration tests
make e2e-es7        # Run Elasticsearch 7 integration tests
make e2e-es8        # Run Elasticsearch 8 integration tests
make e2e-es9        # Run Elasticsearch 9 integration tests

What Each Target Does

Each test suite performs two steps:

Builds a Docker image with the appropriate storage variant
Runs tests using testcontainers against that variant

Environment Variables

The following environment variables are used in integration tests:

SPARK_DEPENDENCIES_JOB_TAG: Specifies the Docker image tag to use in tests (e.g., test-cassandra, test-es7, test-es8, test-es9)
ELASTICSEARCH_VERSION: Specifies the Elasticsearch version for testcontainers to use
JAEGER_VERSION: (Optional) Specifies the version of Jaeger images to use in tests. Defaults to latest.

You can also set this as a system property:

./mvnw test -Djaeger.version=2.14.0

Troubleshooting

Docker Permission Issues

If you encounter Docker permission issues, ensure your user is in the docker group:

sudo usermod -aG docker $USER

Then log out and log back in.

Testcontainers Issues

If testcontainers fail to start, ensure:

Docker is running and accessible
The Ryuk image is pulled: docker pull testcontainersofficial/ryuk:latest
You have sufficient disk space for Docker images

Build Failures

If you encounter build failures:

Ensure you have Java 21 installed
Clean previous build output: ./mvnw clean
Try running with the -U flag to force update dependencies: ./mvnw -U clean install

Port Conflicts

If tests fail due to port conflicts, ensure no other services are running on the ports used by testcontainers (typically ephemeral ports, but sometimes standard ports like 9042 for Cassandra or 9200 for Elasticsearch).

CI/CD Pipeline

The project uses a unified CI/CD pipeline (.github/workflows/ci-cd.yml) that implements a Host-Build Matrix Pattern:

Setup & Dependency Download - Downloads all Maven dependencies once and warms the cache for subsequent jobs
Build JARs - Builds storage-specific JARs on the GitHub runner (parallel for all variants)
E2E Tests - Tests each variant using Docker containers with pre-built JARs
Publish - Publishes multi-arch Docker images (linux/amd64, linux/arm64) to GitHub Container Registry

The pipeline supports four variants:

cassandra - For Cassandra storage
elasticsearch7 - For Elasticsearch 7.12-7.16 (ES connector 7.17.29)
elasticsearch8 - For Elasticsearch 7.17+ and 8.x (ES connector 8.13.4)
elasticsearch9 - For Elasticsearch 9.x (ES connector 9.1.3)

This approach eliminates Maven downloads inside Docker builds and parallelizes builds across all storage variants.

License

Apache 2.0 License.
