
test: add Spark 4.0 to lakeFSFS integration and compatibility tests #10175

Draft
zubron wants to merge 13 commits into master from
task/check-lakefsfs-spark4-compatibility

Conversation

@zubron (Contributor) commented Feb 23, 2026

Adds Spark 4.0.0 and 4.1.1 to the existing Spark integration test suite to verify lakeFS compatibility.

Closes https://github.com/treeverse/product/issues/1076

Findings

lakeFSFS is compatible with Spark 4 provided the AWS SDK v1 is available. The lakeFS S3 Gateway is also compatible; however, the redirect optimization no longer works.

What was tested

Spark 4.0.0 (Scala 2.13, Hadoop 3.4.1, AWS SDK v2) was added to the existing integration test suite (see #10175).

| Access mode | Result |
| --- | --- |
| lakeFSFS simple (lakefs://, cluster has S3 creds) | ✅ Pass |
| lakeFSFS presigned (lakefs://, no S3 creds needed) | ✅ Pass |
| S3 Gateway (s3a://) | ✅ Pass |
| S3 Gateway + redirect optimization (s3a://, lakeFS responding with 307 to presigned URL) | ❌ Broken |

Why S3 Gateway redirect optimization is broken

Spark 4.0.0 upgrades Hadoop from 3.3.4 to 3.4.1 (release notes) and removes the AWS SDK v1 bundle. Hadoop 3.4.0 migrated S3A to AWS SDK v2 (HADOOP-18073).

AWS SDK v2 does not follow HTTP redirects at the transport layer. This is intentional (aws-sdk-java-v2#975, aws-sdk-java-v2#989). Both HTTP clients used by S3A treat non-2xx responses as errors. When lakeFS returns a 307, Hadoop's S3AUtils.translateException() treats it as an S3 region redirect, looks for an x-amz-bucket-region header that lakeFS doesn't set, and throws AWSRedirectException: redirect to region null (S3AUtils.java).
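The failure path above can be sketched as a small self-contained model. This is illustrative only, not Hadoop's actual code; the exception name and header mirror `S3AUtils.translateException()`, but the function itself is hypothetical:

```python
# Simplified model of the failure path (illustrative, not Hadoop's actual
# implementation). The SDK surfaces the 307 as an error; S3A then looks
# for the bucket-region header that lakeFS never sets.

class AWSRedirectException(Exception):
    pass

def translate_response(status: int, headers: dict) -> str:
    """Mimic how S3A treats a non-2xx response from the 'S3' endpoint."""
    if 200 <= status < 300:
        return "ok"
    if status in (301, 307):
        # S3A assumes a region redirect and reads x-amz-bucket-region.
        region = headers.get("x-amz-bucket-region")  # lakeFS: absent -> None
        raise AWSRedirectException(f"redirect to region {region}")
    raise Exception(f"HTTP {status}")

# The lakeFS S3 gateway answers 307 with a Location header, no region header:
try:
    translate_response(307, {"Location": "https://bucket.s3.example.com/presigned"})
except AWSRedirectException as e:
    print(e)  # redirect to region None
```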

This is not fixable via configuration. The QueryStringSignerType signer, which was previously used to prevent double-signing of presigned URLs, also no longer exists in SDK v2. The redirect test has been restricted to Spark 3 only.

Impact

Users who can install custom JARs can migrate to lakeFSFS presigned mode, which provides the same direct-to-storage performance benefit as redirect mode. The migration requires adding the lakeFSFS JAR, switching URIs from s3a:// to lakefs://, and configuring the lakeFS endpoint and blockstore credentials.
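As a rough sketch of that migration, the configuration below uses the lakeFSFS key names from the lakeFS docs; the endpoint and credential values are placeholders, and the URI helper is a hypothetical illustration of the path change:

```python
# Hypothetical migration sketch: Hadoop configuration for lakeFSFS presigned
# mode (key names follow the lakeFS docs; all values are placeholders).
lakefsfs_presigned_conf = {
    "fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
    "fs.lakefs.access.mode": "presigned",  # direct-to-storage via presigned URLs
    "fs.lakefs.endpoint": "https://lakefs.example.com/api/v1",
    "fs.lakefs.access.key": "<lakeFS access key>",
    "fs.lakefs.secret.key": "<lakeFS secret key>",
}

# Paths change from s3a://<repo>/<branch>/... to lakefs://<repo>/<branch>/...
def migrate_uri(s3a_uri: str) -> str:
    assert s3a_uri.startswith("s3a://")
    return "lakefs://" + s3a_uri[len("s3a://"):]

print(migrate_uri("s3a://my-repo/main/data/part-0.parquet"))
# -> lakefs://my-repo/main/data/part-0.parquet
```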

Users who cannot install custom JARs (notably Databricks SQL Warehouses, lakeFS Databricks docs) have no direct-to-storage path on Spark 4. All data will proxy through lakeFS, which is a significant performance and cost concern for large data volumes.

Changes

  • Add sonnets-400 build target (Scala 2.13, Spark 4.0.0, Hadoop 3.4.1)
  • Add sonnets-411 build target (Scala 2.13, Spark 4.1.1, Hadoop 3.4.2)
  • Bump sbt to 1.9.7 for Scala 2.13 support
  • Switch Sonnets.scala from log4j to SLF4J (compatible with all Spark versions)
  • Split spark-prep build: Java 8 for Spark 2/3, Java 17 for Spark 4
  • Expand esti Spark matrix from single tag: 4 to tag: "4.0.0" and tag: "4.1.1" with repo_suffix to
    avoid dots in lakeFS repository names
  • Restrict S3 Gateway redirect test to Spark 3 only
  • Add contract-tests-hadoop342 profile to hadoopfs client for testing against Hadoop 3.4.2 (with explicit
    aws-java-sdk-bundle and assertj-core test deps)
  • Run Hadoop 3.4.2 contract tests in the esti workflow
  • Add docker-compose extra_hosts for gateway-test-spark400 and gateway-test-spark411

Add Spark 4.0.0 (Scala 2.13, Hadoop 3.4.1) as a new build target for
the Sonnets test app alongside the existing Spark 2.4.6 and 3.1.1
targets. This enables CI to verify lakeFSFS compatibility with Spark 4.

Changes:
- Add sonnets-400 build target (Scala 2.13, Spark 4.0.0, Hadoop 3.4.1)
- Switch Sonnets.scala from log4j to SLF4J (compatible with all Spark versions)
- Update sbt to 1.9.7 for Scala 2.13 support
- Add Spark 4.0.0 to the esti.yaml Spark test matrix (all access modes)
- Add Spark 4 compatibility test job against latest lakeFS (1.78.0)
- Split spark-prep to use Java 8 for Spark 2/3 and Java 17 for Spark 4

Refs: treeverse/product#1076

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@zubron added the area/client/spark, exclude-changelog (PR description should not be included in next release changelog), and minor-change (used for PRs that don't require an issue attached) labels on Feb 23, 2026.
@github-actions bot added the area/testing (improvements or additions to tests) and area/ci labels and removed the area/client/spark label on Feb 23, 2026.
zubron and others added 7 commits February 23, 2026 17:16
lakeFS repository names must match ^[a-z0-9][a-z0-9-]{2,62} (no dots).
The Spark 4.0.0 Docker tag contains dots, which broke repo creation.
Add a separate repo_suffix field for valid repository names.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
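The repository-name rule quoted above can be checked with a quick sketch; `re.fullmatch` supplies the end anchor the quoted pattern omits, so valid names are 3-63 lowercase characters with no dots:

```python
import re

# Repository-name rule from the commit message above. fullmatch anchors the
# pattern at both ends, so the whole name must satisfy it.
REPO_NAME = re.compile(r"[a-z0-9][a-z0-9-]{2,62}")

def valid_repo_name(name: str) -> bool:
    return REPO_NAME.fullmatch(name) is not None

print(valid_repo_name("gateway-test-spark400"))    # True: repo_suffix form, no dots
print(valid_repo_name("gateway-test-spark4.0.0"))  # False: Docker tag contains dots
```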
Use tag '4' instead of '4.0.0' — same digest, consistent with existing
'2' and '3' tags, and avoids dots in repository names.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The S3 gateway virtual-hosted-style bucket URLs need DNS entries in the
Spark containers. Add gateway-test-spark4 and gateway-redirect-test-spark4.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Redirect tests use path-style S3 access, so they don't need
virtual-hosted-style DNS entries.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spark 4 (Hadoop 3.4.1) uses AWS SDK v2, which removed the
QueryStringSignerType signer used by the redirect test. Restrict the
redirect test to Spark 3 only (restoring the original condition).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spark 4 (Hadoop 3.4.1, AWS SDK v2) doesn't support QueryStringSignerType.
Skip setting the signing algorithm override for Spark 4 and let it use
the default SigV4 signer. The redirect feature only needs path-style
access and the s3RedirectionSupport user agent prefix.

Re-enable the redirect test for Spark 4.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
AWS SDK v2 (Hadoop 3.4.1) interprets lakeFS's 307 redirect as an S3
region redirect rather than following it to the presigned URL. This is a
known incompatibility — users on Spark 4 should use lakeFSFS with
presigned mode instead of S3 gateway redirect.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@arielshaqed (Contributor) commented:

I do not think the lakeFSFS compatibility tests run automatically on PRs (because reasons), so I ran them here.

@arielshaqed (Contributor) commented:

The S3 Gateway redirect failure is due to AWS SDK v2 (bundled with Hadoop 3.4.1) intercepting all 3xx responses at the S3 service layer and treating them as region redirects, rather than following them as HTTP redirects. The QueryStringSignerType signer, which was required to prevent double-signing of presigned URLs, also no longer exists in SDK v2. This is not fixable via configuration. The redirect test has been restricted to Spark 3 only.

  1. We must provide a reference, for users and/or for ourselves.
  2. Is it possible to use a suitable Hadoop version which can do this? (Possibly related to above question.)

lakeFSFS presigned mode provides the same direct-to-storage performance benefit as the redirect mode and is the recommended alternative for Spark 4 users.

I worry this is false. It provides the same direct-to-storage performance benefits. But it requires very different configuration - which our users might not have. It forces all users to change to that configuration - now they need to configure lakeFSFS with blockstore credentials. This configuration might not be possible.

@zubron (Contributor, Author) commented Feb 25, 2026

@arielshaqed

  1. We must provide a reference, for users and/or for ourselves.
  2. Is it possible to use a suitable Hadoop version which can do this? (Possibly related to above question.)

Here's the additional context and sources that should have been in the original description:

Sources for the redirect incompatibility

The dependency chain:

  • Spark 4.0.0 upgrades Hadoop from 3.3.4 to 3.4.1 and removes aws-java-sdk-bundle (the AWS SDK v1 jar) (Spark 4.0.0 Release Notes)
  • Hadoop 3.4.0 migrated S3A from AWS SDK v1 to v2 (HADOOP-18073)

AWS SDK v2 does not follow HTTP redirects at the transport layer. This appears to be an intentional design decision so that redirect handling can be done at the SDK layer rather than the HTTP layer (aws-sdk-java-v2#975, aws-sdk-java-v2#989).

Hadoop's S3A DefaultS3ClientFactory uses two SDK HTTP clients (configured in AWSClientConfig), neither of which follows redirects:

  • ApacheHttpClient (sync): redirect following is explicitly disabled via .disableRedirectHandling() (source)
  • NettyNioAsyncHttpClient (async): Netty doesn't support redirects by default, and the AWS SDK netty client doesn't seem to expose any way to enable it.

When the lakeFS S3 gateway returns a 307 redirect, the SDK treats it as an error and unmarshalls it into an S3Exception. Hadoop's S3AUtils.translateException() wraps 301/307 status codes as AWSRedirectException, expecting an S3 region redirect with an x-amz-bucket-region header. lakeFS doesn't set this header, so the region resolves to null. (S3AUtils.java)

Possible workaround

Spark 4 offers a Hadoop-free build where users provide their own Hadoop version. Pairing Spark 4 with Hadoop 3.3.x (SDK v1) could preserve redirect support, but I haven't found any information on how well this would work. It's also not an option on managed platforms like Databricks or EMR, where the Hadoop version is fixed.

Impact on users who cannot use lakeFSFS

The PR description recommended lakeFSFS presigned as the alternative. This works for users who can install custom JARs, but as you mentioned, that may not be an option.

From my understanding, these are the use cases that are impacted:

  • Databricks SQL Warehouses do not allow external JARs (lakeFS Databricks docs), so lakeFSFS is not an option. These users are limited to the S3 Gateway, and on Spark 4 the loss of redirect means all data must proxy through lakeFS. For users with large data volumes this is a significant performance and cost concern.
  • Databricks Unity Catalog "does not respect cluster configurations for filesystem settings" when accessing data through Unity Catalog (Databricks docs).

For these users, there is currently no direct-to-storage path on Spark 4.

@zubron force-pushed the task/check-lakefsfs-spark4-compatibility branch from d833757 to 8fc4531 on February 26, 2026 11:35
zubron and others added 2 commits February 26, 2026 16:51
Integrate Spark 4.0.0 and 4.1.1 into the existing esti and compatibility
test matrices instead of having a separate Spark 4 job. Uses repo_suffix
to avoid dots in lakeFS repository names, and matrix include to limit
Spark 4.x compat tests to lakeFS >= 1.78.0.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spark 4.1.1 ships with hadoop-aws 3.4.2 which requires the AWS SDK v2
bundle at runtime. Update the build to compile against the matching
Hadoop version.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@github-actions bot added the dependencies (pull requests that update a dependency file) and area/client/hadoopfs labels on Feb 27, 2026.
Add contract-tests-hadoop342 Maven profile for testing the hadoopfs
client against Hadoop 3.4.2 (used by Spark 4.1.1). The profile includes
aws-java-sdk-bundle as a provided dependency since hadoop-aws 3.4.2
changed it from compile to provided scope.

LakeFSFileSystemServerS3Test is excluded from this profile because
Hadoop 3.4.2's S3A gets 403 errors when accessing minio in the test
environment. This needs further investigation.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@zubron force-pushed the task/check-lakefsfs-spark4-compatibility branch from 73e3a61 to 4a3f4a1 on February 27, 2026 23:05
zubron and others added 2 commits March 2, 2026 14:08
Add fs.s3a.path.style.access=true to S3FSTestBase so Hadoop 3.4.2's
S3A client (AWS SDK v2) uses path-style URLs when talking to MinIO.
This fixes the 403 errors that caused LakeFSFileSystemServerS3Test to
be excluded from the contract-tests-hadoop342 profile.

Also remove unused gateway-test-spark4 host entries from docker-compose.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
On CI, AWS environment variables from the runner are picked up by
Hadoop 3.4.2's SDK v2 credential chain before the Hadoop config
credentials. This causes 403 errors when the CI runner's AWS creds
are sent to MinIO instead of the test credentials.

Explicitly set the credential provider to SimpleAWSCredentialsProvider
to ensure only the Hadoop config access/secret keys are used.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
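Taken together, the two fixes above amount to a handful of S3A settings when pointing Hadoop 3.4.x (AWS SDK v2) at MinIO. A sketch, with placeholder endpoint and credential values for the test environment:

```python
# Sketch of the S3A settings the two MinIO fixes amount to; endpoint and
# credential values are placeholders for the test environment.
s3a_minio_conf = {
    # Fix 1: MinIO buckets aren't resolvable as virtual-hosted-style DNS names.
    "fs.s3a.path.style.access": "true",
    # Fix 2: pin the provider so ambient AWS env credentials on CI are ignored.
    "fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    "fs.s3a.endpoint": "http://minio:9000",
    "fs.s3a.access.key": "<test access key>",
    "fs.s3a.secret.key": "<test secret key>",
}
```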