test: add Spark 4.0 to lakeFSFS integration and compatibility tests#10175
Conversation
Add Spark 4.0.0 (Scala 2.13, Hadoop 3.4.1) as a new build target for the Sonnets test app alongside the existing Spark 2.4.6 and 3.1.1 targets. This enables CI to verify lakeFSFS compatibility with Spark 4.

Changes:
- Add sonnets-400 build target (Scala 2.13, Spark 4.0.0, Hadoop 3.4.1)
- Switch Sonnets.scala from log4j to SLF4J (compatible with all Spark versions)
- Update sbt to 1.9.7 for Scala 2.13 support
- Add Spark 4.0.0 to the esti.yaml Spark test matrix (all access modes)
- Add a Spark 4 compatibility test job against the latest lakeFS (1.78.0)
- Split spark-prep to use Java 8 for Spark 2/3 and Java 17 for Spark 4

Refs: treeverse/product#1076

Co-Authored-By: Claude Opus 4.6 <[email protected]>
lakeFS repository names must match ^[a-z0-9][a-z0-9-]{2,62} (no dots).
The Spark 4.0.0 Docker tag contains dots, which broke repo creation.
Add a separate repo_suffix field for valid repository names.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
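A quick sketch of the naming constraint, using Python's standard `re` module. The helper name and the example repository names are illustrative, not taken from the actual test code:

```python
import re

# lakeFS repository name rule quoted in the commit message:
# must match ^[a-z0-9][a-z0-9-]{2,62} — lowercase alphanumerics and
# hyphens only, so dots from a Docker tag like "4.0.0" are rejected.
REPO_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{2,62}$")

def is_valid_repo_name(name: str) -> bool:
    return REPO_NAME_RE.fullmatch(name) is not None

# A repo name built from the dotted Docker tag fails...
print(is_valid_repo_name("gateway-test-4.0.0"))    # False
# ...while a dot-free repo_suffix passes.
print(is_valid_repo_name("gateway-test-spark400"))  # True
```

This is why the dedicated `repo_suffix` field (or a dot-free tag) is needed: the Docker tag and the repository name obey different grammars.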
Use tag '4' instead of '4.0.0' — same digest, consistent with existing '2' and '3' tags, and avoids dots in repository names. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The S3 gateway virtual-hosted-style bucket URLs need DNS entries in the Spark containers. Add gateway-test-spark4 and gateway-redirect-test-spark4. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Redirect tests use path-style S3 access, so they don't need virtual-hosted-style DNS entries. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spark 4 (Hadoop 3.4.1) uses AWS SDK v2, which removed the QueryStringSignerType signer used by the redirect test. Restrict the redirect test to Spark 3 only (restoring the original condition). Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spark 4 (Hadoop 3.4.1, AWS SDK v2) doesn't support QueryStringSignerType. Skip setting the signing algorithm override for Spark 4 and let it use the default SigV4 signer. The redirect feature only needs path-style access and the s3RedirectionSupport user agent prefix. Re-enable the redirect test for Spark 4. Co-Authored-By: Claude Opus 4.6 <[email protected]>
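The version branch described above can be sketched as follows. This is an illustrative Python outline, not the actual esti code; the function name and endpoint are made up, but the Hadoop property keys (`fs.s3a.signing-algorithm`, `fs.s3a.user.agent.prefix`) are real S3A settings:

```python
# Illustrative sketch: assemble the S3A options passed to spark-submit,
# skipping the signer override on Spark 4.
def build_s3a_conf(spark_major: int, endpoint: str) -> dict:
    conf = {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # The s3RedirectionSupport User-Agent prefix tells the lakeFS
        # gateway that this client opts in to 307 redirect handling.
        "spark.hadoop.fs.s3a.user.agent.prefix": "s3RedirectionSupport",
    }
    if spark_major < 4:
        # QueryStringSignerType exists only in AWS SDK v1 (Spark 2/3).
        # Spark 4 (SDK v2) falls through to the default SigV4 signer.
        conf["spark.hadoop.fs.s3a.signing-algorithm"] = "QueryStringSignerType"
    return conf
```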
AWS SDK v2 (Hadoop 3.4.1) interprets lakeFS's 307 redirect as an S3 region redirect rather than following it to the presigned URL. This is a known incompatibility — users on Spark 4 should use lakeFSFS with presigned mode instead of S3 gateway redirect. Co-Authored-By: Claude Opus 4.6 <[email protected]>
I do not think the lakeFSFS compatibility tests run automatically on PRs (because reasons), so I ran them here.
I worry this is false. It provides the same direct-to-storage performance benefits, but it requires very different configuration, which our users might not have. It forces all users to change to that configuration: now they need to configure lakeFSFS with blockstore credentials. This configuration might not be possible.
Here's the additional context and sources that should have been in the original description:

Sources for the redirect incompatibility

The dependency chain:

AWS SDK v2 does not follow HTTP redirects at the transport layer. This appears to be an intentional design decision so that redirect handling can be done at the SDK layer rather than the HTTP layer (aws-sdk-java-v2#975, aws-sdk-java-v2#989).

Hadoop's S3A

When the lakeFS S3 gateway returns a 307 redirect, the SDK treats it as an error and unmarshalls it into an exception.

Possible workaround

Spark 4 offers a Hadoop-free build where users provide their own Hadoop version. Pairing Spark 4 with Hadoop 3.3.x (SDK v1) could preserve redirect support. I haven't been able to find any information on how well this would work. It's also not an option on managed platforms like Databricks or EMR, where the Hadoop version is fixed.

Impact on users who cannot use lakeFSFS

The PR description recommended lakeFSFS presigned as the alternative. This works for users who can install custom JARs, but as you mentioned, that may not be an option. From my understanding, the impacted use cases are those where custom JARs cannot be installed. For these users, there is currently no direct-to-storage path on Spark 4.
Integrate Spark 4.0.0 and 4.1.1 into the existing esti and compatibility test matrices instead of having a separate Spark 4 job. Uses repo_suffix to avoid dots in lakeFS repository names, and matrix include to limit Spark 4.x compat tests to lakeFS >= 1.78.0. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spark 4.1.1 ships with hadoop-aws 3.4.2 which requires the AWS SDK v2 bundle at runtime. Update the build to compile against the matching Hadoop version. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add contract-tests-hadoop342 Maven profile for testing the hadoopfs client against Hadoop 3.4.2 (used by Spark 4.1.1). The profile includes aws-java-sdk-bundle as a provided dependency since hadoop-aws 3.4.2 changed it from compile to provided scope. LakeFSFileSystemServerS3Test is excluded from this profile because Hadoop 3.4.2's S3A gets 403 errors when accessing minio in the test environment. This needs further investigation. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add fs.s3a.path.style.access=true to S3FSTestBase so Hadoop 3.4.2's S3A client (AWS SDK v2) uses path-style URLs when talking to MinIO. This fixes the 403 errors that caused LakeFSFileSystemServerS3Test to be excluded from the contract-tests-hadoop342 profile. Also remove unused gateway-test-spark4 host entries from docker-compose. Co-Authored-By: Claude Opus 4.6 <[email protected]>
On CI, AWS environment variables from the runner are picked up by Hadoop 3.4.2's SDK v2 credential chain before the Hadoop config credentials. This causes 403 errors when the CI runner's AWS creds are sent to MinIO instead of the test credentials. Explicitly set the credential provider to SimpleAWSCredentialsProvider to ensure only the Hadoop config access/secret keys are used. Co-Authored-By: Claude Opus 4.6 <[email protected]>
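The two S3A settings introduced by the last two commits can be summarized as follows. The actual change lives in the Java test base class; this is just the key/value view, with made-up MinIO-style test credentials. The property names (`fs.s3a.path.style.access`, `fs.s3a.aws.credentials.provider`) are real Hadoop S3A options:

```python
# Hadoop conf entries described in the commits, as a plain dict.
MINIO_S3A_CONF = {
    # Path-style URLs (http://minio:9000/bucket/key) instead of the
    # virtual-hosted style that SDK v2 defaults to.
    "fs.s3a.path.style.access": "true",
    # Pin the credential chain to the Hadoop config keys so the CI
    # runner's AWS_* environment variables are never consulted.
    "fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    "fs.s3a.access.key": "minioadmin",  # example test credentials,
    "fs.s3a.secret.key": "minioadmin",  # not real AWS keys
}
```

Without the explicit provider, SDK v2's default chain checks environment variables before the Hadoop config, which is exactly how the runner's real AWS credentials leaked into requests against MinIO.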
Adds Spark 4.0.0 and 4.1.1 to the existing Spark integration test suite to verify lakeFS compatibility.
Closes https://github.com/treeverse/product/issues/1076
Findings
lakeFSFS is compatible with Spark 4 provided the AWS SDK v1 is available. The lakeFS S3 Gateway is also compatible; however, the redirect optimization no longer works.
What was tested
Spark 4.0.0 (Scala 2.13, Hadoop 3.4.1, AWS SDK v2) was added to the existing integration test suite (see #10175).
- lakeFSFS (`lakefs://`, cluster has S3 creds)
- lakeFSFS presigned (`lakefs://`, no S3 creds needed)
- S3 gateway (`s3a://`)
- S3 gateway redirect (`s3a://`, lakeFS responding with 307 to presigned URL)

Why S3 Gateway redirect optimization is broken
Spark 4.0.0 upgrades Hadoop from 3.3.4 to 3.4.1 (release notes) and removes the AWS SDK v1 bundle. Hadoop 3.4.0 migrated S3A to AWS SDK v2 (HADOOP-18073).
AWS SDK v2 does not follow HTTP redirects at the transport layer. This is intentional (aws-sdk-java-v2#975, aws-sdk-java-v2#989). Both HTTP clients used by S3A treat non-2xx responses as errors. When lakeFS returns a 307, Hadoop's `S3AUtils.translateException()` treats it as an S3 region redirect, looks for an `x-amz-bucket-region` header that lakeFS doesn't set, and throws `AWSRedirectException: redirect to region null` (S3AUtils.java).

This is not fixable via configuration. The `QueryStringSignerType` signer, which was previously used to prevent double-signing of presigned URLs, also no longer exists in SDK v2. The redirect test has been restricted to Spark 3 only.

Impact
Users who can install custom JARs can migrate to lakeFSFS presigned mode, which provides the same direct-to-storage performance benefit as redirect mode. The migration requires adding the lakeFSFS JAR, switching URIs from `s3a://` to `lakefs://`, and configuring the lakeFS endpoint and blockstore credentials.

Users who cannot install custom JARs (notably Databricks SQL Warehouses, lakeFS Databricks docs) have no direct-to-storage path on Spark 4. All data will proxy through lakeFS, which is a significant performance and cost concern for large data volumes.
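A minimal sketch of that migration, in PySpark-style configuration. The `fs.lakefs.*` key names follow the lakeFS Hadoop filesystem documentation as I understand it; the endpoint, credentials, and helper function are placeholders, not values from this PR:

```python
# Illustrative lakeFSFS presigned configuration (key names assumed from
# the lakeFS Hadoop filesystem docs; values are placeholders).
LAKEFSFS_PRESIGNED_CONF = {
    "spark.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
    "spark.hadoop.fs.lakefs.endpoint": "https://lakefs.example.com/api/v1",
    "spark.hadoop.fs.lakefs.access.key": "<lakefs-access-key>",
    "spark.hadoop.fs.lakefs.secret.key": "<lakefs-secret-key>",
    # Presigned mode: reads and writes go directly to object storage
    # via URLs presigned by the lakeFS server.
    "spark.hadoop.fs.lakefs.access.mode": "presigned",
}

def migrate_uri(s3a_uri: str) -> str:
    """Switch a table URI from the S3 gateway scheme to lakeFSFS."""
    return s3a_uri.replace("s3a://", "lakefs://", 1)

print(migrate_uri("s3a://my-repo/main/sonnets.parquet"))
# lakefs://my-repo/main/sonnets.parquet
```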
Changes
- Changed `tag: 4` to `tag: "4.0.0"` and `tag: "4.1.1"`, with `repo_suffix` to avoid dots in lakeFS repository names
- Added `contract-tests-hadoop342` profile to the hadoopfs client for testing against Hadoop 3.4.2 (with explicit `aws-java-sdk-bundle` and `assertj-core` test deps)
- Added `gateway-test-spark400` and `gateway-test-spark411` host entries