test: add Spark 4.0 to lakeFSFS integration and compatibility tests#10175
Conversation
Add Spark 4.0.0 (Scala 2.13, Hadoop 3.4.1) as a new build target for the Sonnets test app alongside the existing Spark 2.4.6 and 3.1.1 targets. This enables CI to verify lakeFSFS compatibility with Spark 4.

Changes:
- Add sonnets-400 build target (Scala 2.13, Spark 4.0.0, Hadoop 3.4.1)
- Switch Sonnets.scala from log4j to SLF4J (compatible with all Spark versions)
- Update sbt to 1.9.7 for Scala 2.13 support
- Add Spark 4.0.0 to the esti.yaml Spark test matrix (all access modes)
- Add a Spark 4 compatibility test job against the latest lakeFS (1.78.0)
- Split spark-prep to use Java 8 for Spark 2/3 and Java 17 for Spark 4

Refs: treeverse/product#1076

Co-Authored-By: Claude Opus 4.6 <[email protected]>
lakeFS repository names must match ^[a-z0-9][a-z0-9-]{2,62} (no dots).
The Spark 4.0.0 Docker tag contains dots, which broke repo creation.
Add a separate repo_suffix field for valid repository names.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
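A quick sketch of the naming constraint, using Python's standard `re` module. The helper name and the example repository names are illustrative, not taken from the actual test code:

```python
import re

# lakeFS repository name rule quoted in the commit message:
# must match ^[a-z0-9][a-z0-9-]{2,62} — lowercase alphanumerics and
# hyphens only, so dots from a Docker tag like "4.0.0" are rejected.
REPO_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{2,62}$")

def is_valid_repo_name(name: str) -> bool:
    return REPO_NAME_RE.fullmatch(name) is not None

# A repo name built from the dotted Docker tag fails...
print(is_valid_repo_name("gateway-test-4.0.0"))    # False
# ...while a dot-free repo_suffix passes.
print(is_valid_repo_name("gateway-test-spark400"))  # True
```

This is why the dedicated `repo_suffix` field (or a dot-free tag) is needed: the Docker tag and the repository name obey different grammars.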
Use tag '4' instead of '4.0.0' — same digest, consistent with existing '2' and '3' tags, and avoids dots in repository names. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The S3 gateway virtual-hosted-style bucket URLs need DNS entries in the Spark containers. Add gateway-test-spark4 and gateway-redirect-test-spark4. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Redirect tests use path-style S3 access, so they don't need virtual-hosted-style DNS entries. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spark 4 (Hadoop 3.4.1) uses AWS SDK v2, which removed the QueryStringSignerType signer used by the redirect test. Restrict the redirect test to Spark 3 only (restoring the original condition). Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spark 4 (Hadoop 3.4.1, AWS SDK v2) doesn't support QueryStringSignerType. Skip setting the signing algorithm override for Spark 4 and let it use the default SigV4 signer. The redirect feature only needs path-style access and the s3RedirectionSupport user agent prefix. Re-enable the redirect test for Spark 4. Co-Authored-By: Claude Opus 4.6 <[email protected]>
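The version branch described above can be sketched as follows. This is an illustrative Python outline, not the actual esti code; the function name and endpoint are made up, but the Hadoop property keys (`fs.s3a.signing-algorithm`, `fs.s3a.user.agent.prefix`) are real S3A settings:

```python
# Illustrative sketch: assemble the S3A options passed to spark-submit,
# skipping the signer override on Spark 4.
def build_s3a_conf(spark_major: int, endpoint: str) -> dict:
    conf = {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # The s3RedirectionSupport User-Agent prefix tells the lakeFS
        # gateway that this client opts in to 307 redirect handling.
        "spark.hadoop.fs.s3a.user.agent.prefix": "s3RedirectionSupport",
    }
    if spark_major < 4:
        # QueryStringSignerType exists only in AWS SDK v1 (Spark 2/3).
        # Spark 4 (SDK v2) falls through to the default SigV4 signer.
        conf["spark.hadoop.fs.s3a.signing-algorithm"] = "QueryStringSignerType"
    return conf
```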
AWS SDK v2 (Hadoop 3.4.1) interprets lakeFS's 307 redirect as an S3 region redirect rather than following it to the presigned URL. This is a known incompatibility — users on Spark 4 should use lakeFSFS with presigned mode instead of S3 gateway redirect. Co-Authored-By: Claude Opus 4.6 <[email protected]>
I do not think the lakeFSFS compatibility tests run automatically on PRs (because reasons), so I ran them here.
I worry this is false. It provides the same direct-to-storage performance benefits, but it requires very different configuration, which our users might not have. It forces all users to change to that configuration: now they need to configure lakeFSFS with blockstore credentials. This configuration might not be possible.
Here's the additional context and sources that should have been in the original description:

Sources for the redirect incompatibility

The dependency chain:

AWS SDK v2 does not follow HTTP redirects at the transport layer. This appears to be an intentional design decision so that redirect handling can be done at the SDK layer rather than the HTTP layer (aws-sdk-java-v2#975, aws-sdk-java-v2#989).

Hadoop's S3A

When the lakeFS S3 gateway returns a 307 redirect, the SDK treats it as an error and unmarshalls it into an exception.

Possible workaround

Spark 4 offers a Hadoop-free build where users provide their own Hadoop version. Pairing Spark 4 with Hadoop 3.3.x (SDK v1) could preserve redirect support. I haven't been able to find any information on how well this would work. It's also not an option on managed platforms like Databricks or EMR, where the Hadoop version is fixed.

Impact on users who cannot use lakeFSFS

The PR description recommended lakeFSFS presigned as the alternative. This works for users who can install custom JARs, but as you mentioned, that may not be an option. From my understanding, the impacted use cases are those where custom JARs cannot be installed. For these users, there is currently no direct-to-storage path on Spark 4.
Integrate Spark 4.0.0 and 4.1.1 into the existing esti and compatibility test matrices instead of having a separate Spark 4 job. Uses repo_suffix to avoid dots in lakeFS repository names, and matrix include to limit Spark 4.x compat tests to lakeFS >= 1.78.0. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spark 4.1.1 ships with hadoop-aws 3.4.2 which requires the AWS SDK v2 bundle at runtime. Update the build to compile against the matching Hadoop version. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add contract-tests-hadoop342 Maven profile for testing the hadoopfs client against Hadoop 3.4.2 (used by Spark 4.1.1). The profile includes aws-java-sdk-bundle as a provided dependency since hadoop-aws 3.4.2 changed it from compile to provided scope. LakeFSFileSystemServerS3Test is excluded from this profile because Hadoop 3.4.2's S3A gets 403 errors when accessing minio in the test environment. This needs further investigation. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add fs.s3a.path.style.access=true to S3FSTestBase so Hadoop 3.4.2's S3A client (AWS SDK v2) uses path-style URLs when talking to MinIO. This fixes the 403 errors that caused LakeFSFileSystemServerS3Test to be excluded from the contract-tests-hadoop342 profile. Also remove unused gateway-test-spark4 host entries from docker-compose. Co-Authored-By: Claude Opus 4.6 <[email protected]>
On CI, AWS environment variables from the runner are picked up by Hadoop 3.4.2's SDK v2 credential chain before the Hadoop config credentials. This causes 403 errors when the CI runner's AWS creds are sent to MinIO instead of the test credentials. Explicitly set the credential provider to SimpleAWSCredentialsProvider to ensure only the Hadoop config access/secret keys are used. Co-Authored-By: Claude Opus 4.6 <[email protected]>
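The two S3A settings introduced by the last two commits can be summarized as follows. The actual change lives in the Java test base class; this is just the key/value view, with made-up MinIO-style test credentials. The property names (`fs.s3a.path.style.access`, `fs.s3a.aws.credentials.provider`) are real Hadoop S3A options:

```python
# Hadoop conf entries described in the commits, as a plain dict.
MINIO_S3A_CONF = {
    # Path-style URLs (http://minio:9000/bucket/key) instead of the
    # virtual-hosted style that SDK v2 defaults to.
    "fs.s3a.path.style.access": "true",
    # Pin the credential chain to the Hadoop config keys so the CI
    # runner's AWS_* environment variables are never consulted.
    "fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    "fs.s3a.access.key": "minioadmin",  # example test credentials,
    "fs.s3a.secret.key": "minioadmin",  # not real AWS keys
}
```

Without the explicit provider, SDK v2's default chain checks environment variables before the Hadoop config, which is exactly how the runner's real AWS credentials leaked into requests against MinIO.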
Adds Spark 4.0.0 and 4.1.1 to the existing Spark integration test suite to verify lakeFS compatibility.
Closes https://github.com/treeverse/product/issues/1076
Findings
lakeFSFS is compatible with Spark 4 provided the AWS SDK v1 is available. The lakeFS S3 Gateway is also compatible; however, the redirect optimization no longer works.
What was tested
Spark 4.0.0 (Scala 2.13, Hadoop 3.4.1, AWS SDK v2) was added to the existing integration test suite (see #10175).
- lakeFSFS (`lakefs://`, cluster has S3 creds)
- lakeFSFS presigned (`lakefs://`, no S3 creds needed)
- S3 gateway (`s3a://`)
- S3 gateway redirect (`s3a://`, lakeFS responding with 307 to presigned URL)

Why S3 Gateway redirect optimization is broken
Spark 4.0.0 upgrades Hadoop from 3.3.4 to 3.4.1 (release notes) and removes the AWS SDK v1 bundle. Hadoop 3.4.0 migrated S3A to AWS SDK v2 (HADOOP-18073).
AWS SDK v2 does not follow HTTP redirects at the transport layer. This is intentional (aws-sdk-java-v2#975, aws-sdk-java-v2#989). Both HTTP clients used by S3A treat non-2xx responses as errors. When lakeFS returns a 307, Hadoop's `S3AUtils.translateException()` treats it as an S3 region redirect, looks for an `x-amz-bucket-region` header that lakeFS doesn't set, and throws `AWSRedirectException: redirect to region null` (S3AUtils.java).

This is not fixable via configuration. The `QueryStringSignerType` signer, which was previously used to prevent double-signing of presigned URLs, also no longer exists in SDK v2. The redirect test has been restricted to Spark 3 only.

Impact
Users who can install custom JARs can migrate to lakeFSFS presigned mode, which provides the same direct-to-storage performance benefit as redirect mode. The migration requires adding the lakeFSFS JAR, switching URIs from `s3a://` to `lakefs://`, and configuring the lakeFS endpoint and blockstore credentials.

Users who cannot install custom JARs (notably Databricks SQL Warehouses, lakeFS Databricks docs) have no direct-to-storage path on Spark 4. All data will proxy through lakeFS, which is a significant performance and cost concern for large data volumes.
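A minimal sketch of that migration, in PySpark-style configuration. The `fs.lakefs.*` key names follow the lakeFS Hadoop filesystem documentation as I understand it; the endpoint, credentials, and helper function are placeholders, not values from this PR:

```python
# Illustrative lakeFSFS presigned configuration (key names assumed from
# the lakeFS Hadoop filesystem docs; values are placeholders).
LAKEFSFS_PRESIGNED_CONF = {
    "spark.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
    "spark.hadoop.fs.lakefs.endpoint": "https://lakefs.example.com/api/v1",
    "spark.hadoop.fs.lakefs.access.key": "<lakefs-access-key>",
    "spark.hadoop.fs.lakefs.secret.key": "<lakefs-secret-key>",
    # Presigned mode: reads and writes go directly to object storage
    # via URLs presigned by the lakeFS server.
    "spark.hadoop.fs.lakefs.access.mode": "presigned",
}

def migrate_uri(s3a_uri: str) -> str:
    """Switch a table URI from the S3 gateway scheme to lakeFSFS."""
    return s3a_uri.replace("s3a://", "lakefs://", 1)

print(migrate_uri("s3a://my-repo/main/sonnets.parquet"))
# lakefs://my-repo/main/sonnets.parquet
```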
Changes
- Changed `tag: 4` to `tag: "4.0.0"` and `tag: "4.1.1"`, with `repo_suffix` to avoid dots in lakeFS repository names
- Added `contract-tests-hadoop342` profile to the hadoopfs client for testing against Hadoop 3.4.2 (with explicit `aws-java-sdk-bundle` and `assertj-core` test deps)
- Added `gateway-test-spark400` and `gateway-test-spark411` host entries