Conversation

@jshmchenxi (Contributor) commented Feb 10, 2025

When importing a partitioned Spark table into Iceberg via SparkTableUtil, any partition whose directory is missing on the underlying filesystem will cause the migration to fail:

Caused by: java.lang.RuntimeException: Unable to list files in partition: s3://bucket/table/partition=foo
	at org.apache.iceberg.data.TableMigrationUtil.listPartition(TableMigrationUtil.java:206)
	at org.apache.iceberg.spark.SparkTableUtil.listPartition(SparkTableUtil.java:309)
	at org.apache.iceberg.spark.SparkTableUtil.lambda$importSparkPartitions$37333fc7$1(SparkTableUtil.java:767)
	at org.apache.spark.sql.Dataset.$anonfun$flatMap$2(Dataset.scala:3484)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:375)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.FileNotFoundException: No such file or directory: s3://bucket/table/partition=foo
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3799)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3650)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerListStatus(S3AFileSystem.java:3373)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$null$22(S3AFileSystem.java:3344)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listStatus$23(S3AFileSystem.java:3343)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2478)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2497)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:3342)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2078)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2122)
	at org.apache.iceberg.data.TableMigrationUtil.listPartition(TableMigrationUtil.java:167)
	... 32 more

This typically happens when a partition directory has been deleted—either manually or by an automated cleanup—while its metadata still exists in the metastore.

Problem:

  • Migrating Spark tables to Iceberg crashes on partitions whose directories don’t exist.
  • Users must manually recreate directories or clean up metadata before retrying.

Solution:

  • Introduce a new boolean flag, ignoreMissingFiles, in the importSparkPartitions and importSparkTable methods.
  • When enabled, partitions whose locations cannot be listed will be skipped rather than causing a hard failure.
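As a rough illustration of the intended behavior, here is a self-contained sketch using `java.nio.file` rather than the actual Iceberg code; the names `listPartition` and `importPartitions` below are hypothetical stand-ins for `TableMigrationUtil.listPartition` and the `SparkTableUtil` import path:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MissingPartitionDemo {

  // Stand-in for TableMigrationUtil.listPartition: lists the data files in one
  // partition directory, throwing if the directory does not exist.
  static List<Path> listPartition(Path partitionDir) throws IOException {
    try (Stream<Path> entries = Files.list(partitionDir)) {
      return entries.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }

  // The pattern this PR introduces: the caller decides via ignoreMissingFiles
  // whether a missing partition location is skipped with a warning or rethrown.
  static List<Path> importPartitions(List<Path> partitionDirs, boolean ignoreMissingFiles)
      throws IOException {
    List<Path> files = new ArrayList<>();
    for (Path dir : partitionDirs) {
      try {
        files.addAll(listPartition(dir));
      } catch (NoSuchFileException e) {
        if (ignoreMissingFiles) {
          System.err.println("WARN: skipping missing partition location: " + dir);
        } else {
          throw e;
        }
      }
    }
    return files;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("table");
    Path existing = Files.createDirectory(root.resolve("partition=foo"));
    Files.createFile(existing.resolve("data-0.parquet"));
    // Partition directory that the metastore would still reference but that is
    // gone from the filesystem.
    Path missing = root.resolve("partition=bar");

    // With the flag set, the missing partition is skipped.
    System.out.println("files=" + importPartitions(List.of(existing, missing), true).size());

    // Without the flag, the old hard-failure behavior is preserved.
    try {
      importPartitions(List.of(existing, missing), false);
    } catch (NoSuchFileException e) {
      System.out.println("failed as expected");
    }
  }
}
```

In the PR itself, the analogous switch is the new `ignoreMissingFiles` parameter on `importSparkTable` and `importSparkPartitions`.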

Next Steps:

Extend ignoreMissingFiles support to the Spark procedures snapshot and migrate in a follow-up PR.

@github-actions github-actions bot added the data label Feb 10, 2025
@jshmchenxi jshmchenxi changed the title Handle case where partition location is missing from the file system in TableMigrationUtil Data: Handle case where partition location is missing for TableMigrationUtil Feb 10, 2025
@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch 2 times, most recently from 5b7ff64 to 658010c Compare February 10, 2025 02:31
@manuzhang (Member)

@jshmchenxi Thanks for the fix. Can you add a test?

@RussellSpitzer (Member) left a comment

Looks good although I agree we need a test to check that this is working as expected.

@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch 2 times, most recently from 36e2b7b to f3c2e11 Compare February 11, 2025 06:52
@jshmchenxi (Contributor Author)

@manuzhang @RussellSpitzer Thanks for the suggestion! I've added test cases to cover this change.

@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch 2 times, most recently from 60ef66a to 23a1b11 Compare February 11, 2025 06:56
@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch from 23a1b11 to 6b96233 Compare February 11, 2025 07:04
@manuzhang (Member)

@jshmchenxi Can we add an end-to-end test in TestSnapshotTableAction?

@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch from 6b96233 to 2aacf49 Compare February 16, 2025 02:34
@github-actions github-actions bot added the spark label Feb 16, 2025
@jshmchenxi (Contributor Author)

> @jshmchenxi Can we add an end-to-end test in TestSnapshotTableAction?

@manuzhang I've added the end-to-end test. Please take a look.

@jshmchenxi (Contributor Author)

Kindly ping @manuzhang @RussellSpitzer @stevenzwu

@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch 2 times, most recently from 5e58ba9 to e0c9f62 Compare February 21, 2025 01:31
@jshmchenxi (Contributor Author)

@RussellSpitzer Hi, does anything else need to be updated for this PR?

@jshmchenxi (Contributor Author)

Hi @manuzhang @RussellSpitzer @ebyhr, just checking in on this PR. All feedback from the previous rounds has been addressed, and I believe it’s ready for the final review. I know you’re busy, but I’d really appreciate it if you could take a look when you have some time. Thanks so much!

github-actions (bot)

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 31, 2025
@github-actions github-actions bot removed the stale label Apr 1, 2025
@jshmchenxi (Contributor Author)

@Fokko Hi! When you get a chance, could you help review this PR? It’s been open for a little while. Appreciate it!

@manuzhang (Member)

Let's wait for more time, since folks are busy with Iceberg Summit now.

@jshmchenxi (Contributor Author) commented Apr 22, 2025

Kindly pinging @RussellSpitzer @huaxingao — when you have a moment, could you please help review this change? Looking forward to getting it merged. Thanks in advance!

@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch from e0c9f62 to 30efd1c Compare April 22, 2025 15:52
@RussellSpitzer (Member)

Tests are currently failing. Could we get those fixed, please?

.filter(FileStatus::isFile)
.collect(Collectors.toList());
List<FileStatus> fileStatus;
if (fs.exists(partitionDir)) {
Member:

I'm actually a little worried now that we are going to silently error out on a permissions issue here. Should we have this behavior behind a flag?

@RussellSpitzer (Member) commented Apr 22, 2025

We added a flag "checkDuplicateFiles" to SparkTableUtil.

I'm afraid we don't have a clean way of adding more parameters to this method, but I'm still not confident we should go from failure to warning for everyone all at once.

I think it may be best to have a flag in the caller, so it looks something like:

    try {
      listPartition(...);
    } catch (FileNotFoundException e) {
      if (ignoreMissingFiles) {
        LOG.warn(...);
      } else {
        throw e;
      }
    }

@jshmchenxi (Contributor Author)

That makes sense. Should we add the flag to the Spark migrate and snapshot procedures as well?

Member:

That makes sense to me, but you can do that in a followup if you don't want this PR to get too large

@jshmchenxi (Contributor Author)

Thanks for the suggestion! I've added the new flag to callers in SparkTableUtil.

.filter(FileStatus::isFile)
.collect(Collectors.toList());
} else {
LOG.info(
Member:

This should probably be a "warn" at least

@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch 3 times, most recently from 00a7c87 to 084bc85 Compare May 2, 2025 09:52
@jshmchenxi jshmchenxi force-pushed the bugfix/spark-util-partition-location-nonexist branch from 084bc85 to 909f0c2 Compare May 2, 2025 11:55
@jshmchenxi jshmchenxi changed the title Data: Handle case where partition location is missing for TableMigrationUtil Spark: Add flag to handle missing files for importSparkTable May 2, 2025
@@ -20,6 +20,7 @@

import static org.apache.spark.sql.functions.col;

import java.io.FileNotFoundException;
Member:

Not a big deal since this change set is small, but usually we only do a change in one Spark version at a time, review and then backport.

@jshmchenxi (Contributor Author)

Got it, I’ll follow this approach in the next PRs.

@@ -2170,6 +2173,99 @@ public void testTableWithInt96Timestamp() throws IOException {
}
}

@Test
public void testSparkTableWithMissingFilesFailure() throws IOException {
Member:

testImportSparkTableWith...

Member:

Just want to make sure we note what's going on

@jshmchenxi (Contributor Author)

updated

Lists.newArrayList(new SimpleRecord(1, "a"), new SimpleRecord(2, "b"));

Dataset<Row> inputDF = spark.createDataFrame(records, SimpleRecord.class);
inputDF.select("data", "id").write().mode("overwrite").insertInto("parquet_table");
Member:

Minor question here: do we need the "select"?

@jshmchenxi (Contributor Author)

No, we don't. It was copied from other tests. I'll remove this and the name mapping part as well.

new org.apache.spark.sql.catalyst.TableIdentifier("parquet_table"),
table,
stagingLocation))
.hasMessageContaining("Unable to list files in partition")
Member:

We can also check that the error message contains partitionLocationPath.

@jshmchenxi (Contributor Author)

That makes sense!

Path partitionLocationPath = parquetTablePath.toPath().resolve("id=1234");
java.nio.file.Files.delete(partitionLocationPath);

NameMapping mapping = MappingUtil.create(table.schema());
Member:

I don't think you actually need the name mapping for this test

@RussellSpitzer (Member) left a comment

I have a few nits here, but I think we are pretty close! Let's clean those up and we can get this merged

@jshmchenxi (Contributor Author)

@RussellSpitzer Thanks for the review! I've pushed an update to address the comments.

@RussellSpitzer (Member)

Thanks @jshmchenxi for the PR! Thank you to @manuzhang, @rohanag12, and @ebyhr for the reviews!

@RussellSpitzer RussellSpitzer merged commit af32a07 into apache:main May 5, 2025
42 checks passed
anuragmantri added a commit to anuragmantri/iceberg that referenced this pull request Jul 25, 2025