
Conversation

amogh-jahagirdar
Contributor

@amogh-jahagirdar amogh-jahagirdar commented May 15, 2025

This change adds support for row lineage inheritance in the Avro reader.


This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jun 16, 2025

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jun 23, 2025
@github-actions github-actions bot removed the stale label Jun 27, 2025
@amogh-jahagirdar amogh-jahagirdar force-pushed the avro-row-lineage branch 2 times, most recently from 0b9123c to 75fe689 Compare June 30, 2025 13:52
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review June 30, 2025 13:54
Contributor

@stevenzwu stevenzwu left a comment


LGTM. Just a couple of nit comments.

@stevenzwu stevenzwu added this to the Iceberg 1.10.0 milestone Jun 30, 2025
Comment on lines +232 to +234
addFileFieldReadersToPlan(readPlan, record.getFields(), fieldReaders, idToPos, idToConstant);
addMissingFileReadersToPlan(readPlan, idToPos, expected, idToConstant, convert);
return readPlan;
Contributor Author


I didn't end up refactoring like we do in ParquetValueReaders; it ended up being a bit hard to read given the Avro positions. Instead, I split this into two separate methods: one that first adds the file field readers to the read plan, and another that then adds the missing readers to the plan. That also reduces cyclomatic complexity.
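
A minimal, self-contained sketch (not the PR's actual code) of the two-pass plan construction described above; the integer field ids and the "source" tags are stand-ins for the real Avro Schema.Field / ValueReader plumbing:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ReadPlanSketch {
  // a planned entry: which expected field it fills and where its value comes from
  record PlannedField(int fieldId, String source) {}

  static List<PlannedField> buildReadPlan(
      List<Integer> fileFieldIds, List<Integer> expectedFieldIds, Map<Integer, Object> idToConstant) {
    List<PlannedField> readPlan = new ArrayList<>();

    // pass 1 (addFileFieldReadersToPlan): fields present in the data file are read from the file
    for (int id : fileFieldIds) {
      readPlan.add(new PlannedField(id, "file"));
    }

    // pass 2 (addMissingFileReadersToPlan): expected fields missing from the file are filled
    // from constants/metadata (e.g. inherited row-lineage values) or defaults
    for (int id : expectedFieldIds) {
      if (!fileFieldIds.contains(id)) {
        readPlan.add(new PlannedField(id, idToConstant.containsKey(id) ? "constant" : "default"));
      }
    }

    return readPlan;
  }
}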

@amogh-jahagirdar
Contributor Author

amogh-jahagirdar commented Jul 1, 2025

Some related tests appear to be failing after the update with the refactoring; taking a look.

@amogh-jahagirdar
Contributor Author

Still figuring out what the issue is. Even after reverting back to the un-refactored code, some of the Spark 3.4 Avro tests still fail with the following. Note that the same Spark 3.5 tests consistently pass.

java.io.IOException: Block read partially, the data may be corrupt
org.apache.avro.AvroRuntimeException: java.io.IOException: Block read partially, the data may be corrupt
	at app//org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:237)
	at app//org.apache.iceberg.avro.AvroIterable$AvroRangeIterator.hasNext(AvroIterable.java:131)
	at app//org.apache.iceberg.avro.AvroIterable$AvroReuseIterator.hasNext(AvroIterable.java:193)
	at app//org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:64)
	at app//org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49)
	at app//org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:131)
	at app//org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:120)
	at app//org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:158)
	at app//org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
	at app//org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
	at app//scala.Option.exists(Option.scala:376)
	at app//org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
	at app//org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at app//scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at app//org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at app//org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at app//org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at app//org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
	at app//org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
	at app//org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at app//org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at app//org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at app//org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at app//org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at app//org.apache.spark.scheduler.Task.run(Task.scala:139)
	at app//org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at app//org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at app//org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at [email protected]/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.IOException: Block read partially, the data may be corrupt
	at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:222)
	... 31 more

@stevenzwu stevenzwu merged commit e64181d into apache:main Jul 1, 2025
42 checks passed
@stevenzwu
Contributor

Thanks @amogh-jahagirdar for the contribution and @nastra for the review.

@vlad-lyutenko

Still figuring out what the issue is. Even after reverting back to the un-refactored code, some of the Spark 3.4 Avro tests still fail with the following. Note that the same Spark 3.5 tests consistently pass.

java.io.IOException: Block read partially, the data may be corrupt

Hi @amogh-jahagirdar, we hit the same exception in our tests (in Trino) after upgrading to an Iceberg version containing these changes.

It looks like this: as far as I understand the change, you try to read row_id:

  public static ValueReader<Long> rowIds(Long baseRowId, ValueReader<?> idReader) {
    if (baseRowId != null) {
      return new RowIdReader(baseRowId, (ValueReader<Long>) idReader);
    } else {
      return ValueReaders.constant(null);
    }
  }  

Either as a constant or by calculating it as first_row_id + position.
But in our case it fails in the situation where _row_id is already present in the data file and we just want to read it as is (without calculating it on the fly, because the calculated value could actually be incorrect, for example after an update).

In this case the test fails: in the previous version we just read the field with a LongReader, but now it goes into this branch:

    } else {
      return ValueReaders.constant(null);
    }

instead of reading the value straight from the file.

Maybe you can suggest a way to use AvroIterable -> ValueReaders as a plain reader, without this functionality.

Big thanks in advance.

@vlad-lyutenko

Maybe you can suggest a way to use AvroIterable -> ValueReaders as a plain reader, without this functionality.

To be more specific, when we use AvroIterable -> ValueReaders to just read a data file, we don't know first_row_id from the table metadata, so
Long firstRowId = (Long) idToConstant.get(fieldId);

does not contain this information and we always end up with

    } else {
      return ValueReaders.constant(null);
    }

instead of:
return new RowIdReader(baseRowId, (ValueReader<Long>) idReader);

which could actually read row_id from the file.
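
To make that failure mode concrete, here is a tiny self-contained toy model (not the Iceberg classes) of the branch quoted above: when firstRowId is null because idToConstant has no entry for the _row_id field, the constant-null reader wins and the value already materialized in the file is never read. The field id 1000 and the reader interface are placeholders for illustration only.

import java.util.Map;

class RowIdBranchSketch {
  interface ValueReader<T> { T read(); }

  // toy stand-in for the ValueReaders.rowIds(...) method quoted above
  static ValueReader<Long> rowIds(Long baseRowId, ValueReader<Long> fileReader) {
    if (baseRowId != null) {
      return fileReader; // lineage path (simplified): the real code wraps this in a RowIdReader
    } else {
      return () -> null; // no first_row_id known: the field resolves to a constant null
    }
  }

  public static void main(String[] args) {
    Map<Integer, Object> idToConstant = Map.of(); // standalone file read: no table metadata available
    Long firstRowId = (Long) idToConstant.get(1000); // 1000 is a placeholder field id
    ValueReader<Long> reader = rowIds(firstRowId, () -> 42L); // the file materializes _row_id = 42
    System.out.println(reader.read()); // prints null: the materialized 42 is ignored
  }
}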

@vlad-lyutenko

Maybe you can suggest a way to use AvroIterable -> ValueReaders as a plain reader, without this functionality.

Sorry for disturbing you, I found the correct API. We can use:
PlannedDataReader.create(readSchema, idToConstant.buildOrThrow())

to adjust all the needed parameters.
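
For anyone landing here later, a rough sketch of how that can be wired up, based on the create(readSchema, idToConstant) call above; the Avro.read(...).project(...).createResolvingReader(...).build() builder chain is an assumption about the Iceberg reader API rather than something taken from this PR, so verify it against the version in use:

import java.util.Map;
import org.apache.iceberg.Schema;
import org.apache.iceberg.avro.Avro;
import org.apache.iceberg.avro.AvroIterable;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.avro.PlannedDataReader;
import org.apache.iceberg.io.InputFile;

class PlannedAvroReadSketch {
  // open a data file with whatever constants (e.g. first_row_id) the caller knows about
  static AvroIterable<Record> open(InputFile file, Schema readSchema, Map<Integer, Object> idToConstant) {
    return Avro.read(file)
        .project(readSchema)
        // PlannedDataReader carries the id-to-constant map, so row-lineage fields
        // resolve the same way as in the table scan path
        .createResolvingReader(schema -> PlannedDataReader.create(schema, idToConstant))
        .build();
  }
}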
