feat: provide inline_transaction model for IO optimizing #4774
base: main
Conversation
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff             @@
##              main    #4774      +/-  ##
==========================================
+ Coverage    80.84%   80.94%   +0.09%
==========================================
  Files          330      331       +1
  Lines       130545   130751     +206
  Branches    130545   130751     +206
==========================================
+ Hits        105540   105836     +296
+ Misses       21269    21165     -104
- Partials      3736     3750      +14
==========================================
```

☔ View full report in Codecov by Sentry.
protos/table.proto (Outdated)

```proto
// if < 200KB, store the transaction content inline
bytes transaction_content = 19;
// if >= 200KB, store the transaction content at the specified offset
TransactionSection transaction_section = 20;
```
I was thinking we should just keep it as a separate file, which seems more backwards compatible. Is there any other benefit to storing it at an offset in the manifest?
For backwards compatibility we should write both for a while. We can stop writing the external file after adding a feature flag 6+ months in the future. I've described the sequence of steps in #3487
I think putting inline in the protobuf is necessary. We can put it at an offset and use existing optimizations to read it without an additional IOP.
protos/table.proto (Outdated)

```proto
// The transaction that created this version.
oneof inline_transaction {
  // if < 200KB, store the transaction content inline
  bytes transaction_content = 19;
  // if >= 200KB, store the transaction content at the specified offset
  TransactionSection transaction_section = 20;
}
```
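As a sketch only, the write-path branching this `oneof` implies could look like the following (the `InlineTransaction` enum and `choose_encoding` helper are hypothetical, not Lance's actual API):

```rust
// Hypothetical sketch of choosing between the two oneof arms at write time.

const INLINE_THRESHOLD: usize = 200 * 1024; // 200 KB, per the proto comment

enum InlineTransaction {
    /// < 200 KB: transaction bytes embedded directly in the manifest message
    Content(Vec<u8>),
    /// >= 200 KB: transaction written at an offset within the manifest file
    Section { offset: u64, length: u64 },
}

fn choose_encoding(tx_bytes: Vec<u8>, next_free_offset: u64) -> InlineTransaction {
    if tx_bytes.len() < INLINE_THRESHOLD {
        InlineTransaction::Content(tx_bytes)
    } else {
        InlineTransaction::Section {
            offset: next_free_offset,
            length: tx_bytes.len() as u64,
        }
    }
}
```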
We should just unconditionally store the transaction at an offset, IMO. This makes this simpler, and it's still possible to read the transaction when it is small.
We currently store the index metadata at an offset, and here's how it works right now:
- When we read the manifest file, we always read the last `block_size` bytes (4kb for local fs, 64kb for object storage).
- We decode as much as we can from that last chunk. If the entire manifest file is < 64kb, this often means we get the entire content (including the index metadata) in 1 IOP.
- If that last chunk wasn't sufficient, we read the remainder.
This basically gives the same effect as the optionally inline message: If it's small, we automatically read the index and transaction information in the first IO request. But if it's large, we do it in a separate request.
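That tail-read fallback can be sketched roughly like this (a simplified stand-in working on in-memory byte slices rather than object storage; `read_section` is a hypothetical name, and it assumes the section lies entirely within the file):

```rust
// Sketch of the speculative tail read: use the already-fetched final block
// when the section fits inside it; otherwise fall back to a ranged read.

/// Return the bytes of a section at `offset`, using the pre-fetched
/// `last_block` (the final `block_size` bytes of the file) when possible.
fn read_section<'a>(
    file: &'a [u8],       // stand-in for the whole manifest file
    last_block: &'a [u8], // tail already fetched in the first IOP
    offset: usize,
    len: usize,
) -> &'a [u8] {
    let file_size = file.len();
    if file_size - offset <= last_block.len() {
        // Section lies entirely inside the tail we already have: no extra IO.
        let start = last_block.len() - (file_size - offset);
        &last_block[start..start + len]
    } else {
        // Tail was not enough; this stands in for the second ranged request.
        &file[offset..offset + len]
    }
}
```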
Here's the place where we opportunistically read the index metadata from manifests:
lance/rust/lance/src/dataset.rs, lines 511 to 536 in a385965:

```rust
// If indices were also the last block, we can take the opportunity to
// decode them now and cache them.
if let Some(index_offset) = manifest.index_section {
    if manifest_size - index_offset <= last_block.len() {
        let offset_in_block = last_block.len() - (manifest_size - index_offset);
        let message_len =
            LittleEndian::read_u32(&last_block[offset_in_block..offset_in_block + 4])
                as usize;
        let message_data =
            &last_block[offset_in_block + 4..offset_in_block + 4 + message_len];
        let section = lance_table::format::pb::IndexSection::decode(message_data)?;
        let indices: Vec<Index> = section
            .indices
            .into_iter()
            .map(Index::try_from)
            .collect::<Result<Vec<_>>>()?;
        let ds_index_cache = session.index_cache.for_dataset(uri);
        let metadata_key = crate::session::index_caches::IndexMetadataKey {
            version: manifest_location.version,
        };
        ds_index_cache
            .insert_with_key(&metadata_key, Arc::new(indices))
            .await;
    }
}
```
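The snippet above relies on simple framing: a little-endian `u32` length prefix followed by the protobuf payload. A minimal round-trip sketch of that framing (the `write_framed`/`read_framed` helpers are hypothetical, for illustration only):

```rust
// Hypothetical sketch of the length-prefixed section framing: a u32 LE
// length followed by the payload bytes, appended at the current end of file.

/// Append a framed payload to `buf` and return the offset that would be
/// recorded in the manifest (e.g. in a hypothetical transaction_section).
fn write_framed(buf: &mut Vec<u8>, payload: &[u8]) -> u64 {
    let offset = buf.len() as u64;
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    offset
}

/// Read back the payload framed at `offset`.
fn read_framed(buf: &[u8], offset: usize) -> &[u8] {
    let len = u32::from_le_bytes(buf[offset..offset + 4].try_into().unwrap()) as usize;
    &buf[offset + 4..offset + 4 + len]
}
```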
protos/table.proto (Outdated)

```proto
message TransactionSection {
  // The summary of the transaction including operation type and some kvs in transaction_properties
  // This is used to get some information about the transaction without loading the whole transaction content
  map<string, string> summary = 1;
```
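As an illustration only (the keys and the `build_summary` helper are hypothetical, not part of the proposal), such a summary map might be populated from the operation name plus a small allow-list of `transaction_properties`:

```rust
// Hypothetical construction of the proposed summary map: enough to render a
// history listing without decoding the full transaction content.

use std::collections::HashMap;

fn build_summary(operation: &str, props: &HashMap<String, String>) -> HashMap<String, String> {
    let mut summary = HashMap::new();
    summary.insert("operation".to_string(), operation.to_string());
    // Carry over only a small allow-list of transaction_properties
    // (keys here are made up for the example).
    for key in ["commit-message", "engine"] {
        if let Some(v) = props.get(key) {
            summary.insert(key.to_string(), v.clone());
        }
    }
    summary
}
```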
Hi @wjones127, what do you think of this part?
Basically it's redundant content for operation and transaction_properties. The case I want to solve involves listing historical manifests and transaction summaries. I'm not sure the opportunistic read would actually help this listing case if the manifest is large (actually, the larger the manifest, the more we need to accelerate it).
On the other hand, some information, like the commit message, may be useful to read directly from the manifest. So I added this summary field so it can be read from the manifest without reading the transaction part.
cc @jackye1995
For the list transactions use case, I can't help feeling the best thing would be to cache compacted summary data in some secondary file. For example, we could generate some Lance file next to the manifest that contains all transaction summaries before it. Then to get most of the history you just need to query that, and read the transactions of the next few uncached versions. What do you think of that?
How much performance do we want for this use case? I always thought reading information like the transaction summary is more for non-performance-critical workloads, like displaying info to the user, triggering some optimization jobs based on the action performed, etc., that can stand a few more IOPS.
> Like we could generate some Lance file next to the manifest that contains all transaction summaries before it

I had the same idea. I thought Lance might provide a lazy summary file that includes tags and branches, summaries, and properties, and could use the index framework to maintain this file. This was only an early idea; I could raise a discussion for it. Then we should ignore the summary here.
> How much performance do we want for this use case? I always thought reading information like transaction summary is more for non performance critical workloads, like displaying info to the user, triggering some optimization jobs based on the action performed, etc. that can stand a few more IOPS

For my case, one page displays at least 20 versions, and each version needs summary info from both the manifest and the transaction. If the files are separated, the IO cost is magnified roughly 20x. Say reading one transaction file costs 100ms; that means at least 2 seconds of added latency.
Yeah, I'm inclined not to have the summary there. Just keep it in the transaction. Pulling it out into the manifest means that listing is still O(num versions) IO requests. To make that fast, it's better to create some mechanism to query it.
> non performance critical workloads, like displaying info to the user
I'd argue that displaying info to the user is somewhat performance critical, in that we'd like it to return fast enough to feel responsive.
> If files are separated, the IO costs might have a magnification of 20x. Let's say reading a transaction file costs 100ms; that means 2 seconds magnification at least.

> I'd argue that displaying info to the user is somewhat performance critical, in that we'd like it to return fast enough to feel responsive.
I was thinking we will just make the requests in parallel, so the latency won't really be 2 seconds, more like 200-300ms if anything got throttled and retried. But agree we need to make sure it is fast enough to feel responsive.
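A rough sketch of that parallel fan-out, using plain std threads as a stand-in (the real reader would issue async object-store requests; the sleep simulates one per-version read, and `fetch_summaries` is a hypothetical name):

```rust
// Sketch: fetch the per-version transaction summaries concurrently, so the
// wall-clock cost is bounded by the slowest read rather than the sum.

use std::thread;
use std::time::Duration;

fn fetch_summaries(versions: Vec<u64>) -> Vec<(u64, String)> {
    let handles: Vec<_> = versions
        .into_iter()
        .map(|v| {
            thread::spawn(move || {
                // Stand-in for one object-store read for version `v`.
                thread::sleep(Duration::from_millis(10));
                (v, format!("summary for version {v}"))
            })
        })
        .collect();
    // Join preserves the original version order.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

With 20 versions the sequential cost would be ~20 reads back to back, while the concurrent version pays roughly one read's latency plus any throttling retries.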
> I thought lance might provide a lazy summary file that includes tags and branches, summaries and properties. And could use the index framework to maintain this file.
That sounds like a nice idea. I originally created the concept of the "system index" basically for such use cases.
Ready for review @wjones127 cc @jackye1995

Hi @wjones127, @jackye1995, I'm working on issues #4308 and #3487.
In this PR I propose the inline_transaction model for IO optimization. The motivation includes: