Conversation


@HaoranYi HaoranYi commented Oct 3, 2025

Problem

We want to avoid using Arc in the in-memory index to save 16 bytes per entry. However, the current disk flush implementation clones entries (requiring Arc), preventing this optimization. This PR changes disk flush to avoid entry cloning and refactors the eviction logic to minimize lock contention.
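
For context, one way to read the 16-byte figure (an assumption for illustration; the PR does not break the number down): each Arc allocation stores a strong and a weak reference count of 8 bytes each on the heap in front of the value, overhead that goes away when the entry is stored inline. A minimal stand-alone sketch:

use std::sync::Arc;

fn main() {
    // The Arc pointer itself is one machine word in the map...
    println!("Arc<u64> pointer: {} bytes", std::mem::size_of::<Arc<u64>>());

    // ...but every Arc allocation also carries a strong and a weak counter
    // (8 bytes each) on the heap in front of the value -- 16 bytes of
    // per-entry bookkeeping that disappears if the entry is stored inline.
    let counters = 2 * std::mem::size_of::<std::sync::atomic::AtomicUsize>();
    println!("per-allocation counter overhead: {counters} bytes");
}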

Summary of Changes

Refactor disk flush to avoid cloning entries and minimize read lock contention, enabling future memory optimization by removing Arc from in-memory index entries.

Performance Measurement

The mainnet results indicate that the read lock is held for only ~3K µs, compared with a total update time of 300K–3.5M µs.
This shows that lock contention is minimal: the time spent under the read lock is only a very small fraction of the overall update duration.

[chart: read lock hold time vs. total update time on mainnet]

Fixes #

@HaoranYi HaoranYi force-pushed the disk_flush_no_entry_clone branch from 5a37af8 to cb65e7d on October 3, 2025 17:17
}

-    possible_evictions.insert(0, *k, Arc::clone(v));
+    possible_evictions.insert(0, *k);

My current idea is to capture dirty as an extra value and use that for deciding whether to obtain the entry under the map lock, clear dirty, and write to disk.
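
A self-contained sketch of that idea, using a plain RwLock<HashMap<..>> stand-in rather than the real index types (names are illustrative, not the actual agave code): the scan records only (key, was_dirty), and the map is re-locked briefly per dirty key to clear the flag and copy out the data before the write.

use std::collections::HashMap;
use std::sync::{
    atomic::{AtomicBool, Ordering},
    RwLock,
};

struct Entry {
    dirty: AtomicBool,
    value: u64,
}

// Scan phase: remember only (key, was_dirty) per candidate; no entry clone,
// so no Arc is needed to keep the entry alive outside the lock.
fn gather(map: &RwLock<HashMap<u32, Entry>>) -> Vec<(u32, bool)> {
    let guard = map.read().unwrap();
    guard
        .iter()
        .map(|(k, e)| (*k, e.dirty.load(Ordering::Acquire)))
        .collect()
}

// Flush phase: re-lock briefly only for candidates that were dirty, clear the
// flag, copy out what the write needs, and drop the guard before any I/O.
fn flush(map: &RwLock<HashMap<u32, Entry>>, candidates: &[(u32, bool)]) {
    for &(key, was_dirty) in candidates {
        if !was_dirty {
            continue;
        }
        let to_write = {
            let guard = map.read().unwrap();
            let Some(entry) = guard.get(&key) else { continue };
            entry.dirty.store(false, Ordering::Release);
            entry.value
        }; // read guard dropped here
        println!("write {key} -> {to_write}"); // stand-in for the disk write
    }
}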

Author

Yeah, it is a great idea.

@HaoranYi HaoranYi force-pushed the disk_flush_no_entry_clone branch from cb65e7d to 996d94a on October 3, 2025 18:36
return None;
}
let evictions_age: Vec<_> = {
let map = self.map_internal.read().unwrap();
Author

Here is the tradeoff: the read lock is now held during disk I/O operations...

Author

To alleviate this, we chunk up the items and release the lock after each chunk.
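
Roughly, the chunked flush looks like the stand-alone sketch below (illustrative types only, not the actual agave code; the chunk size of 20 comes from a later commit message):

use std::collections::HashMap;
use std::sync::RwLock;

const CHUNK_SIZE: usize = 20; // chunk size per the later commit message

// Walk the candidates in small chunks, taking the read lock once per chunk
// so writers can make progress between chunks.
fn process_in_chunks(map: &RwLock<HashMap<u32, u64>>, candidates: &[u32]) {
    for chunk in candidates.chunks(CHUNK_SIZE) {
        let guard = map.read().unwrap(); // re-acquired for each chunk
        for key in chunk {
            if let Some(value) = guard.get(key) {
                // ... decide whether this entry still needs to be flushed ...
                let _ = value;
            }
        }
        // guard dropped here; the lock is released between chunks
    }
}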

@HaoranYi HaoranYi force-pushed the disk_flush_no_entry_clone branch 4 times, most recently from b8fb32e to 74ff955 on October 3, 2025 22:16

codecov-commenter commented Oct 3, 2025

Codecov Report

❌ Patch coverage is 29.70297% with 71 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.2%. Comparing base (35c8486) to head (af3fcda).
⚠️ Report is 191 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #8330   +/-   ##
=======================================
  Coverage    83.2%    83.2%           
=======================================
  Files         838      838           
  Lines      368496   368539   +43     
=======================================
+ Hits       306667   306742   +75     
+ Misses      61829    61797   -32     

Process evictions_age_possible in chunks of 20 items instead of holding
the map_internal read lock for the entire eviction set. This reduces
lock contention by releasing and re-acquiring the lock between chunks.
@HaoranYi HaoranYi force-pushed the disk_flush_no_entry_clone branch from 74ff955 to f1d6a0d on October 7, 2025 14:36
@HaoranYi HaoranYi changed the title from "disk_flush_no_entry_clone" to "Refactor disk flush to eliminate entry cloning and reduce read lock contention" on Oct 7, 2025
@HaoranYi HaoranYi force-pushed the disk_flush_no_entry_clone branch from 8526a05 to 7f6cf59 on October 7, 2025 15:42
-    possible_evictions.insert(0, *k, Arc::clone(v));
+    // Capture dirty and ref_count early during scan
+    let is_dirty = v.dirty();
+    let ref_count = v.ref_count();


I think in 99% of cases flush_internal will skip the entry if ref_count != 1:

  • for the dirty case, we re-check the ref_count in should_evict_from_mem and return false for ref_count != 1, so the only situation where we would write to disk with ref_count != 1 is when it changed to 1 between gathering evictions and flushing
  • for the non-dirty case, you actually do the check if *ref_count != 1 { in that branch of flush_internal

So I think we could just skip it here and not put the ref_count into the vector at all.

Author

Yeah, done in f8577c5.


if !should_evict {
// not evicting, so don't write, even if dirty
drop(map);


It might help readability a bit, but at the same time it's kind of confusing that we manually drop the map guard just before return None, which would drop it anyway...

Author

Done in f8577c5.

// Entry was dirty at scan time, need to write to disk
// Lock the map briefly to get the full entry reference
let lock_measure = Measure::start("flush_read_lock");
let map = self.map_internal.read().unwrap();


Call it something like map_read_guard to highlight that we are holding a lock.

Author

Done in 5f62787.

// since we know slot_list.len() == 1, we can create a stack-allocated array for the single element.
let (slot, info) = slot_list[0];
let disk_entry = [(slot, info.into())];
let disk_ref_count = ref_count;


I think a cleaner way to handle the lifetime of the map guard would be to do something like

let (disk_entry, disk_ref_count) = {
    let map_read_guard = ...
    if !... {
        return None;
    }
    ([(slot, info.into())], ref_count)
};
// unconditionally write to disk
loop {
}


Alternatively, we could move the map guard outside of the outer for loop as an Option, to re-use the lock for entries that we skip writing, so something like

let mut map_read_guard = Some(self.map_internal.read().unwrap());
for (k, v) in possible_evictions {
    ..
    if ..should write.. {
        map_read_guard = None;
        ..write..
        map_read_guard = Some(self.map_internal.read().unwrap())
    }
}

Author

Yes, I think option 1 is better in that we yield the read lock like before.
Done in 8c566a0

@HaoranYi HaoranYi marked this pull request as ready for review October 23, 2025 14:31
Optimize eviction candidate filtering by rejecting entries with ref_count != 1
during the initial scan phase, before they reach flush_internal or evict_from_cache.

Changes:
- Updated FlushScanResult to store (Pubkey, bool) instead of (Pubkey, bool, RefCount)
- Modified gather_possible_evictions to filter ref_count != 1 early with clear rationale
- Removed ref_count parameter from PossibleEvictions::insert()
- Simplified flush_internal non-dirty path (no ref_count check needed)
- Removed redundant drop() calls before return statements (locks release automatically)
- Added comments explaining automatic lock release for better readability
- Updated test to match new tuple structure

Rationale:
In 99% of cases, entries with ref_count != 1 will be rejected later by:
- should_evict_from_mem() for dirty entries
- evict_from_cache() for non-dirty entries

By filtering early, we:
1. Reduce unnecessary work processing candidates that will be rejected
2. Avoid write lock contention in evict_from_cache for non-dirty entries
3. Simplify the code by removing redundant checks and explicit drops
Rename the variable holding the read guard from 'map' to 'map_read_guard'
to make it explicit that it's a guard holding a read lock on map_internal.

This improves code readability and follows Rust naming conventions for guards.
Refactor to use a scope block for managing the map_read_guard lifetime,
making the control flow clearer and ensuring locks are released as soon
as data extraction is complete.

Changes:
- Wrapped map access and data extraction in a scope block
- Locks (map_read_guard and slot_list) automatically release at block end
- Removed explicit drop() calls - no longer needed
- Disk write unconditionally happens after lock release
- Inverted clear_dirty() check for early return on non-dirty path

Benefits:
- Cleaner control flow with explicit scope boundaries
- Impossible to accidentally hold locks during disk I/O
- More idiomatic Rust with automatic guard drop
- Easier to understand lock lifetime at a glance
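
The shape of that scope block, as a stand-alone sketch (simplified types; swap stands in for the real clear_dirty, and write_to_disk for the real disk write):

use std::collections::HashMap;
use std::sync::{
    atomic::{AtomicBool, Ordering},
    RwLock,
};

struct Entry {
    dirty: AtomicBool,
    slot: u64,
    info: u64,
}

fn flush_one(map: &RwLock<HashMap<u32, Entry>>, key: u32) -> Option<()> {
    // Everything that needs the map stays inside this block, so the read
    // guard cannot outlive it and can never be held across the disk write.
    let disk_entry = {
        let map_read_guard = map.read().unwrap();
        let entry = map_read_guard.get(&key)?;
        // Inverted dirty check: early return on the non-dirty path.
        if !entry.dirty.swap(false, Ordering::AcqRel) {
            return None;
        }
        [(entry.slot, entry.info)] // stack-allocated single-element "slot list"
    }; // map_read_guard dropped here
    // The write happens unconditionally after the lock is released.
    write_to_disk(key, &disk_entry);
    Some(())
}

fn write_to_disk(key: u32, entry: &[(u64, u64)]) {
    println!("disk write for {key}: {entry:?}"); // stand-in for the real I/O
}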
@HaoranYi HaoranYi requested a review from brooksprumo October 23, 2025 15:28
@brooksprumo

Is this PR running on a validator against mnb that has the disk index enabled?

@brooksprumo brooksprumo requested a review from roryharr October 23, 2025 18:10
@HaoranYi
Author

Is this PR running on a validator against mnb that has the disk index enabled?

Yes, it is running and the ID is 9eNX7h5wHH4GTqESybhHWZVvEX7nzxB6e7Q4J8RfAH2u

@HaoranYi
Author

[chart: read lock time over the past day, dropping from ~3K µs to ~1.5K µs]

Looks like the read lock time was cut down by 50% after all these PR review changes, 3K µs -> 1.5K µs.
The rightmost 30 minutes is when the restart happened.

@HaoranYi HaoranYi changed the title from "Refactor disk flush to eliminate entry cloning and reduce read lock contention" to "Refactor disk flush" on Oct 23, 2025
@brooksprumo

Also, I appreciate the PR title change! IMO when I see "refactor", I interpret that to mean no behavioral change. For this PR though, we are changing behavior quite a bit.

Wdyt about a title like:

"Eliminate entry cloning when flushing index"

@HaoranYi HaoranYi changed the title from "Refactor disk flush" to "Eliminate entry cloning when flushing index" on Oct 23, 2025
Preserve and reword the important comment from the original code that
explains how concurrent modifications are handled when clearing the
dirty flag and writing to disk.

@brooksprumo brooksprumo left a comment


:shipit:


@kskalski kskalski left a comment


Cool, looks good!

@HaoranYi HaoranYi added this pull request to the merge queue Oct 24, 2025
Merged via the queue into anza-xyz:master with commit bdbedd6 Oct 24, 2025
43 checks passed
@HaoranYi HaoranYi deleted the disk_flush_no_entry_clone branch October 24, 2025 13:59