Conversation


@HaoranYi HaoranYi commented Oct 3, 2025

Problem

We want to avoid using Arc in the in-memory index to save 16 bytes per entry. However, the current disk flush implementation clones entries (requiring Arc), preventing this optimization. This PR changes disk flush to avoid entry cloning and refactors the eviction logic to minimize lock contention.
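
For context, one way to read the 16-byte figure (an assumption for illustration; the PR does not break the number down): each Arc allocation stores a strong and a weak reference count of 8 bytes each on the heap in front of the value, overhead that goes away when the entry is stored inline. A minimal stand-alone sketch:

use std::sync::Arc;

fn main() {
    // The Arc pointer itself is one machine word in the map...
    println!("Arc<u64> pointer: {} bytes", std::mem::size_of::<Arc<u64>>());

    // ...but every Arc allocation also carries a strong and a weak counter
    // (8 bytes each) on the heap in front of the value -- 16 bytes of
    // per-entry bookkeeping that disappears if the entry is stored inline.
    let counters = 2 * std::mem::size_of::<std::sync::atomic::AtomicUsize>();
    println!("per-allocation counter overhead: {counters} bytes");
}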

Summary of Changes

Refactor disk flush to avoid cloning entries and minimize read lock contention, enabling future memory optimization by removing Arc from in-memory index entries.

Performance Measurement

The mainnet results indicate that the read lock is held for only ~3K µs, compared with a total update time of 300K–3.5M µs.
This shows that lock contention is minimal: the time spent under the read lock is only a very small fraction of the overall update duration.

[chart: read lock hold time vs. total update time on mainnet]

Fixes #

@HaoranYi HaoranYi force-pushed the disk_flush_no_entry_clone branch from 5a37af8 to cb65e7d on October 3, 2025 17:17
}

-    possible_evictions.insert(0, *k, Arc::clone(v));
+    possible_evictions.insert(0, *k);

My current idea is to capture dirty as an extra value and use that for deciding whether to obtain the entry under the map lock, clear dirty, and write to disk.
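
A self-contained sketch of that idea, using a plain RwLock<HashMap<..>> stand-in rather than the real index types (names are illustrative, not the actual agave code): the scan records only (key, was_dirty), and the map is re-locked briefly per dirty key to clear the flag and copy out the data before the write.

use std::collections::HashMap;
use std::sync::{
    atomic::{AtomicBool, Ordering},
    RwLock,
};

struct Entry {
    dirty: AtomicBool,
    value: u64,
}

// Scan phase: remember only (key, was_dirty) per candidate; no entry clone,
// so no Arc is needed to keep the entry alive outside the lock.
fn gather(map: &RwLock<HashMap<u32, Entry>>) -> Vec<(u32, bool)> {
    let guard = map.read().unwrap();
    guard
        .iter()
        .map(|(k, e)| (*k, e.dirty.load(Ordering::Acquire)))
        .collect()
}

// Flush phase: re-lock briefly only for candidates that were dirty, clear the
// flag, copy out what the write needs, and drop the guard before any I/O.
fn flush(map: &RwLock<HashMap<u32, Entry>>, candidates: &[(u32, bool)]) {
    for &(key, was_dirty) in candidates {
        if !was_dirty {
            continue;
        }
        let to_write = {
            let guard = map.read().unwrap();
            let Some(entry) = guard.get(&key) else { continue };
            entry.dirty.store(false, Ordering::Release);
            entry.value
        }; // read guard dropped here
        println!("write {key} -> {to_write}"); // stand-in for the disk write
    }
}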

Author

Yeah, it is a great idea.

@HaoranYi HaoranYi force-pushed the disk_flush_no_entry_clone branch from cb65e7d to 996d94a on October 3, 2025 18:36
return None;
}
let evictions_age: Vec<_> = {
let map = self.map_internal.read().unwrap();
Author

Here is the tradeoff: the read lock is now held during disk I/O operations...

Author

To alleviate this, we chunk up the items and release the lock after each chunk.
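
Roughly, the chunked flush looks like the stand-alone sketch below (illustrative types only, not the actual agave code; the chunk size of 20 comes from a later commit message):

use std::collections::HashMap;
use std::sync::RwLock;

const CHUNK_SIZE: usize = 20; // chunk size per the later commit message

// Walk the candidates in small chunks, taking the read lock once per chunk
// so writers can make progress between chunks.
fn process_in_chunks(map: &RwLock<HashMap<u32, u64>>, candidates: &[u32]) {
    for chunk in candidates.chunks(CHUNK_SIZE) {
        let guard = map.read().unwrap(); // re-acquired for each chunk
        for key in chunk {
            if let Some(value) = guard.get(key) {
                // ... decide whether this entry still needs to be flushed ...
                let _ = value;
            }
        }
        // guard dropped here; the lock is released between chunks
    }
}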

@HaoranYi HaoranYi force-pushed the disk_flush_no_entry_clone branch 4 times, most recently from b8fb32e to 74ff955 on October 3, 2025 22:16

codecov-commenter commented Oct 3, 2025

Codecov Report

❌ Patch coverage is 29.70297% with 71 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.2%. Comparing base (35c8486) to head (af3fcda).
⚠️ Report is 191 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #8330   +/-   ##
=======================================
  Coverage    83.2%    83.2%           
=======================================
  Files         838      838           
  Lines      368496   368539   +43     
=======================================
+ Hits       306667   306742   +75     
+ Misses      61829    61797   -32     

Process evictions_age_possible in chunks of 20 items instead of holding
the map_internal read lock for the entire eviction set. This reduces
lock contention by releasing and re-acquiring the lock between chunks.
@HaoranYi HaoranYi force-pushed the disk_flush_no_entry_clone branch from 74ff955 to f1d6a0d on October 7, 2025 14:36
@HaoranYi HaoranYi changed the title from "disk_flush_no_entry_clone" to "Refactor disk flush to eliminate entry cloning and reduce read lock contention" on Oct 7, 2025
@HaoranYi HaoranYi force-pushed the disk_flush_no_entry_clone branch from 8526a05 to 7f6cf59 on October 7, 2025 15:42
-    possible_evictions.insert(0, *k, Arc::clone(v));
+    // Capture dirty and ref_count early during scan
+    let is_dirty = v.dirty();
+    let ref_count = v.ref_count();


I think in 99% of cases flush_internal will skip the entry if ref_count != 1:

  • for the dirty case, we re-check the ref_count in should_evict_from_mem and return false for ref_count != 1, so the only situation where we would write to disk with ref_count != 1 is when it changed to 1 between gathering evictions and flushing
  • for the non-dirty case, you actually do the check if *ref_count != 1 { in that branch of flush_internal

So I think we could just skip it here and not put the ref_count into the vector at all.

Author

Yeah, done in f8577c5.


if !should_evict {
// not evicting, so don't write, even if dirty
drop(map);


It might help readability a bit, but at the same time it's kind of confusing that we manually drop the map guard just before return None, which would drop it anyway...

Author

Done in f8577c5.

// Entry was dirty at scan time, need to write to disk
// Lock the map briefly to get the full entry reference
let lock_measure = Measure::start("flush_read_lock");
let map = self.map_internal.read().unwrap();


Call it something like map_read_guard to highlight that we are holding a lock.

Author

Done in 5f62787.

// since we know slot_list.len() == 1, we can create a stack-allocated array for the single element.
let (slot, info) = slot_list[0];
let disk_entry = [(slot, info.into())];
let disk_ref_count = ref_count;


I think a cleaner way to handle the lifetime of the map guard would be to do something like

let (disk_entry, disk_ref_count) = {
    let map_read_guard = ...
    if !... {
        return None;
    }
    ([(slot, info.into())], ref_count)
};
// unconditionally write to disk
loop {
}


Alternatively, we could move the map guard outside of the outer for loop as an Option, to re-use the lock for entries that we skip writing, so something like

let mut map_read_guard = Some(self.map_internal.read().unwrap());
for (k, v) in possible_evictions {
    ..
    if ..should write.. {
        map_read_guard = None;
        ..write..
        map_read_guard = Some(self.map_internal.read().unwrap())
    }
}

Author

Yes, I think option 1 is better in that we yield the read lock like before.
Done in 8c566a0

@HaoranYi HaoranYi marked this pull request as ready for review October 23, 2025 14:31
Optimize eviction candidate filtering by rejecting entries with ref_count != 1
during the initial scan phase, before they reach flush_internal or evict_from_cache.

Changes:
- Updated FlushScanResult to store (Pubkey, bool) instead of (Pubkey, bool, RefCount)
- Modified gather_possible_evictions to filter ref_count != 1 early with clear rationale
- Removed ref_count parameter from PossibleEvictions::insert()
- Simplified flush_internal non-dirty path (no ref_count check needed)
- Removed redundant drop() calls before return statements (locks release automatically)
- Added comments explaining automatic lock release for better readability
- Updated test to match new tuple structure

Rationale:
In 99% of cases, entries with ref_count != 1 will be rejected later by:
- should_evict_from_mem() for dirty entries
- evict_from_cache() for non-dirty entries

By filtering early, we:
1. Reduce unnecessary work processing candidates that will be rejected
2. Avoid write lock contention in evict_from_cache for non-dirty entries
3. Simplify the code by removing redundant checks and explicit drops
Rename the variable holding the read guard from 'map' to 'map_read_guard'
to make it explicit that it's a guard holding a read lock on map_internal.

This improves code readability and follows Rust naming conventions for guards.
Refactor to use a scope block for managing the map_read_guard lifetime,
making the control flow clearer and ensuring locks are released as soon
as data extraction is complete.

Changes:
- Wrapped map access and data extraction in a scope block
- Locks (map_read_guard and slot_list) automatically release at block end
- Removed explicit drop() calls - no longer needed
- Disk write unconditionally happens after lock release
- Inverted clear_dirty() check for early return on non-dirty path

Benefits:
- Cleaner control flow with explicit scope boundaries
- Impossible to accidentally hold locks during disk I/O
- More idiomatic Rust with automatic guard drop
- Easier to understand lock lifetime at a glance
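
The shape of that scope block, as a stand-alone sketch (simplified types; swap stands in for the real clear_dirty, and write_to_disk for the real disk write):

use std::collections::HashMap;
use std::sync::{
    atomic::{AtomicBool, Ordering},
    RwLock,
};

struct Entry {
    dirty: AtomicBool,
    slot: u64,
    info: u64,
}

fn flush_one(map: &RwLock<HashMap<u32, Entry>>, key: u32) -> Option<()> {
    // Everything that needs the map stays inside this block, so the read
    // guard cannot outlive it and can never be held across the disk write.
    let disk_entry = {
        let map_read_guard = map.read().unwrap();
        let entry = map_read_guard.get(&key)?;
        // Inverted dirty check: early return on the non-dirty path.
        if !entry.dirty.swap(false, Ordering::AcqRel) {
            return None;
        }
        [(entry.slot, entry.info)] // stack-allocated single-element "slot list"
    }; // map_read_guard dropped here
    // The write happens unconditionally after the lock is released.
    write_to_disk(key, &disk_entry);
    Some(())
}

fn write_to_disk(key: u32, entry: &[(u64, u64)]) {
    println!("disk write for {key}: {entry:?}"); // stand-in for the real I/O
}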
@HaoranYi HaoranYi requested a review from brooksprumo October 23, 2025 15:28
@brooksprumo

Is this PR running on a validator against mnb that has the disk index enabled?

@brooksprumo brooksprumo requested a review from roryharr October 23, 2025 18:10
@HaoranYi
Author

Is this PR running on a validator against mnb that has the disk index enabled?

Yes, it is running and the ID is 9eNX7h5wHH4GTqESybhHWZVvEX7nzxB6e7Q4J8RfAH2u

@HaoranYi
Author

[chart: read lock time over the past day, dropping from ~3K µs to ~1.5K µs]

Looks like the read lock time was cut down by 50% after all these PR review changes, 3K µs -> 1.5K µs.
The rightmost 30 minutes is when the restart happened.

@HaoranYi HaoranYi changed the title from "Refactor disk flush to eliminate entry cloning and reduce read lock contention" to "Refactor disk flush" on Oct 23, 2025
@brooksprumo

Also, I appreciate the PR title change! IMO when I see "refactor", I interpret that to mean no behavioral change. For this PR though, we are changing behavior quite a bit.

Wdyt about a title like:

"Eliminate entry cloning when flushing index"

@HaoranYi HaoranYi changed the title from "Refactor disk flush" to "Eliminate entry cloning when flushing index" on Oct 23, 2025
Preserve and reword the important comment from the original code that
explains how concurrent modifications are handled when clearing the
dirty flag and writing to disk.

@brooksprumo brooksprumo left a comment


:shipit:


@kskalski kskalski left a comment


Cool, looks good!

@HaoranYi HaoranYi added this pull request to the merge queue Oct 24, 2025
Merged via the queue into anza-xyz:master with commit bdbedd6 Oct 24, 2025
43 checks passed
@HaoranYi HaoranYi deleted the disk_flush_no_entry_clone branch October 24, 2025 13:59