Add parallelized restore capability to ghe-restore-storage #635
Conversation
Sorry for the delay on this. I think it looks good to me 👍.

Have you tried this out with a large amount of data, to get an idea of the improvement? Given this is just rsync'ing data from `/data/user/storage`, you could generate a decent chunk of random data with `dd` and confirm it's doing what you expect.
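For reference, a minimal sketch of generating test data with `dd` along these lines (the target directory, file count, and file names are assumptions for illustration, not part of this PR):

```bash
# Write a handful of 1GB files of random data under the storage path
# (hypothetical layout; real storage objects live in hashed subdirectories).
for i in $(seq 1 6); do
  dd if=/dev/urandom of=/data/user/storage/testfile-$i bs=1M count=1024
done
```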
Was originally bottlenecked by my home connection speed for effectively testing this, but wound up spinning up an EC2 instance with a 5Gbit NIC to remove that problem :) With that said, disk performance on my backup host will still likely be a bottleneck, as I'm seeing transfer and disk speeds plummet the more I use my test instance. Added 6GB worth of 1GB files in random locations for the storage restore.
Time without parallelism (~9 min):
Time with parallelism (3 min 40 sec):
I'll note that I burned a fair bit of time on this. I've also verified that restores to an HA environment still function (with parallelism enabled or disabled) and are otherwise unaffected by these updates 🎉
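For comparison runs like the timings above, a hedged sketch of how one might measure the restore (the hostname is a placeholder and exact flags depend on the backup-utils version in use):

```bash
# Time a verbose restore against the test instance (hostname is a placeholder).
time ghe-restore -v ghe-test.example.com
```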
@maclarel I think you can just install …
This introduces a parallelized restore of `storage` data, with a number of `rsync` threads equal to the number of storage nodes. This is the same logic used for `ghe-restore-repositories`, simply ported over to `ghe-restore-storage`.

For customers that are heavy users of LFS in a clustered environment, this can deliver significant performance improvements, with a reduction in run time roughly proportional to the number of storage nodes. Specifically, `rsync` only utilizes a single thread of `sshd`, so when high transfer speeds are possible it is likely that `sshd` will become CPU bound, resulting in limited transfer speed.

For example, restoring ~5TB of data across 5 storage nodes would complete in approximately 16 hours assuming a transfer speed of 100MB/s (roughly where we see `sshd` become CPU bound), as the restores would be run sequentially. Assuming sufficient bandwidth for transfers at 500MB/s (achievable on a 10Gbit connection), this could reduce the overall time to approximately 3 hours, as all 5 `rsync` invocations would run simultaneously and would utilize 1 thread per server, effectively quintupling performance.

The verbose log confirms that all 3 transfers are being kicked off at the same time, which aligns with what is seen for `ghe-restore-repositories` behaviour:
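As an illustration of the pattern only (not the actual `ghe-restore-storage` code), here is a minimal sketch of launching one `rsync` per storage host in the background and waiting for all of them; the hostnames, snapshot path variable, and remote paths are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical host list; the real script derives this from the cluster config.
hosts="storage-node-1 storage-node-2 storage-node-3"

# GHE_RESTORE_SNAPSHOT_PATH is a placeholder for the local snapshot directory.
for host in $hosts; do
  # One rsync (and therefore one sshd on the remote side) per storage node,
  # started in the background so the transfers proceed in parallel.
  rsync -az --delete "$GHE_RESTORE_SNAPSHOT_PATH/storage/$host/" \
    "admin@$host:/data/user/storage/" &
done

# Block until every background rsync has finished before continuing the restore.
wait
```

The point of the one-process-per-node arrangement is that each remote `sshd` stays within its single-core ceiling while aggregate throughput scales with the number of storage nodes.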