Add parallelized restore capability to ghe-restore-storage #635
Conversation
Sorry for the delay on this. I think it looks good to me 👍.

Have you tried this out with a large amount of data, to get an idea of the improvement? Given this is just rsync'ing data from `/data/user/storage`, you could generate a decent chunk of random data with `dd` and confirm it's doing what you expect.
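For reference, a minimal sketch of generating test data with `dd` along these lines (the target directory, file count, and file names are assumptions for illustration, not part of this PR):

```bash
# Write a handful of 1GB files of random data under the storage path
# (hypothetical layout; real storage objects live in hashed subdirectories).
for i in $(seq 1 6); do
  dd if=/dev/urandom of=/data/user/storage/testfile-$i bs=1M count=1024
done
```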
Was originally bottlenecked by my home connection speed for effectively testing this, but wound up spinning up an EC2 instance with a 5Gbit NIC to remove that problem :) With that said, disk performance on my backup host will still likely be a bottleneck, as I'm seeing transfer and disk speeds plummet the more I use my test instance. Added 6GB worth of 1GB files in random locations for the storage restore.
Time without parallelism (~9 min):
Time with parallelism (3 min 40 sec):
I'll note that I burned a fair bit of time on this. I've also verified that restores to an HA environment still function (with parallelism enabled or disabled) and are otherwise unaffected by these updates 🎉
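For comparison runs like the timings above, a hedged sketch of how one might measure the restore (the hostname is a placeholder and exact flags depend on the backup-utils version in use):

```bash
# Time a verbose restore against the test instance (hostname is a placeholder).
time ghe-restore -v ghe-test.example.com
```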
@maclarel I think you can just install …
This introduces a parallelized restore of `storage` data, with a number of `rsync` threads equal to the number of storage nodes. This is the same logic used for `ghe-restore-repositories`, simply ported over to `ghe-restore-storage`.

For customers that are heavy users of LFS in a clustered environment, this can deliver significant performance improvements, with a reduction in run time roughly proportional to the number of storage nodes. Specifically, `rsync` only utilizes a single thread of `sshd`, so when high transfer speeds are possible it is likely that `sshd` will become CPU bound, resulting in limited transfer speed.

For example, restoring ~5TB of data across 5 storage nodes would complete in approximately 16 hours assuming a transfer speed of 100MB/s (roughly where we see `sshd` become CPU bound), as the restores would be run sequentially. Assuming sufficient bandwidth for transfers at 500MB/s (achievable on a 10Gbit connection), this could reduce the overall time to approximately 3 hours, as all 5 `rsync` invocations would run simultaneously and would utilize 1 thread per server, effectively quintupling performance.

The verbose log confirms that all 3 transfers are being kicked off at the same time, which aligns with what is seen for `ghe-restore-repositories` behaviour:
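As an illustration of the pattern only (not the actual `ghe-restore-storage` code), here is a minimal sketch of launching one `rsync` per storage host in the background and waiting for all of them; the hostnames, snapshot path variable, and remote paths are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical host list; the real script derives this from the cluster config.
hosts="storage-node-1 storage-node-2 storage-node-3"

# GHE_RESTORE_SNAPSHOT_PATH is a placeholder for the local snapshot directory.
for host in $hosts; do
  # One rsync (and therefore one sshd on the remote side) per storage node,
  # started in the background so the transfers proceed in parallel.
  rsync -az --delete "$GHE_RESTORE_SNAPSHOT_PATH/storage/$host/" \
    "admin@$host:/data/user/storage/" &
done

# Block until every background rsync has finished before continuing the restore.
wait
```

The point of the one-process-per-node arrangement is that each remote `sshd` stays within its single-core ceiling while aggregate throughput scales with the number of storage nodes.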