Fix ghe-backup-repositories performance for large instances #541


Merged
merged 30 commits into stable from fix-backup-timing
Jan 31, 2020

Conversation


@piki piki commented Jan 30, 2020

#524 introduced a performance regression in ghe-backup-repositories. This PR speeds up the code introduced in #524 by roughly 3000x. It also fixes a future bug, in which GitHub Enterprise versions like 2.20.x and 3.x.y would have received the deprecated behavior. Finally, the PR adds performance and correctness tests to verify the changes.

Performance: Prior to the PR, the parse_paths step took 80 seconds to process 10k lines, because parse_paths forked and executed awk, grep, and sometimes dirname for each line of the file. At one large customer, this added over three hours to ghe-backup invocations, just doing text processing in the shell. We believe other customers will be affected as well, in proportion to the number of repos, wikis, and gists they are backing up.

We eliminated the redundant calls to version (which invokes awk), which brought the 10k-line time down to 20 seconds. Replacing the grep call with a shell-intrinsic string match brought the time under 10 seconds. Still not good enough.
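
As an illustration of the grep change (the variable name and pattern below are placeholders, not the PR's exact code):

echo "$path" | grep -q gist     # before: forks grep for every input line
[[ "$path" == *gist* ]]         # after: bash string match, no fork at all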

So ultimately we rewrote parse_paths as an inline Ruby script, to completely eliminate fork and exec calls. Each 10k lines now takes around 0.03 seconds, which will remove all but ~3 seconds of the 3-hour overhead at the customer who reported the bug.
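
The rough shape of that inline-Ruby filter is sketched below. This is not the exact code from the PR: the function name is taken from the diff later in this thread, the 2.19.3 threshold from the discussion of future versions, and $GHE_REMOTE_VERSION is assumed to carry the appliance version string.

fix_paths_for_ghe_version() {
  # A single Ruby process filters every line, so there is no per-line
  # fork/exec of awk, grep, or dirname.
  ruby -e '
    have = ARGV[0].split(".").map(&:to_i)        # "2.20.1" -> [2, 20, 1]
    modern = (have <=> [2, 19, 3]) >= 0          # element-by-element comparison
    STDIN.each_line do |line|
      path = line.chomp
      # which branch applies the dirname step is an assumption in this sketch
      puts(modern ? File.dirname(path) : path)
    end
  ' "$GHE_REMOTE_VERSION"
}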

Evgenii Khramkov and others added 13 commits January 30, 2020 11:11
Assert that 100,000 lines (instead of 10,000 lines) can complete in 2 seconds.
The inline Ruby wanted versions like 2.3.4, but it was getting int-mapped
versions like 2003004000.  That broke the logic in the inline Ruby
completely, so everything was treated as having a major version wayyyy
above 3, hence "modern".

This is why we unit-test.
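
For context, the behavior described above comes from the shell version helper, which int-maps a release string so versions compare numerically; a sketch of that mapping follows (the real helper in ghe-backup-config may differ in detail):

version() {
  # "2.3.4" -> 2003004000, "2.19.3" -> 2019003000: each component after the
  # major is zero-padded to three digits, so plain numeric comparison works.
  echo "${@//[^0-9.]/}" | awk -F. '{ printf("%d%03d%03d%03d\n", $1, $2, $3, $4); }'
}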
@piki piki requested a review from lildude January 30, 2020 22:16
@piki
Author

piki commented Jan 30, 2020

cc coauthors @ewgenius and @oakeyc
cc @lildude for review as the historical owner of much of this code

Evgenii Khramkov and others added 7 commits January 30, 2020 14:32
It's not installed on either Mac or Linux CI boxes
This reverts commit f1bb5af.

DRY.  Not a big deal either way.
This reverts commit 7411333.

We don't need coreutils.  I used $PWD to eliminate the need for
`realpath`, and I've implemented `timeout` for OSX.  If the Linux CI is
missing `timeout`, I'll push another commit to let it use the OSX
implementation.
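
A minimal sketch of the $PWD substitution, assuming all that's needed is an absolute path to an existing directory (the function name is hypothetical):

abs_dir() {
  # cd in a subshell and print $PWD: no GNU realpath required, and the
  # caller's working directory is left untouched.
  (cd "$1" && echo "$PWD")
}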
Patrick Reynolds added 2 commits January 30, 2020 20:46
We can't rely on the customer's admin box (which isn't the appliance)
having any actual programming languages on it.  But we can use sed, and
sed is good enough to simulate dirname with a single fork+exec.

This commit also fixes future versions: 2.20.x, 3.x, etc.  All future
versions will be >= 2.19.3, so we just combine the 2.19 check with the
future-versions check.
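
An illustration of the sed-as-dirname idea (the exact sed expression in the PR may differ):

# One sed process strips the last path component from every line, instead of
# one dirname fork per line:
find * -mindepth 5 -maxdepth 6 -type d -name \*.git | sed 's@/[^/]*$@@'
# e.g. a/b/c/nw/123/456.git -> a/b/c/nw/123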
The shell `version` function treats garbage as equivalent to 0.0.0, rather
than throwing exceptions and exiting with an error code, as the Ruby
implementation did.  So roll the invalid-input tests into the
use-the-old-version test.
cat $tempdir/*.rsync | sort | uniq > $tempdir/source_routes
(cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | parse_paths | sort | uniq) > $tempdir/destination_routes
(cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | fix_paths_for_ghe_version | sort | uniq) > $tempdir/destination_routes
Member

@snh snh Jan 31, 2020


A minor optimisation suggestion:

Suggested change
(cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | fix_paths_for_ghe_version | sort | uniq) > $tempdir/destination_routes
(cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | sort | uniq | fix_paths_for_ghe_version) > $tempdir/destination_routes

Edit: Nevermind, realised that this ordering is important to maintain uniqueness after the dirname equivalent occurs.

Author

@piki piki Jan 31, 2020


Thanks for the nudge. I can't exactly use your suggestion, but this might be faster: s/sort | uniq/sort -u/. If you tell sort it can discard all the duplicate keys, then it has considerably less data to work with and can use both less RAM and less CPU. We may be able to get this step under the 100 seconds it took prior to #524.
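
Concretely, for the first line of this hunk, the substitution would be:

cat $tempdir/*.rsync | sort | uniq > $tempdir/source_routes   # current: separate uniq pass
cat $tempdir/*.rsync | sort -u > $tempdir/source_routes       # proposed: sort drops duplicates as it merges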

Author


Eh, maybe just on Mac. I tried it on Linux and didn't get the same result.

However, another trick that applies here: uniq | sort | uniq. The data we're sorting has sequences of repeated directories any time there are forks in a network. Running uniq as a pre-filter reduces the amount of data that sort has to sort. We do still need the uniq at the end, in case the duplicates aren't (fully) clustered in the input.
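
Sketched against the destination_routes line above (the final hunk may differ slightly):

# uniq before sort drops the duplicates that fix_paths_for_ghe_version clusters
# together; the trailing uniq catches any duplicates that were not adjacent.
(cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git \
  | fix_paths_for_ghe_version | uniq | sort | uniq) > $tempdir/destination_routes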

Member


👍 That looks like a great approach, I hadn't considered using uniq twice like that.

Member

@snh snh left a comment


One suggestion to optimise this a tiny bit further, but apart from that, looks like a great improvement ✨

The data we're sorting has clusters of duplicates in the input, because
`dirname` reduces all repos in the same network (i.e., forks) to the same
network path.  Running `uniq` before `sort` eliminates those duplicates,
which means `sort` requires less CPU and RAM to do its thing.

We still need `uniq` on the output end, because there's no guarantee that
all duplicates in the input are clustered.

I've run tests, and the cost of `uniq` is small enough that it does no
harm if the input has no duplicates at all.
@lildude
Member

lildude commented Jan 31, 2020

I’ll take a closer look later when I’m not on mobile, but one big issue with this PR: it introduces a requirement on Ruby which is currently not documented. This could be a major, unexpected, and undocumented breaking change for our customers.

timeout() {
  ruby -rtimeout -e 'duration = ARGV.shift.to_i; Timeout::timeout(duration) { system(*ARGV) }' "$@"
}
fi
Member


🤔 this is pulling in a pretty large (time, effort and bytes of code) dependency on Ruby (which won't ship with macOS in future) just to run tests on macOS hosts which may not have coreutils installed. We're already assuming users have coreutils installed because of our Linux assumptions. How about making coreutils a macOS requirement too and remove this function?

Evgenii Khramkov and others added 2 commits January 31, 2020 12:58
@ewgenius ewgenius added the bug label Jan 31, 2020
@ewgenius ewgenius merged commit c5c4ddd into stable Jan 31, 2020
@ewgenius ewgenius deleted the fix-backup-timing branch January 31, 2020 18:14
This was referenced Jan 31, 2020
dooleydevin added a commit that referenced this pull request Oct 2, 2023
@dooleydevin dooleydevin mentioned this pull request Oct 2, 2023