Fix `ghe-backup-repositories` performance for large instances #541
Conversation
Assert that 100,000 lines (instead of 10,000 lines) can complete in 2 seconds.
The inline Ruby wanted versions like 2.3.4, but it was getting int-mapped versions like 2003004000. That broke the logic in the inline Ruby completely, so everything was treated as having a major version wayyyy above 3, hence "modern". This is why we unit-test.
It's not installed on either the Mac or the Linux CI boxes.
This reverts commit f1bb5af. DRY. Not a big deal either way.
This reverts commit 7411333. We don't need coreutils. I used $PWD to eliminate the need for `realpath`, and I've implemented `timeout` for OSX. If the Linux CI is missing `timeout`, I'll push another commit to let it use the OSX implementation.
…to fix-backup-timing
We can't rely on the customer's admin box (which isn't the appliance) having any actual programming languages on it. But we can use sed, and sed is good enough to simulate dirname with a single fork+exec. This commit also fixes future versions: 2.20.x, 3.x, etc. All future versions will be >= 2.19.3, so we just combine the 2.19 check with the future-versions check.
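For reference, a `sed` substitution along these lines can stand in for `dirname` with a single process for the whole stream (a sketch; the exact expression and path layout in the commit may differ):

```bash
# Strip the final path component from every line; one sed process replaces
# what would otherwise be one dirname fork+exec per line. The path below is
# made up for illustration.
printf '%s\n' 'repositories/0/nw/12/34/56/789/1.git' |
  sed 's@/[^/]*$@@'
# prints: repositories/0/nw/12/34/56/789
```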
The shell `version` function treats garbage as equivalent to 0.0.0, rather than throwing exceptions and exiting with an error code, as the Ruby implementation did. So roll the invalid-input tests into the use-the-old-version test.
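For context, the shell `version` helper is essentially a dotted-version-to-integer mapping. A minimal sketch of that shape (not the exact code in the backup-utils shell library):

```bash
# Maps "2.3.4" to 2003004000 so versions can be compared numerically.
# Non-numeric characters are stripped first, so garbage collapses to 0 (0.0.0)
# instead of raising an error the way the old Ruby implementation did.
version() {
  echo "${1//[^0-9.]/}" | awk -F. '{ printf("%d%03d%03d%03d\n", $1, $2, $3, $4) }'
}

version 2.3.4    # -> 2003004000
version banana   # -> 0000000000 (treated as 0.0.0)
```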
  cat $tempdir/*.rsync | sort | uniq > $tempdir/source_routes
- (cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | parse_paths | sort | uniq) > $tempdir/destination_routes
+ (cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | fix_paths_for_ghe_version | sort | uniq) > $tempdir/destination_routes
A minor optimisation suggestion:
- (cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | fix_paths_for_ghe_version | sort | uniq) > $tempdir/destination_routes
+ (cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | sort | uniq | fix_paths_for_ghe_version) > $tempdir/destination_routes
Edit: Nevermind, realised that this ordering is important to maintain uniqueness after the `dirname` equivalent occurs.
Thanks for the nudge. I can't exactly use your suggestion, but this might be faster: `s/sort | uniq/sort -u/`. If you tell `sort` it can discard all the duplicate keys, then it has considerably less data to work with and can use both less RAM and less CPU. We may be able to get this step under the 100 seconds it took prior to #524.
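Illustratively (toy input; both pipelines print the same result, but `sort -u` never hands the duplicates to a second process):

```bash
printf '%s\n' b a b a | sort | uniq   # sort keeps all four lines, then uniq collapses them
printf '%s\n' b a b a | sort -u       # sort discards duplicates as it sorts
```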
Eh, maybe just on Mac. I tried it on Linux and didn't get the same result.
However, another trick that applies here: `uniq | sort | uniq`. The data we're sorting has sequences of repeated directories any time there are forks in a network. Running `uniq` as a pre-filter reduces the amount of data that `sort` has to sort. We do still need the `uniq` at the end, in case the duplicates aren't (fully) clustered in the input.
👍 That looks like a great approach, I hadn't considered using `uniq` twice like that.
One suggestion to optimise this a tiny bit further, but apart from that, looks like a great improvement ✨
The data we're sorting has clusters of duplicates in the input, because `dirname` reduces all repos in the same network (i.e., forks) to the same network path. Running `uniq` before `sort` eliminates those duplicates, which means `sort` requires less CPU and RAM to do its thing. We still need `uniq` on the output end, because there's no guarantee that all duplicates in the input are clustered. I've run tests, and the cost of `uniq` is small enough that it does no harm if the input has no duplicates at all.
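A toy illustration of that pipeline shape (the paths are invented; the real pipeline is the `destination_routes` step above):

```bash
# Forks in the same network reduce to the same network directory, so the
# input arrives with runs of identical, adjacent lines.
printf '%s\n' \
  'repositories/0/nw/12/34/56/789' \
  'repositories/0/nw/12/34/56/789' \
  'repositories/1/nw/98/76/54/321' |
  uniq |   # cheap pre-filter: collapses the adjacent duplicates before sorting
  sort |   # sort now sees fewer lines, so it needs less CPU and RAM
  uniq     # final pass, for duplicates that were not adjacent in the input
```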
I’ll take a closer look later when I’m not on mobile, but one big issue with this PR: it introduces a requirement on Ruby which is currently not documented. This could be a major, unexpected, and undocumented breaking change for our customers.
test/test-ghe-backup.sh (outdated)
  timeout() {
    ruby -rtimeout -e 'duration = ARGV.shift.to_i; Timeout::timeout(duration) { system(*ARGV) }' "$@"
  }
fi
🤔 this is pulling in a pretty large (time, effort and bytes of code) dependency on Ruby (which won't ship with macOS in future) just to run tests on macOS hosts which may not have `coreutils` installed. We're already assuming users have `coreutils` installed because of our Linux assumptions. How about making `coreutils` a macOS requirement too and removing this function?
We brought back coreutils, which provides `timeout`.
#524 introduced a performance regression in `ghe-backup-repositories`. This PR speeds up the code introduced in #524 by roughly 3000x. It also fixes a future bug, in which GitHub Enterprise versions like 2.20.x and 3.x.y would have received the deprecated behavior. Finally, the PR adds performance and correctness tests to verify the changes.

Performance: Prior to the PR, the `parse_paths` step took 80 seconds to process 10k lines, due to `parse_paths` forking and executing `awk`, `grep`, and sometimes `dirname` for each line of the file. At one large customer, this added over three hours to `ghe-backup` invocations, just doing text processing in the shell. We believe other customers will be affected as well, proportional to the number of repos, wikis, and gists they are backing up.

We eliminated the redundant calls to `version` (which invokes `awk`), which brought the 10k-line time down to 20 seconds. Replacing the `grep` call with a shell-intrinsic string match brought the time under 10 seconds. Still not good enough.

So ultimately we rewrote `parse_paths` as an inline Ruby script, to completely eliminate fork and exec calls. Processing 10k lines now takes around 0.03 seconds, which will remove all but ~3 seconds of the 3-hour overhead at the customer who reported the bug.
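As a rough sketch of the before/after described here (the function names and the Ruby one-liner are illustrative, not the merged code, and later commits in this conversation swap the Ruby out for `sed`):

```bash
# Before: every line of find output costs at least one fork+exec.
parse_paths_slow() {
  while read -r path; do
    if echo "$path" | grep -q '\.git$'; then   # one grep fork per line
      dirname "$path"                          # one dirname fork per line
    fi
  done
}

# After: a single child process handles the whole stream, so the
# per-line fork+exec overhead disappears.
parse_paths_fast() {
  ruby -ne 'puts File.dirname($_.chomp) if $_.chomp.end_with?(".git")'
}
```

The intermediate step mentioned above, replacing `grep` with a shell-intrinsic match such as `[[ $path == *.git ]]`, avoids one of the per-line forks but still pays for the per-line loop itself.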