Fix ghe-backup-repositories performance for large instances #541


Merged
merged 30 commits into stable from fix-backup-timing
Jan 31, 2020

Conversation


@piki piki commented Jan 30, 2020

#524 introduced a performance regression in ghe-backup-repositories. This PR speeds up the code introduced in #524 by roughly 3000x. It also fixes a future bug, in which GitHub Enterprise versions like 2.20.x and 3.x.y would have received the deprecated behavior. Finally, the PR adds performance and correctness tests to verify the changes.

Performance: Prior to the PR, the parse_paths step took 80 seconds to process 10k lines, because parse_paths forked and executed awk, grep, and sometimes dirname for each line of the file. At one large customer, this added over three hours to ghe-backup invocations, just doing text processing in the shell. We believe other customers will be affected as well, in proportion to the number of repos, wikis, and gists they are backing up.

We eliminated the redundant calls to version (which invokes awk), which brought the 10k-line time down to 20 seconds. Replacing the grep call with a shell-intrinsic string match brought the time under 10 seconds. Still not good enough.
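
As an illustration of the grep change (the variable name and pattern below are placeholders, not the PR's exact code):

echo "$path" | grep -q gist     # before: forks grep for every input line
[[ "$path" == *gist* ]]         # after: bash string match, no fork at all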

So ultimately we rewrote parse_paths as an inline Ruby script, to completely eliminate fork and exec calls. Each 10k lines now takes around 0.03 seconds, which will remove all but ~3 seconds of the 3-hour overhead at the customer who reported the bug.
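
The rough shape of that inline-Ruby filter is sketched below. This is not the exact code from the PR: the function name is taken from the diff later in this thread, the 2.19.3 threshold from the discussion of future versions, and $GHE_REMOTE_VERSION is assumed to carry the appliance version string.

fix_paths_for_ghe_version() {
  # A single Ruby process filters every line, so there is no per-line
  # fork/exec of awk, grep, or dirname.
  ruby -e '
    have = ARGV[0].split(".").map(&:to_i)        # "2.20.1" -> [2, 20, 1]
    modern = (have <=> [2, 19, 3]) >= 0          # element-by-element comparison
    STDIN.each_line do |line|
      path = line.chomp
      # which branch applies the dirname step is an assumption in this sketch
      puts(modern ? File.dirname(path) : path)
    end
  ' "$GHE_REMOTE_VERSION"
}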

Evgenii Khramkov and others added 13 commits January 30, 2020 11:11
Assert that 100,000 lines (instead of 10,000 lines) can complete in 2 seconds.
The inline Ruby wanted versions like 2.3.4, but it was getting int-mapped
versions like 2003004000.  That broke the logic in the inline Ruby
completely, so everything was treated as having a major version wayyyy
above 3, hence "modern".

This is why we unit-test.
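
For context, the behavior described above comes from the shell version helper, which int-maps a release string so versions compare numerically; a sketch of that mapping follows (the real helper in ghe-backup-config may differ in detail):

version() {
  # "2.3.4" -> 2003004000, "2.19.3" -> 2019003000: each component after the
  # major is zero-padded to three digits, so plain numeric comparison works.
  echo "${@//[^0-9.]/}" | awk -F. '{ printf("%d%03d%03d%03d\n", $1, $2, $3, $4); }'
}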
@piki piki requested a review from lildude January 30, 2020 22:16
@piki
Author

piki commented Jan 30, 2020

cc coauthors @ewgenius and @oakeyc
cc @lildude for review as the historical owner of much of this code

Evgenii Khramkov and others added 7 commits January 30, 2020 14:32
It's not installed on either Mac or Linux CI boxes
This reverts commit f1bb5af.

DRY.  Not a big deal either way.
This reverts commit 7411333.

We don't need coreutils.  I used $PWD to eliminate the need for
`realpath`, and I've implemented `timeout` for OSX.  If the Linux CI is
missing `timeout`, I'll push another commit to let it use the OSX
implementation.
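
A minimal sketch of the $PWD substitution, assuming all that's needed is an absolute path to an existing directory (the function name is hypothetical):

abs_dir() {
  # cd in a subshell and print $PWD: no GNU realpath required, and the
  # caller's working directory is left untouched.
  (cd "$1" && echo "$PWD")
}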
Patrick Reynolds added 2 commits January 30, 2020 20:46
We can't rely on the customer's admin box (which isn't the appliance)
having any actual programming languages on it.  But we can use sed, and
sed is good enough to simulate dirname with a single fork+exec.

This commit also fixes future versions: 2.20.x, 3.x, etc.  All future
versions will be >= 2.19.3, so we just combine the 2.19 check with the
future-versions check.
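
An illustration of the sed-as-dirname idea (the exact sed expression in the PR may differ):

# One sed process strips the last path component from every line, instead of
# one dirname fork per line:
find * -mindepth 5 -maxdepth 6 -type d -name \*.git | sed 's@/[^/]*$@@'
# e.g. a/b/c/nw/123/456.git -> a/b/c/nw/123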
The shell `version` function treats garbage as equivalent to 0.0.0, rather
than throwing exceptions and exiting with an error code, as the Ruby
implementation did.  So roll the invalid-input tests into the
use-the-old-version test.
cat $tempdir/*.rsync | sort | uniq > $tempdir/source_routes
(cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | parse_paths | sort | uniq) > $tempdir/destination_routes
(cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | fix_paths_for_ghe_version | sort | uniq) > $tempdir/destination_routes
Member

@snh snh Jan 31, 2020


A minor optimisation suggestion:

Suggested change
(cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | fix_paths_for_ghe_version | sort | uniq) > $tempdir/destination_routes
(cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git | sort | uniq | fix_paths_for_ghe_version) > $tempdir/destination_routes

Edit: Nevermind, realised that this ordering is important to maintain uniqueness after the dirname equivalent occurs.

Author

@piki piki Jan 31, 2020


Thanks for the nudge. I can't exactly use your suggestion, but this might be faster: s/sort | uniq/sort -u/. If you tell sort it can discard all the duplicate keys, then it has considerably less data to work with and can use both less RAM and less CPU. We may be able to get this step under the 100 seconds it took prior to #524.
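
Concretely, for the first line of this hunk, the substitution would be:

cat $tempdir/*.rsync | sort | uniq > $tempdir/source_routes   # current: separate uniq pass
cat $tempdir/*.rsync | sort -u > $tempdir/source_routes       # proposed: sort drops duplicates as it merges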

Author


Eh, maybe just on Mac. I tried it on Linux and didn't get the same result.

However, another trick that applies here: uniq | sort | uniq. The data we're sorting has sequences of repeated directories any time there are forks in a network. Running uniq as a pre-filter reduces the amount of data that sort has to sort. We do still need the uniq at the end, in case the duplicates aren't (fully) clustered in the input.
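
Sketched against the destination_routes line above (the final hunk may differ slightly):

# uniq before sort drops the duplicates that fix_paths_for_ghe_version clusters
# together; the trailing uniq catches any duplicates that were not adjacent.
(cd $backup_dir/ && find * -mindepth 5 -maxdepth 6 -type d -name \*.git \
  | fix_paths_for_ghe_version | uniq | sort | uniq) > $tempdir/destination_routes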

Member


👍 That looks like a great approach, I hadn't considered using uniq twice like that.

Member

@snh snh left a comment


One suggestion to optimise this a tiny bit further, but apart from that, looks like a great improvement ✨

The data we're sorting has clusters of duplicates in the input, because
`dirname` reduces all repos in the same network (i.e., forks) to the same
network path.  Running `uniq` before `sort` eliminates those duplicates,
which means `sort` requires less CPU and RAM to do its thing.

We still need `uniq` on the output end, because there's no guarantee that
all duplicates in the input are clustered.

I've run tests, and the cost of `uniq` is small enough that it does no
harm if the input has no duplicates at all.
@lildude
Member

lildude commented Jan 31, 2020

I’ll take a closer look later when I’m not on mobile, but one big issue with this PR: it introduces a requirement on Ruby which is currently not documented. This could be a major, unexpected, and undocumented breaking change for our customers.

timeout() {
  ruby -rtimeout -e 'duration = ARGV.shift.to_i; Timeout::timeout(duration) { system(*ARGV) }' "$@"
}
fi
Member


🤔 this is pulling in a pretty large (time, effort and bytes of code) dependency on Ruby (which won't ship with macOS in future) just to run tests on macOS hosts which may not have coreutils installed. We're already assuming users have coreutils installed because of our Linux assumptions. How about making coreutils a macOS requirement too and remove this function?

Evgenii Khramkov and others added 2 commits January 31, 2020 12:58
@ewgenius ewgenius added the bug label Jan 31, 2020
@ewgenius ewgenius merged commit c5c4ddd into stable Jan 31, 2020
@ewgenius ewgenius deleted the fix-backup-timing branch January 31, 2020 18:14
This was referenced Jan 31, 2020
dooleydevin added a commit that referenced this pull request Oct 2, 2023
@dooleydevin dooleydevin mentioned this pull request Oct 2, 2023