Mac presubmit queues are out of SLO #114656

Closed
CaseyHillers opened this issue Nov 4, 2022 · 24 comments
Labels
team-infra Owned by Infrastructure team

Comments

@CaseyHillers
Contributor

I've had several PRs in the past 2 weeks go out of SLO (stuck in the queue for 2 hours). I don't send PRs during peak hours, so I expect this is worse for those contributing during those hours.

Example PR: #114646 (first commit was queued for 2 hours)

@CaseyHillers CaseyHillers added the team-infra Owned by Infrastructure team label Nov 4, 2022
@godofredoc
Contributor

2h is probably too much, but a large increase in execution/queue time is expected after #113539, where we explicitly clean the local Xcode cache.

@godofredoc
Contributor

Setting as P1. It seems like Mac capacity is at 100% utilization from 6:00 AM PST to 5:00 PM PST.

As a next step, we need to identify whether this is normal usage or something strange is happening.

@godofredoc
Contributor

From the data, it seems we've always had 30-minute peaks of 100% utilization at ~9:00 AM PST, but we started to have 3+ peaks per day on 10/17/2022. Things got really bad on 10/24/2022, when we started seeing 100% utilization from 9:00 to 5:00.

@keyonghan
Contributor

keyonghan commented Nov 4, 2022

The local cache cleanup happened on 10/19 (#113729), the queue time (90th%) was <1min for 10/19-10/24.
The queue time started increasing from 10/25
10/25-10/26: 6.06 min
10/26-10/27: 60.5 min
10/17-10/28: 58.68 min
10/28-11/04: 85.27 min

@keyonghan
Contributor

keyonghan commented Nov 4, 2022

One correlated change per the timing is the quota increase (cl/483782210, which landed on 10/25). @godofredoc Are you comfortable with a revert?

@godofredoc
Contributor

> One correlated change per the timing is the quota increase (cl/483782210). @godofredoc Are you comfortable with a revert?

That change is supposed to help. Can we check whether the repo sync time has increased? If cold checkouts are >7m then please do not revert.

@godofredoc
Contributor

This is

> 0/17-10/28: 58.68 min

> The local cache cleanup happened on 10/19 (#113729), the queue time (90th%) was <1min for 10/19-10/24.
> The queue time started increasing from 10/25
> 10/25-10/26: 6.06 min
> 10/26-10/27: 60.5 min
> 10/17-10/28: 58.68 min
> 10/28-11/04: 85.27 min

Can you please post daily data for: 10/25, 10/27, 11/03?

@godofredoc
Contributor

That is expected; cl/483782210 fixes the peak quota issue, with checkouts taking 40 mins in try.

5-7 mins is expected in cold checkouts. If we are getting consistent 5-7 min checkouts, then that is a bug that needs to be fixed.

@keyonghan
Contributor

> Can you please post daily data for: 10/25, 10/27, 11/03?

10/25: 6 min
10/27: 58 min
11/03: 100 min

@godofredoc
Contributor

godofredoc commented Nov 4, 2022

Are these 90th percentiles? Can we also get the 50th?

@keyonghan
Contributor

keyonghan commented Nov 4, 2022

Here is the comparison between the 90th and 50th percentile queue times (in minutes), plus the number of builds:

Date    90th   50th   No. of builds
10/24     19      6            1605
10/25      6      0            1192
10/26     60     15            1983
10/27     58     21            2046
11/03    100      7            2493

The queue time correlates with the No. of builds
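
As a rough check on that correlation, here is a minimal Python sketch that computes the Pearson coefficient between the daily build count and each percentile column, using only the five rows posted above. Five data points are far too few for a firm conclusion, and the `pearson` helper is written here for illustration; it is not taken from any Flutter infra tooling.

```python
import math

# Rows copied from the table above:
# (date, 90th percentile queue min, 50th percentile queue min, number of builds)
rows = [
    ("10/24", 19, 6, 1605),
    ("10/25", 6, 0, 1192),
    ("10/26", 60, 15, 1983),
    ("10/27", 58, 21, 2046),
    ("11/03", 100, 7, 2493),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally sized sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


builds = [row[3] for row in rows]
r_p90 = pearson(builds, [row[1] for row in rows])
r_p50 = pearson(builds, [row[2] for row in rows])

print(f"builds vs 90th percentile: r = {r_p90:.2f}")
print(f"builds vs 50th percentile: r = {r_p50:.2f}")
```

On these rows the 90th percentile follows the build count far more closely than the 50th does.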

@keyonghan
Contributor

When did you enable the Mac engine v2 builders?

@godofredoc
Contributor

Nice, in this case the 50th percentile is giving us a better signal.

Higher queue times are expected in engine_v2 under the current conditions because they require multiple bots to complete.
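
To see why the two percentiles can tell different stories, here is a small sketch using the nearest-rank definition. The queue times are made up for illustration (they are not from the Flutter dashboards): most builds wait a few minutes, while a couple of multi-bot builds wait much longer.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical queue times in minutes: eight quick builds, two slow multi-bot builds.
queue_minutes = [2, 3, 4, 5, 6, 7, 8, 9, 95, 110]

p50 = percentile(queue_minutes, 50)
p90 = percentile(queue_minutes, 90)
print(f"50th percentile: {p50} min, 90th percentile: {p90} min")
```

With a long tail like this, the 90th percentile is set by the slow builds alone, while the 50th percentile reflects what a typical build sees.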

The decrease in the 50th percentile, which I believe most of the legacy builds will fit into (all except the engine_v2, web engine, and fuchsia builders), was caused by the following PRs:

@godofredoc
Contributor

godofredoc commented Nov 4, 2022

We are probably making things a bit worse for a short period of time (~4h) while we re-allocate Mac machines to different pools.

The plan to address the queue time is as follows:

  • Fix checkout caches in Mac builders
    • Do not clean source code caches by default in Web Engine builds
    • Do not clean source code caches by default in engine_arm builds
  • Ensure https://flutter.googlesource.com/recipes/+/e02f2b41ba98db6d230fee81689e4a6e89526453 is cherry-picked (CP) to the stable and beta versions of recipes
  • Remove the flag to always clean the Xcode caches. The caches of all the machines should be clean by now, and some logic has been added to clean up the cache only when using simulator runtimes.
  • Use Linux machines for orchestrators in engine_v2
  • Move some machines from staging to try now that engine_v2 builders have moved to prod

@godofredoc
Contributor

@khyati82 this is the impact we were expecting during the transition of legacy to engine_v2.

@keyonghan
Contributor

https://chrome-internal-review.googlesource.com/c/infradata/config/+/5074772 to move half (8) bots from staging to try.

@keyonghan
Contributor

keyonghan commented Nov 4, 2022

> Use linux machines for orchestrators in engine_v2

Do you mean to use linux VMs to host Mac engine_v2 builders? @godofredoc

@godofredoc
Contributor

> Use linux machines for orchestrators in engine_v2

> Do you mean to use linux VMs to host Mac engine_v2 builders? @godofredoc

The current implementation will make this look weird. Basically, I meant changing https://github.com/flutter/engine/blob/main/.ci.yaml#L327 to use a Linux machine rather than a Mac, which I believe requires changing Mac mac_android_aot_engine to Linux mac_android_aot_engine.
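
A tiny sketch of the rename described above, assuming (as the comment implies) that the first word of a `.ci.yaml` target name identifies the host platform. The `retarget` helper is hypothetical and only manipulates the name string; it does not read or write any real config.

```python
def retarget(target_name, new_platform):
    """Swap the leading platform word of a target name (hypothetical helper)."""
    platform, sep, rest = target_name.partition(" ")
    if not sep or not rest:
        raise ValueError(f"expected '<Platform> <builder>', got {target_name!r}")
    return f"{new_platform} {rest}"


old_name = "Mac mac_android_aot_engine"
new_name = retarget(old_name, "Linux")
print(new_name)
```

The builder part of the name still says `mac_...` after the swap, which is likely what "will make this look weird" refers to.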

auto-submit bot pushed a commit that referenced this issue Nov 5, 2022
After all the caches have been cleaned and the logic to cleanup runtimes
has landed we may not need to always delete the caches.

Bug: #114656
@godofredoc
Contributor

CL to default to not cleaning source code caches in Web Engine builds: https://flutter-review.googlesource.com/c/recipes/+/35661

auto-submit bot pushed a commit to flutter/engine that referenced this issue Nov 7, 2022
This property was previously hardcoded in the recipe.

Bug: flutter/flutter#114656
@gaaclarke
Member

I'm not sure how you are distributing mac resources, but fwiw https://flutter-review.git.corp.google.com/c/recipes/+/35760 should lower mac clang-tidy bot execution time by 5-10m. The number of linted files is going from 800->600. Hopefully that helps a bit.

auto-submit bot pushed a commit to flutter/engine that referenced this issue Nov 8, 2022
The json translation of gclient_vars is failing because it can't parse
the boolean value as True.

Bug: flutter/flutter#114656
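
For context, one common way this kind of failure shows up (an assumption about the general failure mode, not a reconstruction of the actual recipe code) is when a Python-style `True` ends up in text that is later parsed as JSON, which only accepts lowercase `true`. The `example_flag` key below is hypothetical.

```python
import json

# Hypothetical gclient_vars payload; the key name is illustrative only.
gclient_vars = {"example_flag": True}

# Text containing a Python-style boolean literal is not valid JSON.
bad_text = '{"example_flag": True}'

# json.dumps emits the lowercase literal that JSON parsers expect.
json_text = json.dumps(gclient_vars)


def parses_as_json(text):
    """Return True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(bad_text))
print(parses_as_json(json_text))
```

Serializing with `json.dumps` instead of string formatting keeps the boolean in a form that round-trips through a JSON parser.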
auto-submit bot pushed a commit to flutter/cocoon that referenced this issue Nov 8, 2022
This is to make the behavior consistent for json lists and dictionaries.

Bug: flutter/flutter#114656
naudzghebre pushed a commit to naudzghebre/engine that referenced this issue Nov 9, 2022
This property was previously hardcoded in the recipe.

Bug: flutter/flutter#114656
naudzghebre pushed a commit to naudzghebre/engine that referenced this issue Nov 9, 2022
The json translation of gclient_vars is failing because it can't parse
the boolean value as True.

Bug: flutter/flutter#114656
whesse added a commit to whesse/engine that referenced this issue Nov 10, 2022
The support for per-build custom gclient variable overrides is
removed from the build recipes, so the unused fields in the
build configuration json are removed here.

This completes the change made in flutter#37351

Bug: flutter/flutter#114656
schwa423 pushed a commit to schwa423/engine that referenced this issue Nov 16, 2022
This property was previously hardcoded in the recipe.

Bug: flutter/flutter#114656
schwa423 pushed a commit to schwa423/engine that referenced this issue Nov 16, 2022
The json translation of gclient_vars is failing because it can't parse
the boolean value as True.

Bug: flutter/flutter#114656
@keyonghan
Contributor

Queue time has dropped to 20 mins for the past two weeks. Moving to TD to track further optimization.

@keyonghan keyonghan removed their assignment Nov 23, 2022
shogohida pushed a commit to shogohida/flutter that referenced this issue Dec 7, 2022
After all the caches have been cleaned and the logic to cleanup runtimes
has landed we may not need to always delete the caches.

Bug: flutter#114656
gspencergoog pushed a commit to gspencergoog/flutter that referenced this issue Jan 19, 2023
After all the caches have been cleaned and the logic to cleanup runtimes
has landed we may not need to always delete the caches.

Bug: flutter#114656
@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v and a minimal reproduction of the issue.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 17, 2023

4 participants