Commit 9a264de

dannykopping, matifali authored and committed
chore: improve CI reliability (#16169)
We have an effort underway to replace `dbmem` (#15109), and consequently we've been running our full test-suite (with Postgres) on all supported OSs (Windows, MacOS, and Linux) since #15520.

Since this change, we've seen a marked decrease in the success rate of our builds on `main` (note how the Windows/MacOS failures account for the vast majority of failed builds):

![image](https://github.com/user-attachments/assets/a02c15b7-037d-428a-a600-2aed60553ac0)

We're still investigating why these OSs are a lot less reliable. It's likely that the VMs on which the builds are run have different characteristics from our Ubuntu runners, such as disk I/O, network latency, or something else.

**In the meantime, we need to start trusting CI failures in `main` again, as the current failures are too noisy / vague for us to correct.**

We've also considered hosting our own runners where possible so we can get OS-level observability to rule out some possibilities.

See the [meeting notes](https://www.notion.so/coderhq/CI-Investigation-Call-Notes-17dd579be59280d8897cc9fe4bb46695?pvs=6&utm_content=17dd579b-e592-80d8-897c-c9fe4bb46695&utm_campaign=T1ZPT2FL0&n=slack&n=slack_link_unfurl) where we looked into this for more detail.

This PR introduces several changes:

1. Moves the full test-suite with Postgres on Windows/MacOS to the `nightly-gauntlet` workflow
   - tradeoff: any regressions may be more difficult to discover, since we merge to `main` several times a day
2. Runs only the CLI test-suite on each PR / merge to `main` on Windows/MacOS
3. `test-go` still runs the full test-suite against all OSs (including the CLI tests), but will be removed once #15109 is completed, since it uses `dbmem`
4. Changes `nightly-gauntlet` to run at 4AM: we've seen several instances of the runner being stopped externally, and we're _guessing_ this may have something to do with the midnight UTC execution time, when other cron jobs may run
5. Removes the existing `nightly-gauntlet` jobs since they haven't passed in a long time, indicating that nobody cares enough to fix them and they don't provide diagnostic value; we can restore them later if necessary

I've manually run both these new workflows successfully:

- `ci`: https://github.com/coder/coder/actions/runs/12825874176/job/35764724907
- `nightly-gauntlet`: https://github.com/coder/coder/actions/runs/12825539092

---------

Signed-off-by: Danny Kopping <[email protected]>
Co-authored-by: Muhammad Atif Ali <[email protected]>
1 parent 02aa1ac commit 9a264de

3 files changed, +121 -74 lines changed

.github/workflows/ci.yaml

+56 -32
@@ -378,8 +378,62 @@ jobs:
         with:
           api-key: ${{ secrets.DATADOG_API_KEY }}
 
+  # We don't run the full test-suite for Windows & MacOS, so we just run the CLI tests on every PR.
+  # We run the test suite in test-go-pg, including CLI.
+  test-cli:
+    runs-on: ${{ matrix.os == 'macos-latest' && github.repository_owner == 'coder' && 'depot-macos-latest' || matrix.os == 'windows-2022' && github.repository_owner == 'coder' && 'windows-latest-16-cores' || matrix.os }}
+    needs: changes
+    if: needs.changes.outputs.go == 'true' || needs.changes.outputs.ci == 'true' || github.ref == 'refs/heads/main'
+    strategy:
+      matrix:
+        os:
+          - macos-latest
+          - windows-2022
+    steps:
+      - name: Harden Runner
+        uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
+        with:
+          egress-policy: audit
+
+      - name: Checkout
+        uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1
+        with:
+          fetch-depth: 1
+
+      - name: Setup Go
+        uses: ./.github/actions/setup-go
+
+      - name: Setup Terraform
+        uses: ./.github/actions/setup-tf
+
+      # Sets up the ImDisk toolkit for Windows and creates a RAM disk on drive R:.
+      - name: Setup ImDisk
+        if: runner.os == 'Windows'
+        uses: ./.github/actions/setup-imdisk
+
+      - name: Test CLI
+        env:
+          TS_DEBUG_DISCO: "true"
+          LC_CTYPE: "en_US.UTF-8"
+          LC_ALL: "en_US.UTF-8"
+        shell: bash
+        run: |
+          # By default Go will use the number of logical CPUs, which
+          # is a fine default.
+          PARALLEL_FLAG=""
+
+          make test-cli
+
+      - name: Upload test stats to Datadog
+        timeout-minutes: 1
+        continue-on-error: true
+        uses: ./.github/actions/upload-datadog
+        if: success() || failure()
+        with:
+          api-key: ${{ secrets.DATADOG_API_KEY }}
+
   test-go-pg:
-    runs-on: ${{ matrix.os == 'ubuntu-latest' && github.repository_owner == 'coder' && 'depot-ubuntu-22.04-4' || matrix.os == 'macos-latest' && github.repository_owner == 'coder' && 'depot-macos-latest' || matrix.os == 'windows-2022' && github.repository_owner == 'coder' && 'windows-latest-16-cores' || matrix.os }}
+    runs-on: ${{ matrix.os == 'ubuntu-latest' && github.repository_owner == 'coder' && 'depot-ubuntu-22.04-4' || matrix.os }}
     needs: changes
     if: needs.changes.outputs.go == 'true' || needs.changes.outputs.ci == 'true' || github.ref == 'refs/heads/main'
     # This timeout must be greater than the timeout set by `go test` in
@@ -391,8 +445,6 @@ jobs:
       matrix:
         os:
           - ubuntu-latest
-          - macos-latest
-          - windows-2022
     steps:
       - name: Harden Runner
         uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
@@ -423,39 +475,11 @@ jobs:
           LC_ALL: "en_US.UTF-8"
         shell: bash
         run: |
-          # if macOS, install google-chrome for scaletests
-          # As another concern, should we really have this kind of external dependency
-          # requirement on standard CI?
-          if [ "${{ matrix.os }}" == "macos-latest" ]; then
-            brew install google-chrome
-          fi
-
           # By default Go will use the number of logical CPUs, which
           # is a fine default.
           PARALLEL_FLAG=""
 
-          # macOS will output "The default interactive shell is now zsh"
-          # intermittently in CI...
-          if [ "${{ matrix.os }}" == "macos-latest" ]; then
-            touch ~/.bash_profile && echo "export BASH_SILENCE_DEPRECATION_WARNING=1" >> ~/.bash_profile
-          fi
-
-          if [ "${{ runner.os }}" == "Linux" ]; then
-            make test-postgres
-          elif [ "${{ runner.os }}" == "Windows" ]; then
-            # Create a temp dir on the R: ramdisk drive for Windows. The default
-            # C: drive is extremely slow: https://github.com/actions/runner-images/issues/8755
-            mkdir -p "R:/temp/embedded-pg"
-            go run scripts/embedded-pg/main.go -path "R:/temp/embedded-pg"
-            # Reduce test parallelism, mirroring what we do for race tests.
-            # We'd been encountering issues with timing related flakes, and
-            # this seems to help.
-            DB=ci gotestsum --format standard-quiet -- -v -short -count=1 -parallel 4 -p 4 ./...
-          else
-            go run scripts/embedded-pg/main.go
-            # Reduce test parallelism, like for Windows above.
-            DB=ci gotestsum --format standard-quiet -- -v -short -count=1 -parallel 4 -p 4 ./...
-          fi
+          make test-postgres
 
       - name: Upload test stats to Datadog
         timeout-minutes: 1
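
Note on the `runs-on` expressions in this file: GitHub Actions expressions have no ternary operator, so the workflow chains `&&` and `||`, which evaluate to one of their operands rather than to a plain boolean. The following is a minimal bash sketch of the selection that the `test-cli` expression performs. `pick_runner` is a hypothetical helper written for illustration and is not part of the commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit) mirroring the `runs-on`
# expression of the test-cli job:
#   os == 'macos-latest'  && owner == 'coder' && 'depot-macos-latest'
#   || os == 'windows-2022' && owner == 'coder' && 'windows-latest-16-cores'
#   || os
pick_runner() {
  local os="$1" owner="$2"
  if [ "$os" = "macos-latest" ] && [ "$owner" = "coder" ]; then
    echo "depot-macos-latest"
  elif [ "$os" = "windows-2022" ] && [ "$owner" = "coder" ]; then
    echo "windows-latest-16-cores"
  else
    # Fallback: use the matrix label unchanged.
    echo "$os"
  fi
}

pick_runner macos-latest coder      # prints depot-macos-latest
pick_runner windows-2022 coder      # prints windows-latest-16-cores
pick_runner windows-2022 some-fork  # prints windows-2022
```

Under this reading, a repository owned by anyone other than `coder` falls through to the plain `matrix.os` label, so forks do not need access to the larger runners.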

.github/workflows/nightly-gauntlet.yaml

+61 -42
@@ -3,22 +3,27 @@
 name: nightly-gauntlet
 on:
   schedule:
-    # Every day at midnight
-    - cron: "0 0 * * *"
+    # Every day at 4AM
+    - cron: "0 4 * * 1-5"
   workflow_dispatch:
 
 permissions:
   contents: read
 
 jobs:
-  go-race:
-    # While GitHub's toaster runners are likelier to flake, we want consistency
-    # between this environment and the regular test environment for DataDog
-    # statistics and to only show real workflow threats.
-    runs-on: ${{ github.repository_owner == 'coder' && 'depot-ubuntu-22.04-8' || 'ubuntu-latest' }}
-    # This runner costs 0.016 USD per minute,
-    # so 0.016 * 240 = 3.84 USD per run.
-    timeout-minutes: 240
+  test-go-pg:
+    runs-on: ${{ matrix.os == 'macos-latest' && github.repository_owner == 'coder' && 'depot-macos-latest' || matrix.os == 'windows-2022' && github.repository_owner == 'coder' && 'windows-latest-16-cores' || matrix.os }}
+    if: github.ref == 'refs/heads/main'
+    # This timeout must be greater than the timeout set by `go test` in
+    # `make test-postgres` to ensure we receive a trace of running
+    # goroutines. Setting this to the timeout +5m should work quite well
+    # even if some of the preceding steps are slow.
+    timeout-minutes: 25
+    strategy:
+      matrix:
+        os:
+          - macos-latest
+          - windows-2022
     steps:
       - name: Harden Runner
         uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
@@ -27,58 +32,72 @@ jobs:
 
       - name: Checkout
         uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1
+        with:
+          fetch-depth: 1
 
       - name: Setup Go
         uses: ./.github/actions/setup-go
 
       - name: Setup Terraform
         uses: ./.github/actions/setup-tf
 
-      - name: Run Tests
-        run: |
-          # -race is likeliest to catch flaky tests
-          # due to correctness detection and its performance
-          # impact.
-          gotestsum --junitfile="gotests.xml" -- -timeout=240m -count=10 -race ./...
+      # Sets up the ImDisk toolkit for Windows and creates a RAM disk on drive R:.
+      - name: Setup ImDisk
+        if: runner.os == 'Windows'
+        uses: ./.github/actions/setup-imdisk
 
-      - name: Upload test results to DataDog
-        uses: ./.github/actions/upload-datadog
-        if: always()
-        with:
-          api-key: ${{ secrets.DATADOG_API_KEY }}
+      - name: Test with PostgreSQL Database
+        env:
+          POSTGRES_VERSION: "13"
+          TS_DEBUG_DISCO: "true"
+          LC_CTYPE: "en_US.UTF-8"
+          LC_ALL: "en_US.UTF-8"
+        shell: bash
+        run: |
+          # if macOS, install google-chrome for scaletests
+          # As another concern, should we really have this kind of external dependency
+          # requirement on standard CI?
+          if [ "${{ matrix.os }}" == "macos-latest" ]; then
+            brew install google-chrome
+          fi
 
-  go-timing:
-    # We run these tests with p=1 so we don't need a lot of compute.
-    runs-on: ${{ github.repository_owner == 'coder' && 'depot-ubuntu-22.04' || 'ubuntu-latest' }}
-    timeout-minutes: 10
-    steps:
-      - name: Harden Runner
-        uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
-        with:
-          egress-policy: audit
+          # By default Go will use the number of logical CPUs, which
+          # is a fine default.
+          PARALLEL_FLAG=""
 
-      - name: Checkout
-        uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1
+          # macOS will output "The default interactive shell is now zsh"
+          # intermittently in CI...
+          if [ "${{ matrix.os }}" == "macos-latest" ]; then
+            touch ~/.bash_profile && echo "export BASH_SILENCE_DEPRECATION_WARNING=1" >> ~/.bash_profile
+          fi
 
-      - name: Setup Go
-        uses: ./.github/actions/setup-go
+          if [ "${{ runner.os }}" == "Windows" ]; then
+            # Create a temp dir on the R: ramdisk drive for Windows. The default
+            # C: drive is extremely slow: https://github.com/actions/runner-images/issues/8755
+            mkdir -p "R:/temp/embedded-pg"
+            go run scripts/embedded-pg/main.go -path "R:/temp/embedded-pg"
+          else
+            go run scripts/embedded-pg/main.go
+          fi
 
-      - name: Run Tests
-        run: |
-          gotestsum --junitfile="gotests.xml" -- --tags="timing" -p=1 -run='_Timing/' ./...
+          # Reduce test parallelism, mirroring what we do for race tests.
+          # We'd been encountering issues with timing related flakes, and
+          # this seems to help.
+          DB=ci gotestsum --format standard-quiet -- -v -short -count=1 -parallel 4 -p 4 ./...
 
-      - name: Upload test results to DataDog
+      - name: Upload test stats to Datadog
+        timeout-minutes: 1
+        continue-on-error: true
         uses: ./.github/actions/upload-datadog
-        if: always()
+        if: success() || failure()
         with:
           api-key: ${{ secrets.DATADOG_API_KEY }}
 
   notify-slack-on-failure:
     needs:
-      - go-race
-      - go-timing
+      - test-go-pg
     runs-on: ubuntu-latest
-    if: failure()
+    if: failure() && github.ref == 'refs/heads/main'
 
     steps:
       - name: Send Slack notification
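
Note on the new schedule: the five cron fields are minute, hour, day of month, month, and day of week, and GitHub Actions evaluates schedules in UTC. `0 4 * * 1-5` therefore fires at 04:00 UTC on Monday through Friday only, which is narrower than the "Every day at 4AM" comment suggests. A small bash sketch of that matching rule follows; `matches_gauntlet_schedule` is a hypothetical helper for illustration and is not part of the commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): succeeds when the given UTC
# minute, hour, and day of week match the cron line "0 4 * * 1-5".
# Day of week uses cron numbering: 0 = Sunday ... 6 = Saturday.
# The day-of-month and month fields are "*", so they are not checked.
matches_gauntlet_schedule() {
  local minute="$1" hour="$2" dow="$3"
  [ "$minute" -eq 0 ] && [ "$hour" -eq 4 ] && [ "$dow" -ge 1 ] && [ "$dow" -le 5 ]
}

matches_gauntlet_schedule 0 4 3 && echo "Wednesday 04:00 UTC: runs"
matches_gauntlet_schedule 0 4 6 || echo "Saturday 04:00 UTC: skipped"
matches_gauntlet_schedule 0 0 1 || echo "Monday 00:00 UTC: skipped"
```

Because `workflow_dispatch` is also declared, the workflow can still be started manually outside this window.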

Makefile

+4
@@ -807,6 +807,10 @@ test:
 	$(GIT_FLAGS) gotestsum --format standard-quiet -- -v -short -count=1 ./...
 .PHONY: test
 
+test-cli:
+	$(GIT_FLAGS) gotestsum --format standard-quiet -- -v -short -count=1 ./cli/...
+.PHONY: test-cli
+
 # sqlc-cloud-is-setup will fail if no SQLc auth token is set. Use this as a
 # dependency for any sqlc-cloud related targets.
 sqlc-cloud-is-setup:
