Remove the in-memory database #15109

Open
hugodutka opened this issue Oct 16, 2024 · 12 comments

@hugodutka
Contributor

The in-memory database is currently only used in tests. It was originally added to ensure tests that do not depend on complex db logic can pass quickly.

However, I recently ran performance tests on the postgres and in-memory test suites (both full suites and individual tests), and it seems there's almost no timing difference between them. After talking to @kylecarbs, we've decided to get rid of the in-memory database and see what happens. If it turns out there was a good reason for it to exist, we'll bring it back.

Coder currently supports the `--in-memory` flag. It should be deprecated and replaced by an `--ephemeral` flag, which would initialize a fresh Postgres database on every run.

@coder-labeler bot added the "api" (Area: HTTP API) and "needs decision" (Needs a higher-level decision to be unblocked) labels Oct 16, 2024
@johnstcn
Member

@hugodutka You might want to also look at references to `dbtestutil.WillUsePostgres` when you do this :-)

@matifali removed the "needs decision" and "api" labels Oct 17, 2024
@spikecurtis
Contributor

> I recently ran performance tests on the postgres and in-memory test suites (both full suites and individual tests), and it seems there's almost no timing difference between them.

Can you share more details? Like, how you tested and what the numbers were?

In CI we use a 4-core runner for the in-memory tests and an 8-core runner for the postgres tests, and they both take 4 minutes.

Just anecdotally, when I run the coderd test suite with the in-memory database, most tests take less than 1 sec, but with DB=ci most take 10s+.

@hugodutka
Contributor Author

For a single test run in isolation here’s what I did:

  1. Checked out https://github.com/coder/coder to the latest main at the time (79d24d2)
  2. Spun up a devcontainer from .devcontainer/devcontainer.json
  3. Started a postgres 16 instance with docker compose with no optimizations (https://gist.github.com/hugodutka/9c19b7608dda5c33d0154166dade9789)
  4. Ran go run scripts/migrate-ci/main.go to create a template database
  5. Ran a single test with postgres and without postgres:

```console
coder@87ae820df95f:/workspaces/coder$ time DB=ci DB_FROM=ciepcpnjfbpj gotestsum -- -v -count=1 ./enterprise/coderd -run TestCreateGroup/Audit
✓  enterprise/coderd (250ms)

DONE 2 tests in 2.659s

real    0m2.663s
user    0m3.639s
sys     0m1.392s

coder@87ae820df95f:/workspaces/coder$ time gotestsum -- -v -count=1 ./enterprise/coderd -run TestCreateGroup/Audit
✓  enterprise/coderd (86ms)

DONE 2 tests in 2.480s

real    0m2.483s
user    0m3.534s
sys     0m1.452s
```

@spikecurtis when running tests with DB=ci do you also use DB_FROM? If yes, can you link to a test that takes 10s+? I'd definitely take a second look.

@hugodutka
Contributor Author

For the full suite, I changed this line to return false so that both in-memory and postgres would run the same tests.

Postgres Test:

```sh
go clean -cache
DB=ci DB_FROM=cikwknhryewg gotestsum \
	--junitfile="gotests.xml" \
	--jsonfile="gotests.json" \
	--packages="./..." -- \
	-timeout=20m \
	-failfast \
	-count=1
```

Took 108s without build cache.

In-Memory Test:

```sh
go clean -cache
gotestsum \
	--junitfile="gotests.xml" \
	--jsonfile="gotests.json" \
	--packages="./..." -- \
	-timeout=20m \
	-failfast \
	-count=1
```

Took 121s without build cache.

I ran all tests on a local machine: AMD Ryzen 7 5800X, 32 GB RAM, and an NVMe SSD.

@Emyrk
Member

Emyrk commented Oct 17, 2024

> @hugodutka You might want to also look at references to `dbtestutil.WillUsePostgres` when you do this :-)

Some tests using the in-memory implementation intentionally do not insert all the data dependencies needed to run the test. The real db has foreign key constraints, so those tests would probably fail now.

On the other side of things, tests that are insert-heavy use the in-memory implementation, as I found that many inserts make the db tests slower.

@johnstcn
Member

johnstcn commented Oct 17, 2024

> tests that are insert-heavy use the in-memory implementation, as I found that many inserts make the db tests slower

I wonder if it makes sense to allow seeding the database with some raw SQL, e.g. `dbtestutil.FromDump("path/to/dump.sql")`?
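
For illustration, here is a minimal sketch of what such a helper could look like. The name and signature are only the ones proposed above, not an existing API, and it assumes the driver accepts a multi-statement `Exec` (lib/pq's simple query protocol does when no parameters are passed):

```go
package dbtestutil

import (
	"database/sql"
	"os"
	"testing"
)

// FromDump is a hypothetical helper (as proposed above) that seeds a test
// database from a raw SQL dump instead of issuing many per-row inserts.
func FromDump(t *testing.T, db *sql.DB, path string) {
	t.Helper()

	raw, err := os.ReadFile(path)
	if err != nil {
		t.Fatalf("read dump %s: %v", path, err)
	}
	// Applying the whole dump in a single Exec keeps round trips to a minimum,
	// which is the point of seeding from a dump rather than via per-row inserts.
	if _, err := db.Exec(string(raw)); err != nil {
		t.Fatalf("apply dump %s: %v", path, err)
	}
}
```

Usage in a test would then be roughly `dbtestutil.FromDump(t, db, "path/to/dump.sql")` right after opening the test database.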

@Emyrk
Member

Emyrk commented Oct 17, 2024

@johnstcn I think that would be huge, but `dbgen` is such a clutch tool. I wonder if we can make `dbgen` batch a set of inserts if run in an `InTx` or something. Unsure how to design it.

I just do not want to have to write raw SQL for a test setup 😢

@spikecurtis
Contributor

@hugodutka I was not running with DB_FROM.

I ran `make test-postgres`, which sets this up, on my M1 Mac, and the test time was 9 minutes (vs. 2 minutes for `make test`). The `test-postgres` target had a bunch of failures, which could be related to the longer test times. It seems likely that some of our Postgres tests assume Linux, since that's the only OS we've been diligent about testing Postgres on.

It's also a challenge to use DB_FROM within GoLand. The default run configuration's environment variable setup doesn't interpret `$(shell ...)` commands and just passes them verbatim, so it doesn't actually do the initial setup of a database with the right schema.

These issues are probably solvable with some effort, but just understand that it's going to be more complicated than just deleting a bunch of code.

@hugodutka
Contributor Author

@spikecurtis Thanks for the details! I checked the Docker image for Postgres that make test-postgres uses (gcr.io/coder-dev-1/postgres), and it seems to be linux/amd64 only. That would explain the longer test times on your M1 Mac, as it's likely running through Rosetta.

I totally agree with you - there are a lot of little details to consider, and it's not as simple as just deleting a bunch of code. For instance, to make Postgres tests snappy, we can't afford to run migrations for every test. Since using DB_FROM can be easy to forget or tricky, as you mentioned, I plan to write some code to automatically create a template database when tests are run and cache it. This template would only be recreated if any migration files change, allowing you to run tests multiple times without rerunning migrations each time.

The issue description was a bit light on specifics, but I was planning to work out those details in the PR. Appreciate your input - it helps to make sure the PR will be useful.

@Emyrk
Member

Emyrk commented Oct 18, 2024

@hugodutka Just some extra random information.

If you run Coder locally, we default to running a local Postgres database. This database persists its data somewhere on disk; I can't recall exactly where (~/.config/coderv2/postgres/?). So there is precedent for storing Postgres data locally.

You mentioned saving a template cache; would this be something like a pg_dump? I'm curious whether we could have test-postgres just save data to disk and actually reuse the Postgres data to skip a lot of setup.

We would have to make sure to clean up those resources, though, if we don't use ephemeral Docker volumes. Might be more work than it's worth 🤷‍♂

@hugodutka
Contributor Author

hugodutka commented Oct 18, 2024

@Emyrk Regarding the template cache, I was thinking of a regular database in whatever Postgres instance the user is already running. My plan is to name the template database after a hash of the contents of all migration files. When a test runs, it checks whether such a database already exists and, if not, creates it. It then creates its own test database with `CREATE DATABASE <test_db> TEMPLATE <hash>;`. It wouldn't require any extra cleanup.
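
A rough sketch of that flow, assuming lib/pq and a hypothetical `migrationsDir` argument; the actual implementation (landed later in #15336, linked below) may differ:

```go
package dbtestutil

import (
	"crypto/sha256"
	"database/sql"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"

	"github.com/lib/pq"
)

// hashMigrations derives a name suffix from the contents of every migration
// file, so the template database is rebuilt only when migrations change.
func hashMigrations(dir string) (string, error) {
	entries, err := os.ReadDir(dir) // ReadDir returns entries sorted by name
	if err != nil {
		return "", err
	}
	h := sha256.New()
	for _, e := range entries {
		b, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			return "", err
		}
		h.Write(b)
	}
	return hex.EncodeToString(h.Sum(nil))[:12], nil
}

// ensureTemplateAndClone creates (if needed) a template database named after
// the migrations hash, then clones it for a single test. Error handling and
// the migration run itself are abbreviated.
func ensureTemplateAndClone(db *sql.DB, migrationsDir, testDB string) error {
	hash, err := hashMigrations(migrationsDir)
	if err != nil {
		return err
	}
	tmpl := "tpl_" + hash

	var exists bool
	err = db.QueryRow(`SELECT EXISTS (SELECT 1 FROM pg_database WHERE datname = $1)`, tmpl).Scan(&exists)
	if err != nil {
		return err
	}
	if !exists {
		if _, err := db.Exec(fmt.Sprintf("CREATE DATABASE %s IS_TEMPLATE true", pq.QuoteIdentifier(tmpl))); err != nil {
			return err
		}
		// ... connect to tmpl and run the migration suite here ...
	}

	// Each test gets its own database cloned from the template, skipping the
	// cost of re-running migrations entirely.
	_, err = db.Exec(fmt.Sprintf("CREATE DATABASE %s TEMPLATE %s",
		pq.QuoteIdentifier(testDB), pq.QuoteIdentifier(tmpl)))
	return err
}
```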

@hugodutka
Contributor Author

hugodutka commented Nov 1, 2024

Progress Update

I've completed the conversion of the entire test suite to use PostgreSQL, fully removing dbmem in #15291. The suite is now passing on Linux. However, the PR is substantial, with 3.3k lines added and 12k lines removed, making it too large for a straightforward merge. I’ll be splitting it into smaller, more manageable PRs, each focused on a specific aspect.

As I started enabling PostgreSQL in CI for Windows, macOS, and race tests, I learned that:

  • The PostgreSQL race tests were already broken on the main branch.
  • Docker isn't readily available in CI for Windows and macOS, and I couldn't find a way to set it up easily. Since our PostgreSQL setup depends on Docker, I switched to an embedded setup. However, this significantly affected performance: tests that took 3 minutes on Linux ballooned to 8 minutes on macOS and were even slower on Windows. After analyzing this, it seems PostgreSQL performs poorly on filesystems other than ext4. Even with optimizations (fsync=off, synchronous_commit=off, and running on a ramdisk), performance lagged. The main culprit appears to be the `CREATE DATABASE ... TEMPLATE` command used in every test: on my MacBook Pro it took around 30ms in a Linux VM but consistently took 100ms natively on macOS. A possible solution is creating databases in bulk, which amortizes the latency; in my macOS experiments, creating 250 databases in one batch averaged around 10ms per database. Once built into our test utilities (see the sketch after this list), this approach could speed up Linux tests too, by maintaining a pool of databases in the background for quick use.
  • There are subtle OS differences causing some tests to fail on macOS and Windows, though I haven’t fully investigated the root causes.
  • Converting the few hundred in-memory-only tests to PostgreSQL has had minimal impact on CI runtime. Both the main branch and the branch without dbmem complete in around 3 minutes. Here’s a CI run for reference.
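
Here is a minimal sketch of the batching idea mentioned above, assuming the template database already exists; the pool type and names are illustrative, not the actual test utility:

```go
package dbtestutil

import (
	"database/sql"
	"fmt"

	"github.com/lib/pq"
)

// dbPool pre-creates test databases from a template in one batch, so the
// per-test cost of CREATE DATABASE ... TEMPLATE is paid up front and amortized.
type dbPool struct {
	names chan string
}

func newDBPool(db *sql.DB, tmpl string, size int) (*dbPool, error) {
	p := &dbPool{names: make(chan string, size)}
	for i := 0; i < size; i++ {
		name := fmt.Sprintf("test_%s_%d", tmpl, i)
		_, err := db.Exec(fmt.Sprintf("CREATE DATABASE %s TEMPLATE %s",
			pq.QuoteIdentifier(name), pq.QuoteIdentifier(tmpl)))
		if err != nil {
			return nil, err
		}
		p.names <- name
	}
	return p, nil
}

// Acquire hands a pre-created database to a test; a background goroutine could
// keep topping the pool up so tests never wait on database creation.
func (p *dbPool) Acquire() string { return <-p.names }
```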

Additionally, I discovered that the 10s+ latency reported by some developers on individual tests was likely due to each test instance creating a new PostgreSQL container if DB_FROM wasn’t set.

My plan is to proceed with a series of PRs to:

  • Reuse a single PostgreSQL container and template database across tests (chore: add postgres template caching for tests #15336).
  • Enable PostgreSQL in CI for Windows, macOS, and race tests. Initially, these will be non-blocking, as they may still be unstable, but this should give us visibility into their status.
  • Migrate the entire test suite to use PostgreSQL, with each PR addressing a specific part of the codebase.
  • Fix failing tests on Windows and macOS.
  • Optimize test performance on Windows and macOS.
  • Once CI is stable, performance is optimized, and all tests are using PostgreSQL, remove the dbmem code entirely.

hugodutka added a commit that referenced this issue Nov 4, 2024
This PR is the first in a series aimed at closing
[#15109](#15109).

### Changes

- **Template Database Creation:**  
`dbtestutil.Open` now has the ability to create a template database if
none is provided via `DB_FROM`. The template database’s name is derived
from a hash of the migration files, ensuring that it can be reused
across tests and is automatically updated whenever migrations change.

- **Optimized Database Handling:**  
Previously, `dbtestutil.Open` would spin up a new container for each
test when `DB_FROM` was unset. Now, it first checks for an active
PostgreSQL instance on `localhost:5432`. If none is found, it creates a
single container that remains available for subsequent tests,
eliminating repeated container startups.

These changes address the long individual test times (10+ seconds)
reported by some users, likely caused by the time it took to start
Docker and run migrations.
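
As a rough illustration of the connection-reuse check described above (the container-start helper and connection string are stand-ins, not the real code):

```go
package dbtestutil

import (
	"errors"
	"net"
	"time"
)

// openTestPostgres sketches the behaviour described in the PR: if something is
// already listening on localhost:5432, reuse it; otherwise start one long-lived
// container for all subsequent tests.
func openTestPostgres() (string, error) {
	conn, err := net.DialTimeout("tcp", "localhost:5432", time.Second)
	if err == nil {
		_ = conn.Close()
		// Something is already listening; reuse it instead of starting Docker.
		return "postgres://postgres:postgres@localhost:5432/postgres?sslmode=disable", nil
	}
	return startPostgresContainer()
}

// startPostgresContainer is a placeholder for whatever actually launches the
// Docker container and waits for it to accept connections.
func startPostgresContainer() (string, error) {
	return "", errors.New("not implemented in this sketch")
}
```
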
hugodutka added a commit that referenced this issue Dec 3, 2024
This PR is the second in a series aimed at closing
#15109.

## Changes

- adds `scripts/embedded-pg/main.go`, which can start a native Postgres
database. This is used to set up PG on Windows and macOS, as these
platforms don't support Docker in GitHub Actions.
- runs the `test-go-pg` job on macOS and Windows too
- adds the `test-go-race-go` job, which runs race tests with Postgres on
Linux
hugodutka added a commit that referenced this issue Jan 8, 2025
Another PR to address #15109.

- adds the `DisableForeignKeysAndTriggers` utility, which simplifies
converting tests from in-mem to Postgres
- converts the dbauthz test suite to pass on both the in-mem db and
Postgres
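
For context, one way such a utility could work on the Postgres side is shown below. It relies on `session_replication_role`, a real Postgres setting that suppresses trigger firing (foreign keys are enforced via system triggers); this is a sketch of the idea, not necessarily how the actual `dbtestutil` helper is implemented:

```go
package dbtestutil

import (
	"database/sql"
	"testing"
)

// disableForeignKeysAndTriggers relaxes FK and trigger enforcement so tests
// can insert rows without first creating every parent record, mirroring what
// the in-memory implementation allowed.
func disableForeignKeysAndTriggers(t *testing.T, db *sql.DB) {
	t.Helper()
	// session_replication_role applies per session, so with a pooled *sql.DB a
	// real helper would pin a single connection (sql.Conn) or set it for the
	// whole test database. It also requires elevated privileges, which test
	// databases typically run with.
	if _, err := db.Exec(`SET session_replication_role = 'replica'`); err != nil {
		t.Fatalf("disable FK/trigger enforcement: %v", err)
	}
}
```
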
dannykopping added a commit that referenced this issue Jan 20, 2025
We have an effort underway to replace `dbmem` (#15109), and consequently
we've begun running our full test-suite (with Postgres) on all supported
OSs - Windows, MacOS, and Linux, since #15520.

Since this change, we've seen a marked decrease in the success rate of
our builds on `main` (note how the Windows/MacOS failures account for
the vast majority of failed builds):


![image](https://github.com/user-attachments/assets/a02c15b7-037d-428a-a600-2aed60553ac0)

We're still investigating why these OSs are a lot less reliable. It's
likely that the VMs on which the builds are run have different
characteristics from our Ubuntu runners such as disk I/O, network
latency, or something else.

**In the meantime, we need to start trusting CI failures in `main`
again, as the current failures are too noisy / vague for us to
correct.**

We've also considered hosting our own runners where possible so we can
get OS-level observability to rule out some possibilities.

See the [meeting
notes](https://www.notion.so/coderhq/CI-Investigation-Call-Notes-17dd579be59280d8897cc9fe4bb46695?pvs=6&utm_content=17dd579b-e592-80d8-897c-c9fe4bb46695&utm_campaign=T1ZPT2FL0&n=slack&n=slack_link_unfurl)
where we linked into this for more detail.

This PR introduces several changes:

1. Moves the full test-suite with Postgres on Windows/MacOS to the
`nightly-gauntlet` workflow.
   Tradeoff: this means that any regressions may be more difficult to
discover, since we merge to main several times a day
2. Run only the CLI test-suite on each PR / merge to `main` on
Windows/MacOS
3. `test-go` is still running the full test-suite against all OSs
(including the CLI ones), but will soon be removed once #15109 is
completed since it uses `dbmem`
4. Changes `nightly-gauntlet` to run at 4AM: we've seen several
instances of the runner being stopped externally, and we're _guessing_
this may have something to do with the midnight UTC execution time, when
other cron jobs may run
5. Removes the existing `nightly-gauntlet` jobs since they haven't
passed in a long time, indicating that nobody cares enough to fix them
and they don't provide diagnostic value; we can restore them later if
necessary

I've manually run both these new workflows successfully:

- `ci`:
https://github.com/coder/coder/actions/runs/12825874176/job/35764724907
- `nightly-gauntlet`:
https://github.com/coder/coder/actions/runs/12825539092

---------

Signed-off-by: Danny Kopping <[email protected]>
Co-authored-by: Muhammad Atif Ali <[email protected]>
hugodutka added a commit that referenced this issue Jan 20, 2025
Another PR to address #15109.

Changes:
- Introduces the `--ephemeral` flag, which changes the Coder config
directory to a temporary location. The config directory is where the
built-in PostgreSQL stores its data, so using a new one results in a
deployment with a fresh state.

The `--ephemeral` flag is set to replace the `--in-memory` flag once the
in-memory database is removed.
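
A minimal sketch of the behaviour this describes, assuming the server resolves its config directory at startup; the function and flag wiring are illustrative, not the actual CLI code:

```go
package main

import "os"

// resolveConfigDir points the config directory (where the built-in Postgres
// keeps its data) at a throwaway temp dir when --ephemeral is set, so every
// run starts from a clean state. Names here are illustrative.
func resolveConfigDir(ephemeral bool, defaultDir string) (dir string, cleanup func(), err error) {
	if !ephemeral {
		return defaultDir, func() {}, nil
	}
	dir, err = os.MkdirTemp("", "coder-ephemeral-")
	if err != nil {
		return "", nil, err
	}
	// Removing the temp dir on exit discards all deployment state.
	return dir, func() { _ = os.RemoveAll(dir) }, nil
}
```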