Description
Problem
We have a few symmetric keys, used for signing (and sometimes encrypting) various payloads, that are never rotated after creation. We've already encountered some friction with our more security-conscious customers concerning our External Provisioners and pre-shared keys, and the only reason we haven't had more pushback on our symmetric key usage is that they are simply unaware of what is happening under the hood.
We already have three features (workspace apps, peer reconnection tokens, and a key used to convert built-in users to OAuth) that require signing keys, and it's possible we may introduce more in the future. We should take the initiative while the debt is somewhat low and implement a system for rotating these internal keys.
Proposal
We will implement a user-configurable rotation schedule where keys are rotated based on an expiration. We should start with a single value that dictates the schedule for all keys. Monthly will be the default. On startup we will spawn a process that checks on some cadence (every 10 minutes?) to see if any keys need to be rotated. If an active key is within 1 hour of its expiration, we will create a new key and set its `starts_at` equal to the expiration of the old key.
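The rotation check described here could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the `Key` struct, the function names, the `workspace_apps` feature name, and the 30-day stand-in for "monthly" are all assumptions made for the example. Expiration is taken to be `starts_at` plus the configured key duration.

```go
package main

import (
	"fmt"
	"time"
)

// Key is a hypothetical in-memory view of one signing key.
type Key struct {
	Feature  string
	Sequence int
	StartsAt time.Time
}

// rotationWindow is the lead time from the proposal: a successor is created
// once the active key is within one hour of its expiration.
const rotationWindow = time.Hour

// expiresAt computes the key's expiration as starts_at + key duration.
func expiresAt(k Key, keyDuration time.Duration) time.Time {
	return k.StartsAt.Add(keyDuration)
}

// needsRotation reports whether now is within rotationWindow of the key's
// expiration (or already past it).
func needsRotation(k Key, keyDuration time.Duration, now time.Time) bool {
	return !now.Before(expiresAt(k, keyDuration).Add(-rotationWindow))
}

// successor builds the next key for the same feature. Its StartsAt equals
// the expiration of the old key, as the proposal specifies.
func successor(old Key, keyDuration time.Duration) Key {
	return Key{
		Feature:  old.Feature,
		Sequence: old.Sequence + 1,
		StartsAt: expiresAt(old, keyDuration),
	}
}

// demoStart and demoDuration are illustrative values only. Treating
// "monthly" as 30 days is an assumption.
var demoStart = time.Date(2024, time.January, 1, 0, 0, 0, 0, time.UTC)

const demoDuration = 30 * 24 * time.Hour

func main() {
	active := Key{Feature: "workspace_apps", Sequence: 1, StartsAt: demoStart}
	now := demoStart.Add(demoDuration - 30*time.Minute) // 30 minutes before expiry
	if needsRotation(active, demoDuration, now) {
		next := successor(active, demoDuration)
		fmt.Println("create key", next.Sequence, "starting at", next.StartsAt.Format(time.RFC3339))
	}
}
```

A real loop would also need to skip features that already have a successor key, otherwise every tick inside the final hour would insert another one.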
Implementation Notes
- Expiration is a computed value defined as `starts_at` + `key_duration`, where `key_duration` is a value provided at runtime by the user. `deletes_at` will be populated when a new key is inserted for the feature. It is defined as `starts_at` from the newest key + `token_duration` + `1h`.
- We create new keys once existing keys are within an hour of their expiration so that we have plenty of time to propagate the new key to other services (aka workspace proxies).
- Keys are valid for verifying if `now()` < `deletes_at` or `deletes_at == NULL`.
- Keys should only be used for signing if `starts_at` <= `now()` < `deletes_at`.
- When a key breaches its `deletes_at` we will set the `secret` field to NULL.
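The verification and signing rules above can be expressed as two small predicates. This is a sketch under stated assumptions: the `Key` struct and function names are hypothetical, a nil `DeletesAt` stands in for SQL NULL, and the signing rule treats a NULL `deletes_at` as "no upper bound", which the notes do not state explicitly.

```go
package main

import (
	"fmt"
	"time"
)

// Key holds only the columns that matter for validity checks.
// DeletesAt is a pointer so that nil can represent SQL NULL.
type Key struct {
	StartsAt  time.Time
	DeletesAt *time.Time
}

// canVerify implements: now() < deletes_at OR deletes_at == NULL.
func canVerify(k Key, now time.Time) bool {
	return k.DeletesAt == nil || now.Before(*k.DeletesAt)
}

// canSign implements: starts_at <= now() < deletes_at.
// Assumption: a NULL deletes_at imposes no upper bound, so the newest key
// becomes signable as soon as it has started.
func canSign(k Key, now time.Time) bool {
	return !now.Before(k.StartsAt) && canVerify(k, now)
}

// demoNow and demoStep are illustrative values only.
var demoNow = time.Date(2024, time.February, 1, 12, 0, 0, 0, time.UTC)

const demoStep = time.Hour

func main() {
	deletesAt := demoNow.Add(3 * demoStep)
	old := Key{StartsAt: demoNow.Add(-24 * demoStep), DeletesAt: &deletesAt}
	pending := Key{StartsAt: demoNow.Add(demoStep)} // created ahead of time, not started yet

	fmt.Println(canVerify(old, demoNow), canSign(old, demoNow))         // true true
	fmt.Println(canVerify(pending, demoNow), canSign(pending, demoNow)) // true false
}
```

The pending key is already accepted for verification but not yet used for signing, which is the behavior the `starts_at` column exists to provide.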
The following are the various token durations for our current signing keys:
- WorkspaceApps: 1m
- OAuth account conversion: 5m
- Peer Reconnection: 24h
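As a worked example of the `deletes_at` rule (newest key's `starts_at` + `token_duration` + `1h`), the sketch below applies it to the three token durations listed above. The feature name strings and the helper names are hypothetical. The reasoning, as I read it, is that the old key must stay verifiable at least as long as the longest-lived token it could have signed.

```go
package main

import (
	"fmt"
	"time"
)

// tokenDurations uses the values from the list above. The map keys are
// hypothetical feature names chosen for this example.
var tokenDurations = map[string]time.Duration{
	"workspace_apps": time.Minute,
	"oauth_convert":  5 * time.Minute,
	"peer_reconnect": 24 * time.Hour,
}

// gracePeriod is the extra 1h added on top of the token duration.
const gracePeriod = time.Hour

// deletesAt returns when the previous key for a feature may be deleted:
// the newest key's starts_at + token_duration + 1h.
// The second return value is false for an unknown feature.
func deletesAt(feature string, newestStartsAt time.Time) (time.Time, bool) {
	d, ok := tokenDurations[feature]
	if !ok {
		return time.Time{}, false
	}
	return newestStartsAt.Add(d + gracePeriod), true
}

// demoStartsAt is an illustrative starts_at for the newest key.
var demoStartsAt = time.Date(2024, time.March, 1, 0, 0, 0, 0, time.UTC)

func main() {
	for _, feature := range []string{"workspace_apps", "oauth_convert", "peer_reconnect"} {
		at, _ := deletesAt(feature, demoStartsAt)
		fmt.Printf("%-16s %s\n", feature, at.Format(time.RFC3339))
	}
}
```

With a new key starting at 2024-03-01T00:00:00Z, the old keys would be deleted at 01:01:00Z, 01:05:00Z, and 2024-03-02T01:00:00Z respectively.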
Schema Updates
Right now keys are part of the `site_config`. I propose that we migrate them into their own proper table. The table will be called `keys` with the following columns.

| feature (text) | sequence (integer) | secret (text) | starts_at (timestamptz) | deletes_at (timestamptz) |
|---|---|---|---|---|

Where the Primary Key is `(feature, sequence)`.

The `starts_at` column is a bit strange, but since we will be creating keys an hour ahead of time we should avoid using the newer keys until they've been properly propagated.
Considerations
High Availability
The query to insert new keys needs to take HA deployments into consideration. As a result we will use the `RepeatableRead` isolation level along with some row locking.
Workspace Proxies
We will refetch keys by leveraging our existing RegisterWorkspaceProxyLoop. The loop runs every 15s by default so 1 hour is more than sufficient to ensure proper propagation.
Other Requirements
- Part of the startup process should be checking whether the new rotation schedule immediately invalidates the expiration of any existing keys, and handling that accordingly
- All keys should be encrypted using `dbcrypt`
- All key rotations should be audited
- We should fix our use of multiple JWT libraries. I have no opinion on which library to select but we should come to some sort of conclusion.
Implementation
- Add schema-related changes (db*, migrations, etc)
- Implement `coderd/keyrotate` package
- Update workspace proxies to be compatible with key updates
- Centralize key-signing logic into `coderd/keysigning` package
- Migrate keys to new `crypto_keys` table and implement remaining glue