refactor of channel persistence to use UUIDs #2028
Conversation
	return nil, errInsufficientPrivs, false
}
// enforce confusables
if !registered && (cm.chansSkeletons.Has(skeleton) || cm.registeredSkeletons.Has(skeleton)) {
It looks like this removes the exception for registered channels which are confusable with another channel. Is that intentional?
The idea is that registered channels are now always loaded, and therefore always present in chans and chansSkeletons (even when they are purged), so there's no need to treat them differently from other channels.
return nil
// TODO we need a better story about error handling for later
if err = cm.server.dstore.Set(datastore.TableChannelPurges, record.UUID, purgeBytes, time.Time{}); err != nil {
	cm.server.logger.Error("datastore", "couldn't store purge record", chname, err.Error())
Shouldn't this at least return an error so that the oper knows something is wrong?
The current story about this is not ideal. I haven't fully decided about what to do here, but I think in general, datastore failures will not necessarily cause the underlying operation to fail in full. For example, in the case of CS PURGE ADD, the purge actually gets added to the in-memory datastructure no matter what and will be enforced as long as the ircd is running.
I think long-term (once I actually introduce a datastore where writes can fail for non-catastrophic reasons), the strategy will be:
- Have a (bounded) queue for asynchronously retrying sets and deletes
- Have an option for alerting the operator to failed datastore operations (a snomask?)
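To make the first idea concrete, here is a rough sketch of what a bounded retry queue could look like. Everything in it is hypothetical (none of these names exist in this branch), and a real version would need to hook into the logger and whatever operator-alerting mechanism ends up being chosen:

```go
package retryqueue

import "time"

// Op captures a failed datastore write or delete for later retry, e.g. a
// closure over dstore.Set or dstore.Delete with the original arguments.
type Op func() error

// RetryQueue retries failed datastore operations in the background. The
// buffered channel bounds memory use: when the queue is full, the
// operation is dropped and reported via onDrop (which could eventually
// feed an operator alert such as a snomask).
type RetryQueue struct {
	ops    chan Op
	onDrop func(error)
}

func New(size int, interval time.Duration, onDrop func(error)) *RetryQueue {
	q := &RetryQueue{ops: make(chan Op, size), onDrop: onDrop}
	go q.run(interval)
	return q
}

// Enqueue schedules op for retry without ever blocking the caller.
func (q *RetryQueue) Enqueue(op Op, cause error) {
	select {
	case q.ops <- op:
	default:
		q.onDrop(cause) // queue full: give up and alert
	}
}

func (q *RetryQueue) run(interval time.Duration) {
	for op := range q.ops {
		if err := op(); err != nil {
			time.Sleep(interval) // back off, then requeue (or drop if full again)
			q.Enqueue(op, err)
		}
	}
}
```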
	return errNoSuchChannel
}
if err := cm.server.dstore.Delete(datastore.TableChannelPurges, record.UUID); err != nil {
	cm.server.logger.Error("datastore", "couldn't delete purge record", chname, err.Error())
Same here.
for cfname, entry := range cm.chans {
	if entry.channel.Founder() == account {
		channels = append(channels, cfname)
Any performance concern about this function being O(n)? It's called from user-facing functions like checkChanLimit.
I felt a bit conflicted about this. It's no worse than LIST is currently. On the other hand, LIST is commonly special-cased by fakelag systems for being more expensive than other commands (although possibly for bandwidth reasons rather than CPU utilization?).
I think it's probably OK to leave this unoptimized for now. We could put a rate limit on the relevant chanserv operations (and LIST?) if it becomes a problem.
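For reference, a rate limit of the kind mentioned above could be as simple as a per-client token bucket. This is purely an illustrative sketch, not code from this branch or from ergo's existing throttling utilities:

```go
package limiter

import (
	"sync"
	"time"
)

// TokenBucket is a minimal per-client limiter for expensive commands
// (e.g. the chanserv founder lookup, or LIST): each invocation consumes a
// token, and tokens refill at a fixed rate up to a cap.
type TokenBucket struct {
	mu       sync.Mutex
	tokens   float64
	capacity float64
	refill   float64 // tokens added per second
	last     time.Time
}

func NewTokenBucket(capacity, refillPerSecond float64) *TokenBucket {
	return &TokenBucket{
		tokens:   capacity,
		capacity: capacity,
		refill:   refillPerSecond,
		last:     time.Now(),
	}
}

// Allow reports whether another expensive operation may run now.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.refill
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```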
}
return nil
// TODO we need a better story about error handling for later
if err = cm.server.dstore.Set(datastore.TableChannelPurges, record.UUID, purgeBytes, time.Time{}); err != nil {
In general, it seems this is following the model of updating the in-memory state first, then the database. Couldn't this lead to race conditions that result in the in-memory state being different from the database?
The idea here is to follow the same pattern established with MarkDirty:
Line 191 in 1e6dee1:
// MarkDirty marks part (or all) of a channel's data as needing to be written back
Each update should be idempotent (it should always persist the latest data corresponding to its key, in full). So the only possible race condition is if two writes to the datastore are reordered relative to each other. This is prevented through the use of a semaphore, e.g. (*Channel).writebackLock, which ensures a linear sequence of operations [copy state, write to datastore, copy state, write to datastore, ...].
One thing I'm still not totally sure about is whether there would still be a race condition under relaxed consistency modes in Cassandra (e.g. QUORUM or LOCAL_QUORUM). Under those conditions, could an earlier write "win" the eventual consistency despite having happened-before a later write in Go?
In any case, purges don't need a semaphore of their own because they can't be updated, only deleted.
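To illustrate the [copy state, write to datastore] pattern being discussed, here is a simplified, hypothetical sketch; the real MarkDirty/writeback machinery in the branch is more involved, and the names and serialization below are stand-ins:

```go
package channel

import (
	"encoding/json"
	"sync"
)

// Channel is a stripped-down stand-in for the real type.
type Channel struct {
	stateMutex    sync.RWMutex // guards the in-memory state
	writebackLock sync.Mutex   // serializes datastore writes for this channel
	uuid          [16]byte
	name, topic   string
}

// persist performs one [copy state, write to datastore] step. The snapshot
// is taken after acquiring writebackLock, so writes for the same key form a
// linear sequence and each write is a full, idempotent copy of the latest
// state; a stale snapshot can never overwrite a newer one.
func (ch *Channel) persist(set func(uuid [16]byte, value []byte) error) error {
	ch.writebackLock.Lock()
	defer ch.writebackLock.Unlock()

	// copy state
	ch.stateMutex.RLock()
	snapshot := map[string]string{"name": ch.name, "topic": ch.topic}
	ch.stateMutex.RUnlock()

	// write to datastore
	value, err := json.Marshal(snapshot)
	if err != nil {
		return err
	}
	return set(ch.uuid, value)
}
```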
It sounds like in Cassandra, this is guaranteed under normal conditions by the linearity of the local commitlog:
https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html
but under hardware failure or extended partition, there may still be data loss:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsRepairNodesTOC.html
This is pretty complicated and touches a lot of things, so I'm interested in reviews, maybe from @ajaspers or @progval?
The core idea here is to refactor persistence to eventually support datastores other than buntdb. There are two problems:
- markDirty stuff, but there are some cases where we would still block, particularly for accounts where every login currently incurs a read from the datastore)

The new approach is best illustrated by the new, weak datastore API:
https://github.com/slingamn/ergo/blob/7ce06362764ee35629521eacc1fdee5405370efd/irc/datastore/datastore.go
which exposes key-value pairs. Each key has a UUID and is associated with a "table". There are four operations:
This branch refactors channels and channel purge records to use the new API.
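For context, the shape of that API is roughly as sketched below; this is approximate, and the authoritative names and signatures are in the linked datastore.go, not here:

```go
package datastore

import "time"

// Table identifies a logical grouping of records, e.g. channel
// registrations or channel purge records.
type Table uint16

// UUID is the per-record key: a 16-byte random identifier.
type UUID [16]byte

// KV is a single stored record.
type KV struct {
	UUID  UUID
	Value []byte
}

// Datastore is a deliberately weak key-value interface: keys are UUIDs
// scoped to a table, and values are opaque byte slices.
type Datastore interface {
	// GetAll lists every record in a table (e.g. to load all channels at startup).
	GetAll(table Table) ([]KV, error)
	// Get fetches a single record by UUID.
	Get(table Table, key UUID) ([]byte, error)
	// Set writes (or overwrites) a record, optionally with an expiration time.
	Set(table Table, key UUID, value []byte, expiration time.Time) error
	// Delete removes a record.
	Delete(table Table, key UUID) error
}
```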