Failover improvements #55

Conversation
The MySQLUtilities package uses the Connector/Python library, which has a namespace collision with the PyMySQL library unless we install them in separate virtualenvs (which will complicate and bloat the container more). manage.py can use Connector/Python with minimal changes, mostly just working around bugs in sending multiple statements in a single `execute()` call.
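The multi-statement workaround can be sketched roughly like this (a hypothetical helper for illustration, not the actual manage.py code): split the script into individual statements and run them one at a time instead of passing the whole string to a single `execute()` call.

```python
def split_statements(sql):
    """Split a multi-statement SQL string into individual statements.

    Connector/Python's cursor.execute() has had bugs with multi-statement
    calls, so we run statements one at a time. (Naive split: assumes no
    semicolons inside string literals.)
    """
    return [stmt.strip() for stmt in sql.split(';') if stmt.strip()]


def execute_each(cursor, sql):
    """Run each statement as its own execute() call."""
    for stmt in split_statements(sql):
        cursor.execute(stmt)


if __name__ == '__main__':
    script = """
    CREATE DATABASE IF NOT EXISTS test;
    USE test;
    CREATE TABLE IF NOT EXISTS t (id INT);
    """
    print(split_statements(script))
```

In the real code `cursor` would come from `mysql.connector`; the splitting logic is the part that works around the multi-statement bug.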
Also, remove extra timestamp from manage.py logs
The existing code base has serious testability problems because it grew organically around a lot of global state. This refactoring moves most of the logic into separate classes that we can configure via DI and splits the classes out to their own modules for readability.
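The dependency-injection shape described here can be sketched like this (class and method names are hypothetical, not the actual manage.py layout): each class takes its collaborators as constructor arguments, so unit tests can pass in fakes instead of touching global state.

```python
class Node:
    """A MySQL node whose collaborators are injected rather than global.

    `consul` and `mysql` are any objects exposing the small interface the
    node needs; unit tests pass in simple fakes. (Illustrative names,
    not the real manage.py classes.)
    """

    def __init__(self, consul, mysql):
        self.consul = consul
        self.mysql = mysql

    def is_primary(self):
        return self.consul.get_primary() == self.mysql.server_id


class FakeConsul:
    def __init__(self, primary_id):
        self.primary_id = primary_id

    def get_primary(self):
        return self.primary_id


class FakeMySQL:
    def __init__(self, server_id):
        self.server_id = server_id


# A unit test needs no running Consul or MySQL:
node = Node(FakeConsul(primary_id=1), FakeMySQL(server_id=1))
print(node.is_primary())  # True
```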
@misterbisson at this point I have a passing unit test suite that gives us solid coverage of our configuration loading. Starting next week I'll make sure this all works in the integration test suite and in hands-on testing.

After fixing a couple of dumb mistakes in my Python module layout, and a few real bugs, I now have successful failovers. The next step is to make sure the Shippable integration tests still work and to update the README with the design changes.
```json
{
    "name": "snapshot_check",
    "command": "python /usr/local/bin/manage.py snapshot_task",
    "frequency": "10s",
```
Marking to come back to: why the increase from 10s to 5m?
At the end of the first pass thru the health check for the primary we do an initial snapshot so we can bootstrap replication (at the end of run_as_primary). Before we moved the snapshot into its own task we also checked if we needed a snapshot at the end of the health check. This worked fine because we'd already completed the run_as_primary steps. But when we moved it into its own task it overlaps with the health check, which means it can start to run before we've completed run_as_primary and this creates a ton of logging noise and errors. By moving it to 5 min we know that the initial setup has been completed without having to recheck it every time.
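Concretely, the change amounts to bumping the task's `frequency` in the ContainerPilot config. A sketch of the updated entry (field names per the config excerpt quoted above; the 5m value per this discussion):

```json
{
    "name": "snapshot_check",
    "command": "python /usr/local/bin/manage.py snapshot_task",
    "frequency": "5m"
}
```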
Also: at some point I'd like to look into improving this so we do incremental backups rather than full snapshots.
At this point I've got both unit tests and integration tests working on local Docker. My tests aren't working on Triton right now, but that's because of a setup problem (something to do with my credentials in the test environment... digging into it) and not a problem with the application.
@misterbisson I've pushed a big update to the README in this branch, which describes the new failover process and also outlines some of the guarantees and limitations of our setup.
README.md
Outdated

> It's very important to note that the failover process described above prevents data corruption by ensuring that all replicas have the same set of transactions before continuing. But because MySQL replication is asynchronous, it cannot protect against data *loss*. It's entirely possible for the primary to fail without any replica having received its last transactions. This is an inherent limitation of MySQL asynchronous replication, and you must architect your application to take this into account.

> Also note that during failover, the MySQL cluster is unavailable for writes. Any client application should be using ContainerPilot or some other means to watch for changes to the `mysql-primary` service and halt writes until the failover is completed. Writes sent to a failed primary during failover will be lost!
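As an illustration of the "watch for changes and halt writes" advice, a client could gate writes on whether Consul currently reports a passing `mysql-primary` instance. This is a hedged sketch, not code from this repo: the `is_write_safe` helper is hypothetical, and the response shape is assumed from Consul's `/v1/health/service/<name>?passing` endpoint, which returns an empty list when no instance is passing.

```python
def is_write_safe(health_response):
    """Return True only if Consul reports a passing mysql-primary.

    `health_response` is the parsed JSON list from
    GET /v1/health/service/mysql-primary?passing; an empty list means
    no healthy primary, so the client should halt writes until the
    failover completes.
    """
    return len(health_response) > 0


# In a real client you'd fetch the health data over HTTP (e.g. with
# urllib or python-consul) and re-check it on each change to the
# service. Here we just simulate the two states:
print(is_write_safe([]))  # no healthy primary: halt writes
print(is_write_safe([{"Service": {"ID": "mysql-primary-1"}}]))
```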
> Writes sent to a failed primary during failover will be lost!
Clarify: the primary will already be removed from Consul at that point, right? There is clearly a race condition around the moment of failure, but once a primary is identified as failed, Consul won't report it as a primary anymore.
I think you're right to raise the warning here, perhaps I'm being defensive about making sure we know where the problem is.
This is looking solid all around. I didn't see any changes here that would affect the configuration in https://github.com/autopilotpattern/wordpress. Am I missing anything? Is this
Configuration should be the same. That tag is on the Hub and it sounds like a swell idea to test WP with it. Still trying to figure out why
Passing integration test suite on Triton:
Added a section to the README about upgrades and also added a table of contents to the top of the README. |
🏡 🚶 |
This PR changes the failover mechanism to be coordinated by `mysqlrpladmin failover`, which ensures that the transaction state is properly synced for the new master (at the expense of write availability during failover, which is part of our design anyways).

In order to make this project sanely testable, this work has included a refactoring to split the 1000+ lines of code into modules and classes that can have dependencies injected.

A new unit test suite includes ~~using `sys.settrace` hooks to single-step thru simulated separate processes~~ (turns out this was totally unnecessary!).

@misterbisson as an FYI: the work isn't yet complete (note this has been rebased a bunch of times, so the commit dates are fubar).
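For reference, the `sys.settrace` approach mentioned above looks roughly like this (a generic illustration of the stdlib hook, not the actual test suite): install a trace function and record each line event as the traced code runs, which is the building block for single-stepping through code under test.

```python
import sys


def run_traced(func):
    """Run func under sys.settrace, recording each executed line number."""
    lines = []

    def tracer(frame, event, arg):
        if event == 'line':
            lines.append(frame.f_lineno)
        return tracer  # keep tracing inside this frame

    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(None)
    return lines


def sample():
    a = 1
    b = 2
    return a + b


# Prints the line numbers of the three statements in sample():
print(run_traced(sample))
```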