Conversation

@tgross
Contributor

@tgross tgross commented Sep 8, 2016

This PR changes the failover mechanism to be coordinated by `mysqlrpladmin failover`, which ensures that the transaction state is properly synced for the new master (at the expense of write availability during failover, which is part of our design anyway).
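For context, a minimal sketch of how a helper like manage.py might shell out to `mysqlrpladmin failover`; the option values, credentials, and connection strings below are illustrative assumptions, not the repo's actual invocation:

```python
import subprocess

def run_failover(candidate_ip, replica_ips, repl_user='repl', repl_pass='<redacted>'):
    """Ask mysqlrpladmin to elect and sync a new master; illustrative only."""
    slaves = ','.join('{}:{}@{}:3306'.format(repl_user, repl_pass, ip)
                      for ip in replica_ips)
    cmd = [
        'mysqlrpladmin',
        '--slaves={}'.format(slaves),
        '--candidates={}:{}@{}:3306'.format(repl_user, repl_pass, candidate_ip),
        'failover',
    ]
    # a non-zero exit means no replica could be promoted with a consistent
    # transaction state, so the caller should surface that as a failure
    return subprocess.check_output(cmd)
```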

In order to make this project sanely testable, this work includes refactoring the 1000+ lines of code into modules and classes that can have their dependencies injected. A new unit test suite uses sys.settrace hooks to single-step through simulated separate processes. (Turns out this was totally unnecessary!)

@misterbisson flagging this as an FYI; the work isn't yet complete (note this has been rebased a bunch of times, so the commit dates are fubar).

The MySQLUtilities package uses the Connector/Python library, which has a
namespace collision with the PyMySQL library unless we install them in
separate virtualenvs (which will complicate and bloat the container more).

manage.py can use Connector/Python with minimal changes, mostly just working
around bugs in sending multiple statements in a single `execute()` call.
Also, remove the extra timestamp from manage.py logs.
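A minimal sketch of the kind of `execute()` workaround mentioned above, splitting a multi-statement string into individual calls with Connector/Python (the connection arguments and SQL are placeholders, not the repo's code):

```python
import mysql.connector

def execute_many_statements(conn_args, sql):
    """Run each ';'-separated statement in its own execute() call."""
    conn = mysql.connector.connect(**conn_args)
    try:
        cur = conn.cursor()
        # naive split on ';' -- fine for simple DDL/DML without literal semicolons
        for statement in (s.strip() for s in sql.split(';')):
            if statement:
                cur.execute(statement)
        conn.commit()
    finally:
        conn.close()
```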
The existing code base has serious testability problems because it grew
organically around a lot of global state. This refactoring moves most
of the logic into separate classes that we can configure via DI and splits
the classes out to their own modules for readability.
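A minimal sketch of what that shape looks like; the class and collaborator names below are hypothetical, not the repo's actual modules. Each class receives its collaborators instead of reaching for globals, so tests can pass in fakes:

```python
class MySQLNode(object):
    """Wraps one node's state; collaborators are injected so tests can use fakes."""

    def __init__(self, consul_client, cmd_runner, config):
        self.consul = consul_client   # service discovery / session locking
        self.cmd = cmd_runner         # runs mysql/mysqlrpladmin commands
        self.config = config          # parsed environment configuration

    def is_primary(self):
        # ask Consul rather than relying on module-level global state
        return self.consul.get_primary_node() == self.config.hostname
```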
@tgross
Contributor Author

tgross commented Sep 9, 2016

----------------------------------------------------------------------
Ran 37 tests in 7.137s

OK

@misterbisson at this point I have a passing unit test suite that gives us solid coverage of our configuration loading, pre_start, health, on_change, and snapshot_task, including a bunch of different failover scenarios. The idea behind this test suite is that it tests the algorithm we're using without worrying about whether the underlying MySQL commands succeed; that part we'll leave to the integration tests.
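As a flavor of what this looks like (the function and helper names here are hypothetical, not the suite's actual code), the MySQL calls are faked out so only the decision logic is exercised:

```python
import unittest
from unittest import mock


def elect_new_primary(replicas, get_txn_count):
    """Hypothetical stand-in for the election logic: prefer the replica that
    has applied the most transactions (get_txn_count is injected)."""
    return max(replicas, key=get_txn_count)


class FailoverDecisionTest(unittest.TestCase):

    def test_replica_with_most_transactions_wins(self):
        # no MySQL server needed; the query result is faked out entirely
        counts = {'mysql-2': 10, 'mysql-3': 12}
        fake_count = mock.Mock(side_effect=counts.get)
        self.assertEqual(
            elect_new_primary(['mysql-2', 'mysql-3'], fake_count), 'mysql-3')


if __name__ == '__main__':
    unittest.main()
```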

Starting next week I'll make sure this all works in the integration test suite and hands-on testing.

@tgross tgross changed the title from "[WIP] Failover improvements" to "Failover improvements" on Sep 12, 2016
@tgross
Contributor Author

tgross commented Sep 12, 2016

After fixing a couple dumb mistakes in my Python module layout, and a few real bugs, I now have successful failovers. The next step is to make sure the Shippable integration tests still work and update the README with the design changes.

{
"name": "snapshot_check",
"command": "python /usr/local/bin/manage.py snapshot_task",
"frequency": "10s",
Contributor

Marking to come back to: why the increase from 10s to 5m?

Contributor Author

At the end of the first pass through the health check for the primary we do an initial snapshot so we can bootstrap replication (at the end of run_as_primary). Before we moved the snapshot into its own task, we also checked whether we needed a snapshot at the end of the health check. This worked fine because we'd already completed the run_as_primary steps. But now that the snapshot is its own task it overlaps with the health check, which means it can start to run before we've completed run_as_primary, and that creates a ton of logging noise and errors. By moving it to 5 minutes we know the initial setup has completed without having to recheck it every time.

Contributor Author

Also: at some point I'd like to look into improving this so we do incremental backups rather than full snapshots.

@tgross
Contributor Author

tgross commented Sep 14, 2016

At this point I've got both unit tests and integration tests working on local Docker:

# make unit-test
.....................................
----------------------------------------------------------------------
Ran 37 tests in 1.100s

OK
# make test-local-docker
----------------------------------------------------------------------
MySQLStackTest.test_replication_and_failover
----------------------------------------------------------------------
elapsed  | task
1.817438 | docker-compose -f local-compose.yml -p my up -d
0.274307 | docker-compose -f local-compose.yml -p my ps
0.012382 | docker inspect my_consul_1
58.39401 | wait_for_service: mysql-primary 1
0.262109 | docker-compose -f local-compose.yml -p my ps -q mysql
0.138656 | docker exec d7f17f4770702404c912e2cb4edd0b5870792b453ae299040dd2dc0567e5b528 ip -o ad
0.405336 | assert_consul_correctness:
1.370994 | docker-compose -f local-compose.yml -p my scale mysql=3
26.12946 | wait_for_service: mysql 2
0.474732 | docker-compose -f local-compose.yml -p my ps -q mysql
0.186038 | docker exec d7f17f4770702404c912e2cb4edd0b5870792b453ae299040dd2dc0567e5b528 ip -o ad
0.076802 | docker exec 284f4a18a53fd665eabfe9f9b05c626527cba31d3c4fdc6221adb8886c443044 ip -o ad
0.072800 | docker exec 6383f672f14e586cde650932f3778e4fa76b94446aec43237e779a4a28563d58 ip -o ad
0.815884 | assert_consul_correctness:
0.242205 | docker exec my_mysql_1 mysql -u dbuser -p<redacted> --vertical -e CREATE TABLE tbl1 (
0.124025 | docker exec my_mysql_1 mysql -u dbuser -p<redacted> --vertical -e INSERT INTO tbl1 (f
0.106777 | docker exec my_mysql_1 mysql -u dbuser -p<redacted> --vertical -e INSERT INTO tbl1 (f
0.103360 | docker exec 284f4a18a53f mysql -u dbuser -p<redacted> --vertical -e SELECT * FROM tbl
0.115371 | docker exec 6383f672f14e mysql -u dbuser -p<redacted> --vertical -e SELECT * FROM tbl
3.884255 | docker stop my_mysql_1
9.044485 | wait_for_service: mysql-primary 1
0.299285 | docker-compose -f local-compose.yml -p my ps -q mysql
0.013245 | docker exec d7f17f4770702404c912e2cb4edd0b5870792b453ae299040dd2dc0567e5b528 ip -o ad
0.104511 | docker exec 284f4a18a53fd665eabfe9f9b05c626527cba31d3c4fdc6221adb8886c443044 ip -o ad
0.077443 | docker exec 6383f672f14e586cde650932f3778e4fa76b94446aec43237e779a4a28563d58 ip -o ad
0.499566 | assert_consul_correctness:
0.002784 | wait_for_service: mysql 1
0.312183 | docker-compose -f local-compose.yml -p my ps -q mysql
0.013305 | docker exec d7f17f4770702404c912e2cb4edd0b5870792b453ae299040dd2dc0567e5b528 ip -o ad
0.103166 | docker exec 284f4a18a53fd665eabfe9f9b05c626527cba31d3c4fdc6221adb8886c443044 ip -o ad
0.120489 | docker exec 6383f672f14e586cde650932f3778e4fa76b94446aec43237e779a4a28563d58 ip -o ad
0.553040 | assert_consul_correctness:
0.116433 | docker exec 284f4a18a53f mysql -u dbuser -p<redacted> --vertical -e INSERT INTO tbl1
0.073758 | docker exec 6383f672f14e mysql -u dbuser -p<redacted> --vertical -e SELECT * FROM tbl
.
----------------------------------------------------------------------
Ran 1 test in 108.592s

OK

My tests aren't working on Triton right now but that's because of a setup problem (something to do with my credentials in the test environment... digging into it) and not a problem with the application.

@tgross
Contributor Author

tgross commented Sep 14, 2016

@misterbisson I've pushed a big update to the README in this branch, which describes the new failover process and also outlines some of the guarantees and limitations of our setup.

README.md Outdated

It's very important to note that the failover process described above prevents data corruption by ensuring that all replicas have the same set of transactions before continuing. But because MySQL replication is asynchronous it cannot protect against data *loss*. It's entirely possible for the primary to fail without any replica having received its last transactions. This is an inherent limitation of MySQL asynchronous replication and you must architect your application to take this into account.

Also note that during failover, the MySQL cluster is unavailable for writes. Any client application should be using ContainerPilot or some other means to watch for changes to the `mysql-primary` service and halt writes until the failover is completed. Writes sent to a failed primary during failover will be lost!
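To illustrate the "watch for changes and halt writes" suggestion above, here's a minimal sketch that polls Consul's standard `/v1/health/service` API for the `mysql-primary` service and gates writes accordingly; everything apart from that endpoint is an assumption, not code from this repo:

```python
import time
import requests

CONSUL = 'http://consul:8500'  # assumed Consul address

def primary_is_healthy():
    """True if Consul reports at least one passing mysql-primary instance."""
    r = requests.get(CONSUL + '/v1/health/service/mysql-primary',
                     params={'passing': 'true'}, timeout=2)
    r.raise_for_status()
    return len(r.json()) > 0

def write_with_failover_guard(do_write):
    """Block (rather than lose) writes while a failover is in progress."""
    while not primary_is_healthy():
        time.sleep(1)
    do_write()
```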
Contributor

Writes sent to a failed primary during failover will be lost!

Clarify: the primary will already be removed from Consul at that point, right? There is clearly a race condition around the moment of failure, but once a primary is identified as failed, Consul won't report it as a primary anymore.

I think you're right to raise the warning here; perhaps I'm being defensive about making sure we know where the problem is.

@misterbisson
Contributor

This is looking solid all around.

I didn't see any changes here that would affect the configuration in https://github.com/autopilotpattern/wordpress. Am I missing anything? Is this gh39_use_standby-e8972e5 in https://hub.docker.com/r/autopilotpattern/mysql/tags/? If so, I should test it in the context of the WP implementation, yes?

@tgross
Contributor Author

tgross commented Sep 14, 2016

I didn't see any changes here that would affect the configuration in https://github.com/autopilotpattern/wordpress. Am I missing anything?

Configuration should be the same.

Is this gh39_use_standby-e8972e5 in https://hub.docker.com/r/autopilotpattern/mysql/tags/? If so, I should test it in the context of the WP implementation, yes?

That tag is on the Hub and it sounds like a swell idea to test WP with it.

Still trying to figure out why make test-local-triton (which runs the test container locally but runs MySQL on Triton in us-sw-1) is giving me credentials-related errors.

@tgross
Contributor Author

tgross commented Sep 14, 2016

Passing integration test suite on Triton:

----------------------------------------------------------------------
MySQLStackTest.test_replication_and_failover
----------------------------------------------------------------------
elapsed  | task
35.53385 | docker-compose -f docker-compose.yml -p my up -d
2.447364 | docker-compose -f docker-compose.yml -p my ps
0.940669 | docker inspect my_consul_1
65.88641 | wait_for_service: mysql-primary 1
3.004218 | docker-compose -f docker-compose.yml -p my ps -q mysql
3.494657 | docker exec c84516de92f5403ba0da4c733ca949 ip -o addr
6.676664 | assert_consul_correctness:
39.38328 | docker-compose -f docker-compose.yml -p my scale mysql=3
10.98797 | wait_for_service: mysql 2
4.712621 | docker-compose -f docker-compose.yml -p my ps -q mysql
3.667386 | docker exec c84516de92f5403ba0da4c733ca949 ip -o addr
3.197954 | docker exec 92c023506b9d4000863c3aef4ba534 ip -o addr
3.228828 | docker exec f5e2ea5642354ba6a3114f13d9007a ip -o addr
14.97571 | assert_consul_correctness:
3.051047 | docker exec my_mysql_1 mysql -u dbuser -p<redacted> --vertical -e CREATE TABLE tbl1 (field1 INT, demodb
2.951622 | docker exec my_mysql_1 mysql -u dbuser -p<redacted> --vertical -e INSERT INTO tbl1 (field1, fiel demodb
3.046466 | docker exec my_mysql_1 mysql -u dbuser -p<redacted> --vertical -e INSERT INTO tbl1 (field1, fiel demodb
3.009918 | docker exec 92c023506b9d mysql -u dbuser -p<redacted> --vertical -e SELECT * FROM tbl1 WHERE `fiel demodb
3.151301 | docker exec f5e2ea564235 mysql -u dbuser -p<redacted> --vertical -e SELECT * FROM tbl1 WHERE `fiel demodb
10.99436 | docker stop my_mysql_1
5.538805 | wait_for_service: mysql-primary 1
4.406038 | docker-compose -f docker-compose.yml -p my ps -q mysql
1.983925 | docker exec c84516de92f5403ba0da4c733ca949 ip -o addr
3.001627 | docker exec 92c023506b9d4000863c3aef4ba534 ip -o addr
3.110972 | docker exec f5e2ea5642354ba6a3114f13d9007a ip -o addr
12.68047 | assert_consul_correctness:
0.089648 | wait_for_service: mysql 1
4.462744 | docker-compose -f docker-compose.yml -p my ps -q mysql
6.202129 | docker exec c84516de92f5403ba0da4c733ca949 ip -o addr
3.279309 | docker exec 92c023506b9d4000863c3aef4ba534 ip -o addr
3.434999 | docker exec f5e2ea5642354ba6a3114f13d9007a ip -o addr
17.55658 | assert_consul_correctness:
3.603358 | docker exec f5e2ea564235 mysql -u dbuser -p<redacted> --vertical -e INSERT INTO tbl1 (field1, fiel demodb
3.511389 | docker exec 92c023506b9d mysql -u dbuser -p<redacted> --vertical -e SELECT * FROM tbl1 WHERE `fiel demodb
17.82098 | docker-compose -f docker-compose.yml -p my stop
15.41207 | docker-compose -f docker-compose.yml -p my rm -f
.
----------------------------------------------------------------------
Ran 1 test in 282.843s

OK

@misterbisson
Contributor

gh39_use_standby-e8972e5 works for https://github.com/autopilotpattern/wordpress, though I wasn't able to upgrade a running MySQL cluster to the new version. We probably need an upgrade note in the README making that clear.

@tgross
Contributor Author

tgross commented Sep 15, 2016

Added a section to the README about upgrades and also added a table of contents to the top of the README.

@misterbisson
Contributor

🏡 🚶

@tgross tgross merged commit 25a0b14 into autopilotpattern:master Sep 15, 2016