I was asked to write a blog post about MySQL High Availability at Yahoo, particularly for writes. Our standard practice is not particularly high-tech, but we’ve been using it for over 4 years now and it has become a company-wide standard with a few exceptions.
Let me start by saying that at Yahoo! we consider a datacenter as a Single Point of Failure (SPoF). We build and manage many of our own datacenters, and we still don’t assume they are invulnerable. How many people can attest to the fact that however to configure your racks, how many redundant switches, power supplies, drives, etc. you buy, if your leased datacenter has power or network issues, you are at their mercy.
We are assuming out of the gate that we have to have a presence in at least 2 datacenters. This concept is something we call BCP or Business Continuity Planning: we want our business to continue regardless of some event that would completely destroy a datacenter. So, we require that any and all components within Yahoo are run redundantly from at least 2 datacenters that are not in the same geographical region. That way if an earthquake, hurricane, or some other natural (or otherwise) disaster affects one datacenter, the other one should be far enough away to not have been affected.
One big advantage of this way of thinking is that this level of redundancy also nicely satisfies High Availability requirements. We can take downtime in one datacenter for software upgrades or other maintenance, and still happily serve out of the other because we’ve already planned ahead and already built our BCP systems.
Because of the BCP philosophy, we immediately discount most “High availability” solutions for MySQL that people tend to gravitate towards. MySQL Cluster has its uses, but you cannot form a cluster across a WAN link [that performs well]. DRDB cannot handle WAN replication [that performs well, and is synchronous].
That isn’t to say that WAN replication can’t work with these technologies, it’s just that they aren’t built to make that problem any better than standard MySQL replication does. Ultimately the problem to solve is that writing a piece of data to two redundant locations simultaneously comes at the expense of latency. Until someone comes up with a solution to write data faster than the speed of light, we are stuck with this problem, regardless of the technology solution we try.
So, we stick with good old MySQL replication. It’s still asynchronous, but lots of people already understand it, and it does the job.
Now, what makes our solution different? Not much. The layout is this: two master databases, one in each of our two colocations. These masters replicate from each other, but we would never have more than two masters in this replication loop for the same reason we don’t use token ring networks today: one master outage would break replication in a chain of size > 2. Our slaves replicate from one of the two masters, often half of the slaves in a given colocation replicate from one of the masters, and half from the other master.
Two masters by themselves don’t present a solid solution. If our application can write to either master, we open ourselves up to autoincrement/unique key collisions. Auto_increment_offset does not fix this problem, it just band-aids it; it assumes the auto_increment key order doesn’t matter, that it’s ok for the tables on each master to not be identical, and it doesn’t address any other primary/unique key collisions.
But, key collisions (that are quite visible when they happen because replication breaks) are only the tip of the iceberg. Writing to multiple masters simultaneously opens you up race conditions of the invisible and nasty types. If you have master A and master B in circular replication and the application makes a change to record X on both A and B before the replication for that change makes it to the other, there is an issue. Imagine that the application changes X to 1 on master A and X to 2 on master B. After replication catches up, A has X=2 and B has X=1. Whoops. Replication hasn’t broken, you don’t know it happened, but your two masters (and presumably your slaves) are now inconsistent. Can it be designed/coded around? Sure, but we can’t assume that it will work for a generic solution.
So, because of this problem, we strongly discourage the use of dual-writers. Can they be useful and not a problem in some cases? Yes, but that is the exception, not the rule.
So, we are left with dual-master, single-writer. We only want our application to ever write to one of our two dual-masters at a given time; we set all our databases read_only except the active master. By doing this we gain consistency: we know if a given record is locked on the active master, no other part of the application can change it at the same time. Most programmers assume consistency when writing code for any RDBMS, so most code can’t cope without it.
The trade off, of course, is latency. If our application is similarly distributed like our databases, some of our application servers will have to cross a WAN link to write to our database. This is a problem, of course, but one that is acceptable for many web applications; Many web applications serve many more reads than writes. If those writes are slightly slower, not the biggest deal. There are also a variety of handy web tricks to hide write latency, including ajax progress bars, asynchronous/batched writes, etc. Backend daemons that need to connect to the active master typically fail over with the master; so when the active master changes colo, that daemon needs to as well.
If we need to change masters in a clean way (say, for maintenance), the process is clean and can be done with no data loss and with a brief writer outage:
Set read_only=ON on the active master. This stops writes briefly.
Wait for replication to catch up to our secondary master
Set read_only=OFF on the secondary master — writes resume
Flush connections from the old master: the application is redirected to the new master by a system we use internally.
As long as our application can handle the WAN latency, it shouldn’t matter which of the two colos have the active master.
On a failure of the current active master and if we want writes to be available immediately, we must take the risk of a data loss. We are running standard MySQL replication that is asynchronous. So, if we need to make the secondary master active while the primary is down, we cannot know if replication was caught up before the primary went down. We do assume, however, that it was close (we accomplish this by monitoring our replication). At this point our engineers have a choice: either take the (potential) data loss and bring the other master up (set read_only=OFF), or take the outage (which is only for writing to the DB in this scenario) and hope the primary comes back soon. The policy for deciding this is left up to the individual teams and their business requirements.
Once we pull the failover trigger, we are accepting a possible data loss. If the downed master never recovers, our course is clear: we simply rebuild from the remaining master and its slaves. However, if the downed master does come back online (as it usually does) and we allow it to resume replicating, we are accepting that we may have introduced the race conditions or unique key collisions (described above) into our system; we’ve effectively had dual-writers running for a brief window. The safest course of action would be to never allow the downed master back into replication, but in practice, that tends not to be practical because of the expense of recloning MySQL slaves, particularly across WAN links.
Now, what about automatic failover? This is a requirement for some systems: availability over consistency, but there are a few challenges with automated failover:
An automated failover solution that does not use some unbiased system to determine which master should be active will be susceptible to split brain issues. Imagine our two colocations each fighting to become the active master because they simply cannot communicate with each other; without some outside authority guaranteeing a given colo can make itself active, we would end up with two active colos (and two independent databases!).
An automated failover solution would need to consider a system policy for taking availability over data consistency and vice versa.
An automated failover solution might be inclined to cause failover flapping, which would be problematic.
My experience has been that this system is the most adaptable to existing mysql databases, using existing storage engines and an existing technology to handle replication. It’s likely unfashionably boring, conservative, and stable; but it has worked and continues to work for us.
In the future, I hope to be able to use technologies that improve on the standard MySQL replication by being able to repoint slaves automatically, to find which slave has the most up to date data from a downed master, and even mitigate a lot of data-loss with support for synchronous writes at least within a given colo. There are some candidates out there that are close.
Comments on this page are now closed.
For information on exhibition and sponsorship opportunities at the conference, contact Yvonne Romaine at email@example.com
For media partnerships, contact mediapartners@ oreilly.com
For media-related inquiries, contact Maureen Jennings at firstname.lastname@example.org
To stay abreast of conference news and to receive email notification when registration opens, please sign up for the O'Reilly MySQL Conference Bulletin (login required).
View a complete list of O'Reilly MySQL Conference Contacts