Read Repair – a solution to eventual consistency?
I got some fairly harsh feedback on my last post, “Lies, Damned Lies and ‘Eventual Consistency.’” It’s clear a lot of people were thrown by the assertion:
There is no such thing as eventual consistency. There is only ‘Immediate Consistency’ and ‘Wrong’.
In my job as a solutions architect I look at these issues from the point of view of business impact, not from a pure developer perspective, so perhaps that statement was a bit too inflexible. There are plenty of valid eventual consistency (EC) use cases, of course. Many of them work because, for the use case in question, every possible value of ‘wrong’ is acceptable to some degree. But that doesn’t change the fact that the data was wrong. Sometimes you can get away with this because as long as it’s only ‘wrongish,’ the value – while incorrect – isn’t useless.
But in scenarios where you have to make decisions or allocate finite resources, EC can introduce serious problems. Dealing with wrong answers takes additional work, and that work can mitigate and minimize the resulting issues but not eliminate them.
My background is in telco transaction processing. It’s a fairly unforgiving world where if you get things wrong, you can plausibly lose millions of dollars an hour. It’s also an environment in which losing customer data isn’t any more acceptable than giving the wrong answer when asked if you should connect a call or not. As a consequence, my worldview makes me very uncomfortable around EC. Which is kind of funny because the CAP theorem implies that EC is unavoidable when working with systems that involve multiple data centers that need to be highly available. Since one of the main design goals of a distributed system is survivability – i.e., loss of a data center doesn’t mean loss of the system – link outages have to be regarded as a ‘normal’ event, and some form of eventual consistency is required.
In my previous job I designed distributed charging (i.e., billing) systems for the telco industry. Not only did we have to cope with the CAP theorem’s ramifications, we also worked in an environment where the required SLA of 50ms was less than a round trip between data centers on opposite sides of the US. This meant we had to make customer- and revenue-impacting decisions without being 100% up-to-date on what had happened at the other three to five data centers. I have three patents filed on how to make this work as well as the laws of physics will permit.
In such an eventual consistency environment, you have to write application- and use case-specific code to mitigate the commercial and practical implications of data being ‘wrong’. Because the side effects are already out in the wild by the time you attempt to fix them, you can’t rely on a vendor’s algorithm (such as ‘last write wins’), since it will only fix the problem inside the database. The more time and energy you devote to code to fix this, the less damaging the commercial impact of ‘wrongness’ will be, but you will never eliminate it entirely.
In addition to writing lots of EC mitigation code in your distributed application, what else can you do to mitigate the problem?
“Read Repair” was suggested as a solution by readers of my last post, but it isn’t a panacea. By the time you do ‘read repair,’ somebody, somewhere, may already have made a decision based on data you are about to change retroactively. This is the key point of my last blog post. Imagine we’re talking about a scenario in which two people were allocated seat 3A on a flight. In this case two updates claimed the same seat, and both transactions reported to their respective owners that 3A was theirs. As a consequence, there is now a heated argument between two CxOs during the boarding process, the flight attendants are trying to figure out which frequent flyer to antagonize by evicting him or her from 3A into coach, and up front the pilots are worrying that if they don’t sort this out quickly they will miss their takeoff slot, passengers will miss their connections, and a chain of chaos will spread all over the US as the day progresses. This is the ‘butterfly effect’ I referred to.
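To make the seat 3A scenario concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the Replica class, the site names, the timestamps, and the ‘last write wins’ rule used by read_repair are assumptions, not any particular vendor’s implementation. It shows two sites each accepting a claim on 3A while out of contact, and a later read repair making the database consistent without undoing either confirmation.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One site's copy of the seat map (hypothetical, for illustration)."""
    name: str
    seats: dict = field(default_factory=dict)  # seat -> (timestamp, passenger)

    def claim(self, seat, passenger, ts):
        # Accept the claim if THIS replica has not yet seen the seat taken.
        if seat in self.seats:
            return False
        self.seats[seat] = (ts, passenger)
        return True


def read_repair(seat, replicas):
    """Assumed 'last write wins' repair: keep the newest version everywhere."""
    versions = [r.seats[seat] for r in replicas if seat in r.seats]
    winner = max(versions)  # tuples compare by timestamp first
    for r in replicas:
        r.seats[seat] = winner
    return winner[1]


east, west = Replica("east"), Replica("west")

# The link between sites is down, so neither replica sees the other's write.
confirmed = []  # passengers who were told "3A is yours"
if east.claim("3A", "alice", ts=100):
    confirmed.append("alice")
if west.claim("3A", "bob", ts=101):
    confirmed.append("bob")

# Later, a read triggers repair and the replicas converge on one owner.
owner = read_repair("3A", [east, west])
```

After the repair both replicas agree on a single owner, but the confirmed list still holds two names. The database is consistent again; the two passengers holding confirmations for 3A are not.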
Instead we should probably take inspiration from the British Army Journal’s advice: “The best defense against the atom bomb is not to be there when it goes off” – i.e., seek to minimize the frequency of scenarios where EC is a factor.
You may be able to avoid conflicts by giving each set of data a natural ‘home’ and directing all normal traffic for that set to the same site. This means that changes will always be propagated in one direction only until that site fails. Note the system is still Active-Active, but we consciously route requests using some kind of distribution algorithm. Under ideal circumstances, homing the data can reduce incidents of ‘wrongness’ by about two orders of magnitude. Given each such incident has commercial implications, this is highly desirable.
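One way to picture ‘homing’ is the sketch below, again in Python. It is only an illustration under stated assumptions: the site names, the use of a SHA-256 hash as the distribution algorithm, and the ring-order failover rule are all hypothetical choices, and a real deployment might route on geography or customer attributes instead. The point is that every request for a given key goes to the same site while that site is healthy, and only moves elsewhere on failure.

```python
import hashlib

SITES = ["east", "central", "west"]  # hypothetical data center names


def home_site(key, sites=SITES):
    """Map a key to its natural 'home' site using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return sites[int.from_bytes(digest[:8], "big") % len(sites)]


def route(key, healthy, sites=SITES):
    """Send the request to the key's home site; fail over only if it is down.

    `healthy` is the set of sites currently believed reachable. On failure
    we walk the site list in order, starting from the home site.
    """
    start = sites.index(home_site(key, sites))
    for offset in range(len(sites)):
        candidate = sites[(start + offset) % len(sites)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy site available")
```

With all sites up, route("customer-42", set(SITES)) returns the same site on every call, so updates for that customer originate in one place. Conflicts then become possible only during the window around a failover, rather than on every concurrent request.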
The other approach to minimizing EC conflicts is to realize that EC only becomes inevitable when the different nodes of the system are many milliseconds apart. Provided you can create a reasonable degree of affinity between your data and your data centers, you can make sure EC conflicts are minimized. One of the advantages of VoltDB is that it allows you to have a very large multi-node cluster in a single place, but with no possibility of consistency issues – because VoltDB hides the multiple copies from the calling program and maintains a clear hierarchy among redundant copies. This means that when using VoltDB, EC is never an issue within a data center, and this can drastically reduce the total number of EC events.