Quick Jump Directory
- What is K-Safety?
- What is a read-only transaction and what is a write or read-write transaction?
- What is a committed transaction?
- What does ‘all live replicas’ mean?
- How can VoltDB fail from two to just one while guarding against split brain?
- When is a transaction committed?
- How fast are cross-partition transactions?
- How does VoltDB handle partial network partitions?
- How does VoltDB handle bit flip errors on disk?
- How does VoltDB handle clock skew in its agreement algorithms?
What is K-Safety?
VoltDB allows users to configure redundancy within a cluster. We specify the level of redundancy using a k-safety value. The value of k is a lower bound on the number of node failures the cluster can survive without losing availability. A k=2 cluster can lose any two nodes and continue to run, and may survive more failures depending on which nodes fail. To support this, VoltDB must keep k+1 copies of all data in the cluster.
Note that VoltDB divides user data into partitions, and typically there are many more partitions than nodes. A typical cluster with 5 nodes and k=1 may have 20 logical partitions of data, with two copies of each partition (k+1), for 40 partition replicas in total. So each of the 5 nodes holds 8 partition replicas. Note that no two nodes hold exactly the same set of partitions, but each partition has a replica on a different node somewhere in the cluster.
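The arithmetic in that example can be checked with a few lines of Python. This is only a worked calculation using the numbers above; the function name is ours and is not part of VoltDB.

```python
def replicas_per_node(nodes: int, partitions: int, k: int) -> float:
    """Average number of partition replicas hosted by each node."""
    copies = k + 1                       # k-safety keeps k+1 copies of all data
    total_replicas = partitions * copies
    return total_replicas / nodes


# The example from the text: 5 nodes, 20 logical partitions, k=1.
print(replicas_per_node(nodes=5, partitions=20, k=1))  # 8.0
```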
What is a read-only transaction and what is a write or read-write transaction?
A read-only transaction cannot modify data. A read-write transaction may modify data, but certainly doesn’t have to. For example, “update table set value=1 where key=?” won’t change any values if the value already equals 1 or if the key in question doesn’t exist.
VoltDB requires all SQL to be known before the transaction is run, barring some parameterization flexibility. By knowing the SQL in advance, VoltDB can statically analyze transactions and label them as read-only, or read-write. Note that while VoltDB procedures can involve lots of user code, VoltDB only allows the underlying relational data to be changed through SQL statements.
VoltDB uses different paths for read-only transactions and read-write transactions to optimize performance and sometimes bandwidth. Read-only transactions never require persistence to disk; they don’t need to be replicated to other clusters; and they don’t need to be replayed or rolled back after a crash.
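To illustrate why knowing the SQL in advance makes this labeling possible, here is a deliberately naive sketch that classifies a procedure by the leading keyword of each statement. It is not how VoltDB does it: we assume the real analysis works on planned statements rather than on text, and the verb lists and function names below are our own illustration.

```python
import re

# Illustrative verb lists; an assumption, not VoltDB's actual rules.
READ_ONLY_VERBS = {"SELECT"}
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "UPSERT"}


def classify_statement(sql: str) -> str:
    """Label one SQL statement by its leading keyword."""
    match = re.match(r"\s*([A-Za-z]+)", sql)
    if match is None:
        raise ValueError(f"cannot classify statement: {sql!r}")
    verb = match.group(1).upper()
    if verb in READ_ONLY_VERBS:
        return "read-only"
    if verb in WRITE_VERBS:
        return "read-write"
    raise ValueError(f"unknown statement type: {verb}")


def classify_procedure(statements: list[str]) -> str:
    """A procedure is read-write if any of its statements may modify data."""
    kinds = {classify_statement(sql) for sql in statements}
    return "read-write" if "read-write" in kinds else "read-only"


print(classify_procedure(["select * from t where key = ?"]))
print(classify_procedure(["select * from t where key = ?",
                          "update t set value = 1 where key = ?"]))
```

Note that the second procedure is labeled read-write even though, as described above, the update may not change any rows at run time. The label is decided statically, before the transaction runs.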
What is a committed transaction?
As a fully ACID, distributed SQL database, VoltDB must either commit or roll back 100% of all transactions. There can be no partial applications, which means the changes made by all SQL statements must be complete and visible at all live replicas, or none of the changes can be visible.
A transaction commits once VoltDB confirms it has successfully completed at all replicas of all involved partitions. If synchronous durability is used, it must also be recorded on disk at all replicas. Once a transaction is confirmed at all replicas, VoltDB sends a confirmation message to the calling client.
What does ‘all live replicas’ mean?
VoltDB must confirm that a transaction has completed at all replicas of a given partition. If a replica is unable to confirm it completed a transaction within a user-specified time, VoltDB’s cluster membership consensus kicks in and one or more nodes are removed from the cluster. The removed nodes are not allowed to do any more work once they are removed (Jepsen testing did find some issues with this in pre-6.4 versions of VoltDB).
The result is that all replicas move in lockstep. They do the same transactions, in the same order, as fast as they can. If they fall out of lockstep they are actively ejected from the cluster.
Note this is different from systems that require a quorum of replicas to do a write. There are benefits and tradeoffs to the VoltDB approach that are intentional. First, lockstep is simpler to reason about, and it reduces a lot of coordination overhead. This makes VoltDB faster for users’ apps, and the simplicity makes it easier for us to develop and improve VoltDB as well.
Next, if there is only one replica left, VoltDB will allow writes to that single replica. Most distributed systems require at least three copies of all data when healthy, so a quorum can be achieved by writing to a live majority. VoltDB offers flexibility to run with only two copies and still be available if one fails. VoltDB can also run with three copies and tolerate a loss of two of those copies without stopping. This flexibility means less hardware for the same work, and more availability for the same faults. The tradeoff is that if all but one replica have faulted out of the cluster, VoltDB would only write to one replica, instead of all or a majority of replicas.
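The difference in availability can be made concrete with a small comparison. This sketch only counts copies; it ignores the split-brain guard and everything else a real membership protocol has to do, and the function names are ours.

```python
def quorum_can_write(total_copies: int, live_copies: int) -> bool:
    """Majority-quorum rule: a write needs more than half of all configured copies."""
    return live_copies > total_copies // 2


def all_live_can_write(live_copies: int) -> bool:
    """Lockstep rule described above: write to every live replica, even if only one is left."""
    return live_copies >= 1


def tolerated_failures_quorum(total_copies: int) -> int:
    """Most copies that can fail while a majority is still live."""
    return (total_copies - 1) // 2


def tolerated_failures_all_live(total_copies: int) -> int:
    """Most copies that can fail while at least one replica is still live."""
    return total_copies - 1


for copies in (2, 3):
    print(copies,
          tolerated_failures_quorum(copies),
          tolerated_failures_all_live(copies))
# 2 0 1
# 3 1 2
```

With two copies, a majority quorum cannot lose any replica and keep writing, while the all-live-replicas rule can lose one. With three copies the numbers are one and two.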
How can VoltDB fail from two to just one while guarding against split brain?
The concern with a two-node cluster is that it’s impossible for one node to tell if the other node is down, or just cut off and still processing. To achieve consensus, we need to guarantee that one side of a network partition stops processing.
Many will tell you this can’t be done, but that’s wrong. It’s true that it’s impossible to guarantee one side stops and the other continues, but it’s totally possible to guarantee that no more than one side continues.
VoltDB identifies a “blessed node” for every current cluster state. When the current cluster partitions exactly half of the cluster away, VoltDB needs to determine if the other half of the cluster might be alive and processing. It checks if the new cluster contains the old blessed node, and will shut down if it does not. Similarly, if it does have the blessed node, it can be assured the other half of the old cluster (if live) will shut down. Any continuing cluster will determine a new blessed node.
The end result is that cutting the wire between the nodes of a two-node cluster will leave exactly one side live. The tradeoff is that killing a node in a two-node cluster has a 50% chance of shutting down the remaining node.
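The blessed-node check can be sketched as a single decision function. The exactly-half case follows the description above. The handling of the majority and minority cases, and the way a new blessed node is picked, are our assumptions for illustration and are not taken from VoltDB's code.

```python
def survives_partition(old_cluster: set[str], blessed: str, visible: set[str]) -> bool:
    """Decide whether the nodes in `visible` may keep running after a partition."""
    if blessed not in old_cluster:
        raise ValueError("blessed node must belong to the old cluster")
    if not visible <= old_cluster:
        raise ValueError("visible nodes must come from the old cluster")
    if 2 * len(visible) == len(old_cluster):
        # Exactly half: continue only if this side holds the old blessed node.
        return blessed in visible
    # Assumption: a strict majority continues, a strict minority stops.
    return 2 * len(visible) > len(old_cluster)


def pick_blessed(cluster: set[str]) -> str:
    """Hypothetical rule for choosing the next blessed node: lowest node id."""
    return min(cluster)


old = {"n1", "n2"}
blessed = pick_blessed(old)                       # "n1" under our hypothetical rule
print(survives_partition(old, blessed, {"n1"}))   # True
print(survives_partition(old, blessed, {"n2"}))   # False
```

When the wire is cut, each node sees only itself, and exactly one of the two calls returns True. The same code shows the tradeoff: if the blessed node itself is killed, the survivor also sees only itself and shuts down.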
When is a transaction committed?
We consider a transaction committed when it is visible and durable. That is, if you send a request after commit, you can read any changes made by the transaction. If you kill a node and rejoin, the changes will also be visible. If you are using synchronous disk persistence, all committed transactions are on disk and will be visible after disk recovery.
That said, it’s not always easy to nail down precisely when this guarantee is safe, but we have some bounds. If VoltDB sends an acknowledgement that a transaction successfully committed or fully rolled back to a client, then it must be so. When there are no faults of any kind, this is the common path.
However, because VoltDB is a distributed system using a network, faults can and do happen. The client could die before it receives the acknowledgement. The VoltDB node that’s connected to the client might die. One or more nodes involved in the transaction might die. Nothing could die, but network traffic could be interrupted or cut. If the client doesn’t receive an acknowledgement, VoltDB may have committed or rolled back the transaction. The best way to tell is to use a follow-on transaction to inspect the state of the database. Note this isn’t really a VoltDB problem, but a problem with any stateful system that uses a network somewhere.
As an aside, the best way to deal with this problem is to make procedures idempotent and retry them until they are confirmed. The Call Center example is useful to learn about idempotence and retry-logic with VoltDB.
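A minimal sketch of the idempotent-retry pattern follows. It does not use the VoltDB client API: `FlakyStore` is a hypothetical in-memory stand-in for a stored procedure whose acknowledgement is sometimes lost, and the request-id scheme is one common way to get idempotence, not necessarily the one the Call Center example uses.

```python
import uuid


class FlakyStore:
    """Hypothetical stand-in for a database procedure; not the VoltDB client API."""

    def __init__(self, drop_first_n_acks: int) -> None:
        self.balance = 0
        self.applied: set[str] = set()
        self._drops = drop_first_n_acks

    def add_funds(self, request_id: str, amount: int) -> str:
        # Idempotent: a request id that was already applied is not applied again.
        if request_id not in self.applied:
            self.applied.add(request_id)
            self.balance += amount
        # Simulate a lost acknowledgement AFTER the change was committed.
        if self._drops > 0:
            self._drops -= 1
            raise ConnectionError("acknowledgement lost")
        return "committed"


def call_until_confirmed(store: FlakyStore, amount: int, max_attempts: int = 5) -> int:
    """Retry with the same request id until confirmed; return the attempts used."""
    request_id = str(uuid.uuid4())  # reused on every retry of this request
    for attempt in range(1, max_attempts + 1):
        try:
            store.add_funds(request_id, amount)
            return attempt
        except ConnectionError:
            continue
    raise RuntimeError("request was never confirmed")


store = FlakyStore(drop_first_n_acks=2)
print(call_until_confirmed(store, 100), store.balance)  # 3 100
```

The change is applied on the first attempt, but the client only learns that on the third. Because the procedure is idempotent, the two retries do not add the funds again.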
How fast are cross-partition transactions?
One thing Kingsbury points out in his blog post about VoltDB and Jepsen is that we have few benchmarks that address cross-partition transaction performance. Many of our example applications do use cross-partition transactions, but few really saturate this path.
As Kingsbury mentions in his blog post, users can typically expect hundreds of cross-partition writes and often tens of thousands of cross-partition reads per second. The actual numbers vary with cluster size, network performance, and of course, the work that the transactions actually do.
The most common use case for VoltDB is high volume partitionable read-write operations, mixed with lower volume cross-partition reads for global understanding as data is being processed. Cross-partition writes are often used to update dimension tables, or for less frequent secondary operations.
It’s also worth noting that the number of transactions per second is only loosely related to the number of rows read, inserted or updated per second. In a single VoltDB transaction, many rows can be updated. This means a cross-partition write throughput of 500 transactions per second may enable a cross-partition insert rate of tens of thousands of rows per second.
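As a worked example of that relationship, the row rate is simply the transaction rate times the rows touched per transaction. The figure of 40 rows per transaction below is an assumption chosen for illustration, not a measured number.

```python
def rows_per_second(transactions_per_second: int, rows_per_transaction: int) -> int:
    """Row throughput implied by a transaction rate and a batch size."""
    return transactions_per_second * rows_per_transaction


# 500 cross-partition writes per second, each inserting 40 rows (assumed).
print(rows_per_second(500, 40))  # 20000
```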
How does VoltDB handle partial network partitions?
VoltDB maintains a TCP connection between every pair of nodes in the cluster. For example, in a five-node cluster, each node would have four TCP connections, one to each of the other four nodes.
If a TCP connection is closed using a FIN packet, or if no traffic has been received on it for a set timeout, then the node that first notices will trigger a cluster-wide fault-resolution event. In any fault-resolution event, the cluster will eventually agree on a maximally connected subgraph that doesn’t include the failed TCP connection.
If multiple faults are reported before the fault is resolved and consensus on new cluster topology is achieved, then the fault process is restarted. This is quite common. In the simple case, two nodes may report a fault on the same link between them. In complex cases, many faults from a complex network partition may be reported.
Note that if a node reports a link as faulted, at least one node using that link will be removed from the cluster. There are no “takebacks”. The way this system works, it doesn’t matter how many links fail, or whether they fail asymmetrically or symmetrically. Faults will be reported, and the fault-resolution process will run to completion.
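To make the "maximally connected subgraph" idea concrete, here is a brute-force sketch that finds the largest group of nodes in which every pair still has a working link. It is only an illustration for small clusters: the tie-breaking by sorted node name is our assumption, and the real fault-resolution protocol also has to reach agreement across nodes, which this does not model.

```python
from itertools import combinations


def largest_fully_connected(nodes: set[str], failed_links: list[tuple[str, str]]) -> set[str]:
    """Return the largest subset of nodes with no failed link between any pair."""
    failed = {frozenset(link) for link in failed_links}
    ordered = sorted(nodes)  # assumption: ties are broken by node name
    for size in range(len(ordered), 0, -1):
        for group in combinations(ordered, size):
            if all(frozenset(pair) not in failed for pair in combinations(group, 2)):
                return set(group)
    return set()


cluster = {"n1", "n2", "n3", "n4", "n5"}

# A five-node full mesh has 10 links, four per node.
print(len(list(combinations(cluster, 2))))  # 10

# One failed link between n1 and n2: one of the two must go.
print(sorted(largest_fully_connected(cluster, [("n1", "n2")])))
# ['n1', 'n3', 'n4', 'n5']
```

Even though only one link failed and both n1 and n2 can still reach the other three nodes, one of them is removed, which matches the "no takebacks" behavior described above.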
How does VoltDB handle bit flip errors on disk?
VoltDB is somewhat hardened against disk corruption errors and has a disk usage profile that is more robust to disk corruption than many systems.
- Snapshots and command log segments are checksummed and the checksum is stored on disk with the data. These checksums are checked when snapshots are loaded and when logs are replayed.
- VoltDB never reads from disk during typical operations. Even failing and rejoining nodes is done without depending on the on-disk persistent data. On-disk snapshots and logs are only used to recover a complete cluster.
- VoltDB’s snapshots and logs on disk are not kept for long. Snapshots and logs are erased and rewritten continuously. The maximum lifetime for any recovery data is usually measured in minutes.
- A redundant VoltDB cluster typically has multiple copies of every snapshot and log. If recovery fails due to disk corruption, it is often possible to recover from another copy of data.
There are unit tests for this code, where artificially corrupted logs and snapshots are loaded into VoltDB, and we verify VoltDB detects the corruption. A future enhancement would be to add corrupted logs and snapshots to our more complex system tests.
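The checksum-and-verify idea, together with a test that corrupts data on purpose, can be sketched as follows. The on-disk layout here (a 4-byte CRC32 followed by the payload) is an assumption made for the example; the text above does not specify VoltDB's actual checksum algorithm or file format.

```python
import struct
import zlib


def write_segment(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so both are stored together."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def read_segment(blob: bytes) -> bytes:
    """Verify the stored checksum before returning the payload."""
    if len(blob) < 4:
        raise ValueError("segment too short to contain a checksum")
    (stored,) = struct.unpack(">I", blob[:4])
    payload = blob[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: segment is corrupt")
    return payload


def flip_bit(blob: bytes, bit_index: int) -> bytes:
    """Return a copy of blob with a single bit flipped, to simulate disk corruption."""
    corrupted = bytearray(blob)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


segment = write_segment(b"insert into t values (1, 'a');")
assert read_segment(segment) == b"insert into t values (1, 'a');"

try:
    read_segment(flip_bit(segment, 40))  # flip one bit inside the payload
except ValueError as err:
    print(err)  # checksum mismatch: segment is corrupt
```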
How does VoltDB handle clock skew in its agreement algorithms?
VoltDB doesn’t use Paxos or Raft for its core consensus protocols. It achieves agreement by creating a global order of unique ids. This idea is partly described in the original H-Store paper (pdf), though the actual system is much more robust than what is described there.
Since these unique ids depend on the wall clock to some extent, VoltDB can be sensitive to clock skew. If the clocks on two different nodes are out of sync by 10ms, that can slow down consensus operations by 10ms per operation. This is why VoltDB checks for skew on startup and will refuse to start if skew is too high.
But what if the clock moves forward or backward while a cluster is running? The important thing is that the unique ids generated are monotonically increasing. Clocks moving forward don’t break this invariant, though they may slow down agreement as mentioned. Clocks moving backward are handled as well. If the clock moves backwards by less than three seconds, the system will pause until time has caught up. If it moves backwards by more than three seconds, the system may eject the node with the bad clock from the cluster.
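The backward-clock handling can be sketched as an id generator that refuses to hand out an id smaller than the last one. The three-second threshold comes from the description above. Everything else is an assumption for illustration: the id shape (a timestamp plus a per-millisecond counter), the class name, and raising an exception to stand in for ejecting the node.

```python
import time
from typing import Callable


class UniqueIdGenerator:
    """Hands out strictly increasing (timestamp_ms, counter) ids."""

    MAX_BACKWARD_MS = 3000  # threshold taken from the text above

    def __init__(
        self,
        now_ms: Callable[[], int] = lambda: int(time.time() * 1000),
        sleep_ms: Callable[[int], None] = lambda ms: time.sleep(ms / 1000),
    ) -> None:
        self._now_ms = now_ms
        self._sleep_ms = sleep_ms
        self._last_ms = -1
        self._counter = 0

    def next_id(self) -> tuple[int, int]:
        now = self._now_ms()
        while now < self._last_ms:
            behind = self._last_ms - now
            if behind > self.MAX_BACKWARD_MS:
                # Stand-in for ejecting the node with the bad clock.
                raise RuntimeError(f"clock moved back {behind}ms; node should be ejected")
            self._sleep_ms(behind)  # pause until time has caught up
            now = self._now_ms()
        if now == self._last_ms:
            self._counter += 1      # same millisecond: bump the counter
        else:
            self._last_ms = now
            self._counter = 0
        return (now, self._counter)


generator = UniqueIdGenerator()
first, second = generator.next_id(), generator.next_id()
assert second > first  # tuples compare by timestamp, then by counter
```

The clock and sleep functions are injected so the backward-clock paths can be exercised with a fake clock in tests instead of changing the system time.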
The important thing is that clock movement can’t break any of the consistency guarantees of VoltDB; it only affects the throughput of certain agreement operations. If clocks are out of whack, it may take longer to change schema, or request a user snapshot. Regular transaction throughput should not be affected. In practice, using NTP properly, especially with the -x option, can minimize any issues.
VoltDB has been using this system in production since 2011. We’ve dealt with all manner of forwards and backwards clock movement, and seen first hand what a fantastic stress test running in Amazon Web Services can be. We even successfully stayed up through multiple leap seconds and are looking forward to another one at the end of 2016. This code has been tested by us, and in production.
- Read the VoltDB Jepsen Blog: VoltDB 6.4 Passes Official Jepsen Testing
- Read the Reasons behind the VoltDB Architecture.
- Discuss Jepsen & VoltDB at Hacker News.