One Thread – One Task at a Time
There are many ways to make the cost of concurrency lower, but few met the team’s needs. To go 10X faster, concurrency costs needed to be all but eliminated. Since there’s only one way to do that, the choice was obvious.
All data operations in VoltDB are single-threaded, running each operation to completion before starting the next. Fancy concurrent B-Trees and lockless data structures are replaced by simple data structures with no thread safety or concurrent access costs. Not only does this meet speed goals, it’s actually much simpler to build, test and modify, a double-win.
Note that this choice is only really possible with a memory-centric design. The latency of reading and writing from disk is often hidden by shared memory multi-threading, but when running single-threaded, that latency will cause the CPU to spend most of its time idle. Eliminating disk waits allowed us to keep the CPU saturated, and run rings around traditional architectures.
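To make the run-to-completion model concrete, here is a minimal sketch (illustrative only, not VoltDB code): a single thread owns the data and drains a queue of operations one at a time, so a plain, unsynchronized `HashMap` is safe to use with no locks or concurrent data structures.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of run-to-completion execution: one thread owns the data,
// so the backing HashMap needs no synchronization at all.
public class SingleThreadedEngine {
    private final Map<String, Long> table = new HashMap<>();   // unsynchronized on purpose
    private final Queue<Runnable> queue = new ArrayDeque<>();

    public void submit(Runnable op) { queue.add(op); }

    // Runs each operation to completion before starting the next.
    public void drain() {
        Runnable op;
        while ((op = queue.poll()) != null) {
            op.run();
        }
    }

    public void add(String key, long delta) {
        table.merge(key, delta, Long::sum);
    }

    public long get(String key) { return table.getOrDefault(key, 0L); }

    public static void main(String[] args) {
        SingleThreadedEngine engine = new SingleThreadedEngine();
        engine.submit(() -> engine.add("MSFT", 5));
        engine.submit(() -> engine.add("MSFT", 3));
        engine.drain();
        System.out.println(engine.get("MSFT")); // 8
    }
}
```

Because only the drain thread ever touches `table`, the usual costs of concurrent access simply never arise, which is the double-win described above.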
No Waiting on Users
As an operational store, the VoltDB “operations” in question are actually full ACID transactions, with multiple rounds of reads, writes and conditional logic. If the system is going to run transactions to completion, one after another, disk latency isn’t the only stall that must be eliminated; it is also necessary to eliminate waiting on the user mid-transaction.
That means external transaction control is out – no stopping a transaction in the middle to make a network round-trip to the client for the next action. The team decided to move logic server-side and use stored procedures.
There are some drawbacks here. External transaction control can make certain problems easier. Splitting business logic between the client and server requires additional organization. Also, most object-relational-mappers (ORMs) rely on external transaction control.
There are no perfect answers to these questions. The good news is these problems apply to any system designed to scale. One of the common ways to make any RDBMS faster is to reduce contention by eliminating external transaction control. At scale, the system shouldn’t check in with the client over a network while it’s holding locks. And while there are many criticisms of stored procedures, VoltDB addresses many of them. Rather than product-specific SQL-derived languages, VoltDB uses Java as its procedure language. Java is fast, familiar, and has a huge ecosystem. It’s even possible to step through stored procedure code in an Eclipse IDE.
Much has been written about ORMs, which are famous for impeding scale. They make it easy to build large complex applications that work for low to moderate throughput, but quickly become the point of contention at scale.
At a fundamental level, stored procedures move code to the data, and external transaction control moves data to the code. One of these things is much bigger than the other, so that’s why VoltDB was designed to move code, not data.
Concurrency through Scheduling, Not Shared Memory
The team building VoltDB solved the concurrency problem by doing one thing at a time using a single-threaded pipeline. But how does that reconcile with multi-core hardware?
First, the decision was made to have VoltDB scale to multiple machines. The team needed to figure out a way to manage multiple single-threaded pipelines across a cluster. To deal with multi-core, why not simply move the abstraction? Treat four machines with four cores each as sixteen machines. Each core gets a single-threaded pipeline, and build the system to manage them in aggregate. The database thus can scale to lots of cores, lots of nodes or both. All the while, there are no concurrent data structures, locks or other expensive synchronization in the data management layer.
The next goal was to keep the pipelines full. VoltDB does this by partitioning the data. Say a finance app is partitioned by ticker symbol; now all operations on “MSFT” can be directed toward the appropriate pipeline. If an app is partitioned by customer, an operation on a specific customer’s data can be routed easily to the right pipeline. It’s also possible to operate on data across pipelines by breaking the work up into per-pipeline chunks and combining the per-pipeline results into a global result.
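The routing idea can be sketched in a few lines (again, illustrative only, not VoltDB code): hash the partition key to pick one pipeline for single-partition work, and fan out across all pipelines for multi-partition work, combining the partial results.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of partition routing: a key hashes to exactly one pipeline,
// and cross-partition work combines per-pipeline partial results.
public class PartitionRouter {
    private final int pipelineCount;
    private final Map<String, Long>[] partitions;

    @SuppressWarnings("unchecked")
    public PartitionRouter(int pipelineCount) {
        this.pipelineCount = pipelineCount;
        this.partitions = new HashMap[pipelineCount];
        for (int i = 0; i < pipelineCount; i++) partitions[i] = new HashMap<>();
    }

    // Single-partition operation: the key determines exactly one pipeline.
    public int route(String key) {
        return Math.floorMod(key.hashCode(), pipelineCount);
    }

    public void add(String key, long delta) {
        partitions[route(key)].merge(key, delta, Long::sum);
    }

    // Multi-partition operation: each pipeline computes a local result,
    // then the partial results are combined into a global one.
    public long totalVolume() {
        long total = 0;
        for (Map<String, Long> p : partitions)
            for (long v : p.values()) total += v;
        return total;
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(16); // e.g. 4 machines x 4 cores
        router.add("MSFT", 100);
        router.add("AAPL", 250);
        System.out.println(router.totalVolume()); // 350
    }
}
```

Every operation on “MSFT” hashes to the same pipeline, so all work on that symbol runs single-threaded and in order, with no coordination against work on other symbols.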
VoltDB calls this approach “concurrency through scheduling.” There are still threads and locks in the system, mostly to support networking and scheduling transactions, but all SQL and Java procedure logic is single-threaded and blindingly fast. The locks are there to schedule things that are low-contention; they protect small scheduling data structures, rather than user data for a transaction.
Modern Disk Persistence
While many data sets fit in memory, there is an impermanence to memory that makes belt-and-suspenders engineering types nervous. VoltDB needed disk persistence to be ready for the enterprise. But the team had to determine how to use disks without crippling performance. They asked: What does a modern approach look like?
For starters, VoltDB barely reads from disk at all, so much of the workload of a traditional system is removed. Second, VoltDB disk IO is almost 100% append-only streaming writes. Even spinning disks can sustain high write throughput when used this way. Third, disk IO is almost entirely parallel to the operational workload. The system is designed to almost never block on disk synchronization.
This is achieved through two mechanisms: background snapshots, and inter-snapshot logical logging. VoltDB background snapshots are transactional, and serialize data to disk at a single, logical point-in-time, cluster-wide. They proceed at the speed of disk; a slower disk will take fewer snapshots per hour. They also don’t block ongoing operational work.
Logical logging protects data that mutates between snapshots. A logical log of all write operations is streamed to disk. If an entire cluster fails, the most recent snapshot is reloaded, followed by a replay of the logical log to bring the cluster back to the point of failure. This logical log has a huge throughput advantage over binary logs, partly because it is bounded in size, but mostly because disk IO can begin before the operational work is started. Binary logs must wait until an operation has completed to log to disk. Combined with a group commit mechanism, VoltDB’s logging is remarkably low impact, allowing millions of writes per second, per node with synchronous disk persistence.
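The recovery path described above can be sketched as follows (illustrative only, not VoltDB code): restore the most recent snapshot, then replay the ordered logical log of write operations recorded since that snapshot. Because the operations are deterministic, replay reproduces the exact pre-failure state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of snapshot-plus-logical-log recovery. The log records
// operations (what to do), not binary page images (what changed).
public class SnapshotAndLog {
    // A logged write operation: add `delta` to `key`.
    record Write(String key, long delta) {}

    static Map<String, Long> recover(Map<String, Long> snapshot, List<Write> log) {
        Map<String, Long> state = new HashMap<>(snapshot);
        // Replay the logical log in order; determinism makes this
        // reproduce the exact state at the point of failure.
        for (Write w : log) state.merge(w.key(), w.delta(), Long::sum);
        return state;
    }

    public static void main(String[] args) {
        Map<String, Long> snapshot = Map.of("MSFT", 100L);      // last background snapshot
        List<Write> log = new ArrayList<>(List.of(              // writes since that snapshot
                new Write("MSFT", 25), new Write("AAPL", 10)));
        Map<String, Long> state = recover(snapshot, log);
        System.out.println(state.get("MSFT")); // 125
    }
}
```

Note why the log can be written eagerly: the `Write` record is known before the operation executes, so its disk IO can be issued up front, whereas a binary log entry cannot exist until the operation has finished computing the changed bytes.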
High Availability & Leveraging Determinism
While disk persistence is great, in a parallel, shared-nothing cluster, it’s imperative to keep data in more than one place. High availability is the promise that VoltDB will continue to run in the face of many common hardware and network failures. Through replication, VoltDB achieves high availability and additional data safety.
The typical method of replication in shared-nothing clusters is to use a master node and one or more replica nodes. Changes are first applied to the master, then sent synchronously or asynchronously to the replica(s). This approach has downsides either way: synchronous replication adds latency to every write, while asynchronous replication can lose data when the master fails.
VoltDB uses synchronous replication without the latency overhead. It does this by being a little unconventional. Since all operations (transactions) in VoltDB are self-contained and deterministic, it’s enough to simply replicate an ordered list of operations to do, and let each replica do the same work. All replicas do the same work in the same order at the same time in parallel. Responses are sent to the client when all replicas confirm completion. VoltDB ensures that operations are deterministic and that replica state stays identical in several ways, including distributed hashing of data mutations.
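The core of the trick is worth a small sketch (illustrative only, not VoltDB code): ship the ordered list of deterministic operations rather than the changed data, and let each replica execute the list independently. Identical operations applied in identical order yield identical state.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of deterministic replication: replicas execute the same
// ordered operation list independently and converge to the same state.
public class DeterministicReplication {
    record Op(String key, long delta) {}

    static Map<String, Long> applyAll(List<Op> orderedOps) {
        Map<String, Long> state = new HashMap<>();
        for (Op op : orderedOps) state.merge(op.key(), op.delta(), Long::sum);
        return state;
    }

    public static void main(String[] args) {
        List<Op> ops = List.of(new Op("MSFT", 5), new Op("MSFT", -2), new Op("AAPL", 7));
        // Each replica runs the same ordered list independently...
        Map<String, Long> replicaA = applyAll(ops);
        Map<String, Long> replicaB = applyAll(ops);
        // ...and ends in exactly the same state, with no data shipped between them.
        System.out.println(replicaA.equals(replicaB)); // true
    }
}
```

Because only the compact operation list crosses the network, and both replicas execute in parallel, the synchronous acknowledgment adds coordination but not the cost of shipping mutated data.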
Replica mastership simply means that one replica within a set is responsible for ordering the deterministic writes for that set. Node failure triggers a fault detection and resolution mechanism that re-elects masters for all replica sets. Replication has a cost in duplicated hardware, but almost zero cost in latency, and throughput may actually increase.
To continue exploring VoltDB, we encourage you to visit our Developer Central section or download a free trial.