Reasons Behind the VoltDB Architecture

When we tell people how fast VoltDB is - millions of multi-statement ACID transactions per second, with full disk persistence - they often ask us, "What's the catch?"

The purpose of this in-memory database architecture overview is to explain the how behind VoltDB performance. What are the innovations? Where are the tradeoffs?

Database Superfriends

In 2007, the research team behind the Aurora Complex Event Processing system (commercialized as Streambase and the C-Store Analytics (OLAP) system (commercialized as Vertica) turned their attention to the problem of building an operational database for the 21st century.

Led by Turing Laureate Mike Stonebraker, the academic team behind the original VoltDB research included Dan Abadi, Pat Helland, Sam Madden, Andy Pavlo, and Stan Zdonik. The team, at this writing, has more than 150 years of collective database experience.

As with their previous projects, the team focused its research on two basic ideas.

  • How has computing changed in the 21st century and how can we leverage that to build a better system?
  • How can we make the system better by adding focus? If our goal is not to solve every problem in the world, but to focus on operational stores, how much faster can we go? 

Memory, Memory, Memory

The team's first key insight was that memory was getting cheaper faster than operational stores were growing. When systems such as MySQL were designed, memory was costly, used mostly as a sliding window over slower, disk-encumbered state. Today, many operational datasets fit in memory, and tomorrow even more will. While analytical datasets are exploding in size, the operational systems of the future will be memory-centric.

But there's a catch. While memory might be 100 times faster than SSDs and 10,000 times faster than spinning disks, in-memory datastores aren't similarly faster. RDBMSs built to be in-memory disappointed with less than 10X performance increases. Experimenters who backed traditional systems like MySQL with a RAM-based filesystem were often similarly disappointed. What was holding back these systems?

To find out, the researchers did something rather clever. They took an open source RDBMS that followed traditional RDBMS architecture, ran it on a memory-backed filesystem, and measured where it spent its time. The research showed that traditional databases spent less than 10% of their time doing actual work; most of the time was spent in two places.

  • Page Buffer Management: The page buffer system assigns database records to fixed-size pages, organizes their placement within pages, and then manages which pages are loaded into memory and which are on disk. Additionally, pages must be tracked as dirty or clean as they are read and written to. This entire subsystem exists to manage the storage of data on disk; it does not add value to an in-memory system.
  • Concurrency Management: The database must solve two concurrency problems. First, there exists a logical problem that multiple user transactions operating concurrently must not conflict, and must read consistent data. This is accomplished using high-level locks on resources like tables and rows, and tracking those locks across transactions. At the same time, the actual database software has multiple threads of execution. Data structures must be thread safe, which adds execution cost. For example, there is a large body of research on building concurrent B-Tree indexes, but the takeaway is that there is tremendous complexity for only moderate speed increases.

Most memory-centric systems address the page buffer, but few are able to drastically reduce concurrency costs. In fact, as other parts of the system get faster, the concurrency problem actually exacerbates; competing processes race to contention points.

The conclusion from the research was clear: to go 10X faster, the researchers needed a radical solution to concurrency. Read the paper (PDF) published at SIGMOD 2008 for more details.

Horizontal Scale Out

Beyond memory, the second major shift is the move to horizontal, scale-out systems. Lately, Moore's law has been used to give us more cores, rather than faster single-core performance. Clustering on commodity hardware is expected and hardware fault-tolerance is table stakes.

Horizontal scaling has been very difficult for traditional RDBMSs. Many are of the opinion that the relational model can't scale. Data movement is an issue; executing arbitrary joins can mean tremendous data movement, and commodity networking can present bottlenecks. Moreover, as locks are held over network round trips, concurrency becomes an even bigger problem.

But the team had hope. The C-Store system showed the relational model could be scaled for analytical SQL. The researchers conjectured that an architecture with the same domain-specific assumptions would work for operational stores as well. VoltDB was born.

VoltDB: The Big Ideas

One Thread - One Task at a Time

There are many ways to make the cost of concurrency lower, but few met the team's needs. To go 10X faster, concurrency costs needed to be all but eliminated. Since there's only one way to do that, the choice was obvious. 

All data operations in VoltDB are single-threaded, running each operation to completion before starting the next. Fancy concurrent B-Trees and lockless data structures are replaced by simple data structures with no thread safety or concurrent access costs. Not only does this meet speed goals, it's actually much simpler to build, test and modify, a double-win. 

Note that this choice is only really possible with a memory-centric design. The latency of reading and writing from disk is often hidden by shared memory multi-threading, but when running single-threaded, that latency will cause the CPU to spend most of its time idle. Eliminating disk waits allowed us to keep the CPU saturated, and run rings around traditional architectures.

No Waiting on Users

As an operational store, the VoltDB "operations" in question are actually full ACID transactions, with multiple rounds of reads, writes and conditional logic. If the system is going to run transactions to completion, one after another, disk latency isn't the only stall that must be eliminated; it is also necessary to eliminate waiting on the user mid-transaction.

That means external transaction control is out - no stopping a transaction in the middle to make a network round-trip to the client for the next action. The team made a decision to move logic server-side and use stored procedures.

There are some drawbacks here. External transaction control can make certain problems easier. Splitting business logic between the client and server requires additional organization. Also, most object-relational-mappers (ORMs) rely on external transaction control.

There are no perfect answers to these questions. The good news is these problems apply to any system designed to scale. One of the common ways to make any RDBMS faster is to reduce contention by eliminating external transaction control. At scale, the system shouldn't check in with the client over a network while it's holding locks. And while there are many criticisms of stored procedures, VoltDB addresses many of them. Rather than product-specific SQL-derived languages, VoltDB uses Java as its procedure language. Java is fast, familiar, and has a huge ecosystem. It's even possible to step through stored procedure code in an Eclipse IDE.

Much has been written about ORMs, which are famous for impeding scale. They make it easy to build large complex applications that work for low to moderate throughput, but quickly become the point of contention at scale. 

At a fundamental level, stored procedures move code to the data, and external transaction control moves data to the code. One of these things is much bigger than the other, so that's why VoltDB was designed to move code, not data.

Concurrency through Scheduling, Not Shared Memory

The team building VoltDB solved the concurrency problem by doing one thing at a time using a single-threaded pipeline. But how does that reconcile with multi-core hardware?

First, the decision was made to have VoltDB scale to multiple machines. The team needed to figure out a way to manage multiple single-threaded pipelines across a cluster. To deal with multi-core, why not simply move the abstraction? Treat four machines with four cores each as sixteen machines. Each core gets a single-threaded pipeline, and build the system to manage them in aggregate. The database thus can scale to lots of cores, lots of nodes or both. All the while, there are no concurrent data structures, locks or other expensive synchronization in the data management layer. 

The next goal was to keep the pipelines full. VoltDB does this by partitioning the data. Say a finance app is partitioned by ticker symbol; now all operations on "MSFT" can be directed toward the appropriate pipeline. If an app is partitioned by customer, an operation on a specific customer's data can be routed easily to the right pipeline. It's also possible to operate on data across pipelines by breaking the work up into per-pipeline chunks and processing per-pipeline results into global results. For more details on how transactions work in a partitioned environment, see How VoltDB does Transactions

VoltDB calls this approach "concurrency through scheduling." There are still threads and locks in the system, mostly to support networking and scheduling transactions, but all SQL and Java procedure logic is single-threaded and blindingly fast. The locks are there to schedule things that are low-contention; they protect small scheduling data structures, rather than user data for a transaction.

Modern Disk Persistence

While many data sets fit in memory, there is an impermanence to memory that makes belt-and-suspenders engineering types nervous. VoltDB needed disk persistence to be ready for the enterprise. But the team had to determine how to use disks without crippling performance. They asked: What does a modern approach look like? 

For starters, VoltDB barely reads from disk at all, so much of the workload of a traditional system is removed. Second, VoltDB disk IO is almost 100% append-only streaming writes. Even spinning disks can sustain high write throughput when used this way. Third, disk IO is almost entirely parallel to the operational workload. The system is designed to almost never block on disk synchronization.

This is achieved through two mechanisms: background snapshots, and inter-snapshot logical logging. VoltDB background snapshots are transactional, and serialize data to disk at a single, logical point-in-time, cluster-wide. They proceed at the speed of disk; a slower disk will take fewer snapshots per hour. They also don't block ongoing operational work. 

Logical logging protects data that mutates between snapshots. A logical log of all write operations is streamed to disk. If an entire cluster fails, the most recent snapshot is reloaded, followed by a replay of the logical log to bring the cluster back to the point of failure. This logical log has a huge throughput advantage over binary logs, partly because it is bounded in size, but mostly because disk IO can begin before the operational work is started. Binary logs must wait until an operation has completed to log to disk. Combined with a group commit mechanism, VoltDB's logging is remarkably low impact, allowing millions of writes per second, per node with synchronous disk persistence.

High Availability & Leveraging Determinism

While disk persistence is great, in a parallel, shared-nothing cluster, it's imperative to keep data in more than one place. High availability is the promise that VoltDB will continue to run in the face of many common hardware and network failures. Through replication, VoltDB achieves high availability and additional data safety.

The typical method of replication in shared-nothing clusters is to use a master node and one or more replica nodes. Changes are first applied to the master, then synchronously or asynchronously sent to the replica(s). This approach has several downsides. If replication is synchronous, additional latency costs are added by replication. If replication is asynchronous, data may be lost when the master fails. 

VoltDB uses synchronous replication without the latency overhead. It does this by being a little unconventional. Since all operations (transactions) in VoltDB are self-contained and deterministic, it's enough to simply replicate an ordered list of operations to do, and let each replica do the same work. All replicas do the same work in the same order at the same time in parallel. Responses are sent to the client when all replicas confirm completion. VoltDB ensures that operations are deterministic and replica state is identical in several ways, including distributed hashing of data mutation. 

Replica mastership merely implies a replica within a set of replicas is responsible for ordering deterministic writes. Node failure will trigger a fault detection and resolution mechanism that re-elects masters for all replica sets. There is a cost to replication in terms of duplicated hardware, but almost zero cost in terms of latency, and throughput may actually increase.


Hopefully we've answered the big questions, but if this has piqued your interest, here are some great resources to continue learning about VoltDB and Fast Data:

Don't Miss:

Developer Resources:

There are numerous resources available to developers.

Get Connected:

  • Developer Central

    One centralized place with all developer resources. Go There

  • A Look at a VoltDB Sample App

    In this blog, John Hugg walks us through a sample app in VoltDB. Read More

  • How VoltDB Works

    Take a simple dive into the VoltDB structure. Read More

  • Build a Sample App

    After Downloading VoltDB, here's a tutorial in building a sample app. Dive In


Get Started Today

It shouldn't take weeks to begin building blazing apps with real-time personalization and fast transactions. Developers: Download VoltDB and spin through our Quick Start Guide in less than 30 minutes.

Download & Quick Start