FoundationDB’s Lesson: A Fast Key-Value Store is Not Enough
The sale and subsequent closure of FoundationDB cut short something of a grand experiment. FoundationDB, conceived as a Key-Value store, had decided to add flexibility in the form of programming and query-model “Layers” on top of its core KV store. First up was SQL, software that ran on top of core FoundationDB and provided SQL relations, indexes and queries. A graph interface and possibly other “Layers” would follow.
So how did the SQL system work out? Running sysbench, FDB-SQL was less than half as fast as MySQL on a single machine. This is by their own measurements. They do claim their system scales well as you add more machines, but crucially, they don’t give actual numbers for the distributed SQL performance on sysbench, just “normalized performance.” I’d bet some money that absolute performance was not good. If the network overhead of the loopback interface was bottlenecking the single-machine test, then an actual network with actual latency couldn’t have helped.
The lesson here circles back to the failed search for one-size-fits-all architectures. The architecture that made FoundationDB a good ACID-transactional key-value store for operations on small numbers of keys is the wrong architecture to implement an operational SQL store.
This came up the other day in a discussion about the FDB sale on Hacker News. I’m just going to quote my little rant below; the comments that prompted it are in italics.
“Sure, but isn’t that basically what everyone does? All of the relational databases I know use ordered key-value storage engines. FoundationDB is the same, except it’s distributed.”
“The point is to use FDB as a foundation. I heard how one of the FDB founders replaced SQLite’s B-tree storage engine with FoundationDB, which turned SQLite into a distributed SQL database. It’s much more difficult to make something like that if you had to tackle the hard distributed systems problems.”
So this is a really interesting point and I can see why it makes sense if you’ve not built a SQL engine. Why is building SQL on top of a KV store the wrong call? What’s the difference between MySQL’s or Postgres’s or VoltDB’s “storage engine” and what FDB had built?
First, I’m not claiming that putting SQL on top of a KV store like FDB is impossible, just that you’re going to have to either compromise the purity of the KV engine substantially, or you’re going to get slow SQL for anything other than single row CRUD.
It starts with metadata. FDB-SQL stores metadata in the KV store itself. This is great from a distributed correctness point-of-view. It’s also great from a simplicity point-of-view. If I trust the underlying system to be safe and consistent, then my metadata is also safe and consistent. But now I need to do a ton of reads before I run my SQL to know how to run my SQL. Where’s my data, for example? Compare this to VoltDB, which replicates metadata to all processing sites, and basically has a second state-machine to ensure that each processing site has the right metadata at the right time. Updating metadata is more expensive, but the fast-path of using the metadata is several orders of magnitude faster.
Then you get to locking. I believe FDB-KV uses per-key read-write-sets to manage concurrency. This choice makes their two-key transaction benchmarks look great, but it scales poorly as the number of keys per transaction grows. SQL basically begs users to write operations that read and write to lots of keys. To make scans fast (even common partial-index-scans), you’re going to need more granular locks, or you’re going to need to relax consistency (ACID, not CAP).
Now we get to storage efficiency. If I have a SQL relation with a primary key and other columns, do I split it so the primary key is in the “key” and the other columns are in the “value” in the KV store? Do I store the whole record in the “value” and lose some efficiency? What about relations with no primary key? Say I’ve got a table that facilitates a many-to-many relationship and it’s just a set of integer pairs. How do I store that efficiently in a KV store? And what about that pesky metadata? Does the key need to include the identifier of the SQL relation (table)? Of course it does.
Secondary indexes. Oh man. To do them well on top of a KV store, you’re going to need pretty strong consistency to ensure that index records point to the right base record, and that there are no base records without an index record. This rules out the aforementioned trick of relaxing consistency to get faster many-key transactions. It’s also a metadata problem; I’ve got metadata I need to read to understand how to use the secondary index. More than that, when I update a record in the base relation, I’m going to need to find all affected secondary indexes and make sure they reflect the new information. That might mean lots of reads to the system to query metadata to update the secondary indexes. And again, we have the efficiency problem where I have to put a bit of extra stuff in each “key” to identify the secondary index the KV pair belongs to. Secondary indexes have locking/consistency, storage efficiency and metadata problems in this model.
Finally you get data transfer. This is probably not a broadly applicable problem, but I think in the FDB implementation, there is a lot of needless data transfer between KV cluster nodes and SQL processing nodes. Too much data needs to be collected and processed, rather than pushing down that processing to the data’s resting location. In FDB-SQL, if I just want one value from a large record, do I have to move the whole record over the network? Most SQL systems build processing DAGs for each SQL statement and can in-process stream between nodes, or have efficient temp tables to buffer intermediate results.
This falls out of making the SQL layer so separate. In fact, I think the FDB team working on SQL was almost a completely separate team in Massachusetts than the KV team in Virginia. If the SQL layer worked, then I can see how awesome this would be. The “Layers” model sounds so appealing, but it’s technically quite hard.
So yes, I agree that under most SQL-relational systems, there is a storage engine that smells much like a KV store. Still, that system is much less “pure” than what they were going for at FDB. Locking, metadata, secondary indexes, native understanding of relations, moving the processing to the data — all are critical to do right to get reasonable performance.
If FDB had more time to continue on the SQL path, I imagine it would dictate much deeper hooks into the KV side of things. I’m sure with time it could get a lot better. There’s a lot of research addressing some of the tradeoffs I’ve mentioned above, especially in the distributed transaction coordination space. Still, it’s always going to be easier to build the storage engine for the kinds of operations you want from day one.
I think the interesting thing here is that FoundationDB saw what VoltDB has seen over the past five or six years: fast data mutation and lookup-by-key are important, but they’re not enough. At high mutation rates, users want visibility into how their data is changing and how their business is affected, usually in real-time. A fast key-value store – even one with strong ACID properties – is all well and good, but without a powerful query language and real index support, it’s much less interesting. Quoting the Google F1 Paper:
“Features like indexes and ad hoc query are not just nice to have, but absolute requirements for our business.”
FoundationDB saw this coming and bought Akiban to add a SQL layer. But bolting SQL onto a KV store, even a really good one, isn’t trivial to do. It was released behind schedule and didn’t live up to expectations.
Cassandra is an interesting parallel. They also saw the importance of powerful queries and have added CQL to their system to support a SQL-esque query layer. Unlike FDB-SQL, CQL is much more natively integrated into Cassandra. It’s still a much less powerful query system than a proper relational database with real relationships, secondary indexes and such. But it seems like it’s close enough to complement Cassandra’s other strengths.
I’ll close with congratulations to the FoundationDB team. I’ve met some excellent engineers on their team and wish them the best in their new roles.