Connecting VoltDB to Amazon Kinesis streams
I work on the VoltDB Scrum team tasked with creating interfaces – both import and export – to many other applications and systems in the broader ecosystem of big and fast data. It’s an always-interesting place to be, since we’re at the center of the rapidly evolving world of open source and private platforms, applications and solutions.
Today I’m writing about a recent integration, released in VoltDB v6.6, that allows system builders to connect VoltDB in either a data producer or a data consumer role with Amazon (AWS) Kinesis streaming services.
This is another step in our continuing commitment to build and support connections to a variety of producers and consumers in the big and fast data world. That includes connectors to Hadoop, Kafka, RabbitMQ, and Elasticsearch, as well as generic interfaces such as JDBC (to connect to other database systems), HTTP, and file import and streaming export (CSV, TSV).
The goals are consistent – easy to set up and use, high performance, and the no-compromises strong ACID compliance that’s at the core of VoltDB.
Streaming to Amazon Kinesis Firehose
For the first case, VoltDB is the data source with respect to Amazon Kinesis. Let’s imagine a use case where client applications load high frequency stock trade records into a VoltDB cluster for validation and aggregation. Since VoltDB is an in-memory database, we set up time windowing procedures to insert rows into streams (export tables) and delete the corresponding rows from the in-memory tables.
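The windowing idea above can be sketched in plain Java. This is only an in-memory simulation, not VoltDB code: the class TradeWindowSketch, its Trade type, and the two lists standing in for the in-memory table and the export stream are all hypothetical names introduced for illustration. In a real application the same two steps (insert into the stream, delete from the table) would presumably run inside one stored procedure so that they commit together.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simulation only: two lists stand in for a VoltDB table and an export stream.
final class TradeWindowSketch {

    // Hypothetical trade row; the column choice is an assumption.
    static final class Trade {
        final long timestampMillis;
        final String symbol;
        final double price;

        Trade(long timestampMillis, String symbol, double price) {
            this.timestampMillis = timestampMillis;
            this.symbol = symbol;
            this.price = price;
        }
    }

    private final List<Trade> inMemoryTable = new ArrayList<>();
    private final List<Trade> exportStream = new ArrayList<>();

    void insertTrade(Trade trade) {
        inMemoryTable.add(trade);
    }

    // Moves every trade older than the cutoff from the table to the stream
    // and returns how many rows were moved.
    int ageOut(long cutoffMillis) {
        int moved = 0;
        Iterator<Trade> it = inMemoryTable.iterator();
        while (it.hasNext()) {
            Trade trade = it.next();
            if (trade.timestampMillis < cutoffMillis) {
                exportStream.add(trade); // "insert into the stream"
                it.remove();             // "delete the corresponding row"
                moved++;
            }
        }
        return moved;
    }

    int tableSize() {
        return inMemoryTable.size();
    }

    int exportedCount() {
        return exportStream.size();
    }

    public static void main(String[] args) {
        TradeWindowSketch window = new TradeWindowSketch();
        window.insertTrade(new Trade(1_000L, "VOLT", 10.0));
        window.insertTrade(new Trade(2_000L, "VOLT", 10.5));
        window.insertTrade(new Trade(3_000L, "VOLT", 11.0));
        int moved = window.ageOut(2_500L);
        System.out.println("moved=" + moved + " remaining=" + window.tableSize());
    }
}
```

Calling ageOut on a schedule with a cutoff of "now minus the window length" keeps only recent trades in memory while older rows flow out through the stream.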
The plumbing to transmit streaming data from VoltDB to Amazon is the VoltDB Kinesis Firehose Export Conduit, freely available at https://github.com/VoltDB/export-kinesis.
A stream in VoltDB can be thought of as a virtual table. This table doesn’t hold state; rather, it defines an outbound schema that can be consumed, that is, streamed, by an external system. This enables a VoltDB application to easily, and transactionally, export data from VoltDB to another system – in this example, Kinesis – or to many other systems such as Kafka, HDFS, and relational databases.
Here’s an example of a simple stream definition:
In Kinesis Firehose, we’ve already created the stream “streamdata” and configured its handling in the world of Amazon Services. This involves creating the delivery stream, directing the data flow through Amazon S3, and then specifying that the rows of data are inserted into Amazon Redshift, a Postgres-like database for persistent storage hosted on an EC2 cluster.
In “demo” applications in AWS, the default data transfer rate to Amazon through the Kinesis Firehose service is quite limited: 2,000 transactions per second, 5,000 records per second, or 5 MB of data per second. However, to support our development and testing at typical VoltDB transaction rates, the Amazon Kinesis team graciously provisioned much higher transfer rates, and will likewise work with business teams to support a wide range of data transfer requirements.
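To get a feel for which of those default limits binds first, here is a small back-of-the-envelope calculation in Java. The three limits are the ones quoted above; everything else is an assumption made for illustration: the class name FirehoseLimitsSketch, reading "5 MB" as 5,000,000 bytes, and treating a "transaction" as one API call that can carry several rows.

```java
// Rough arithmetic only; the limits are the defaults quoted in this post.
final class FirehoseLimitsSketch {
    static final long MAX_CALLS_PER_SEC = 2_000;       // "transactions" per second
    static final long MAX_RECORDS_PER_SEC = 5_000;     // records per second
    static final long MAX_BYTES_PER_SEC = 5_000_000L;  // assumption: 5 MB = 5,000,000 bytes

    // Highest sustainable rows per second for a given row size and batch size.
    static long maxRowsPerSecond(long rowBytes, long rowsPerCall) {
        if (rowBytes <= 0 || rowsPerCall <= 0) {
            throw new IllegalArgumentException("rowBytes and rowsPerCall must be positive");
        }
        long byCalls = MAX_CALLS_PER_SEC * rowsPerCall;
        long byBytes = MAX_BYTES_PER_SEC / rowBytes;
        return Math.min(MAX_RECORDS_PER_SEC, Math.min(byCalls, byBytes));
    }

    public static void main(String[] args) {
        // One 200-byte row per call: the call limit binds.
        System.out.println(maxRowsPerSecond(200, 1));
        // Ten 200-byte rows per call: the record limit binds.
        System.out.println(maxRowsPerSecond(200, 10));
        // Ten 2,000-byte rows per call: the byte limit binds.
        System.out.println(maxRowsPerSecond(2_000, 10));
    }
}
```

Under these assumptions, small rows sent one at a time run into the call limit, batched small rows run into the record limit, and large rows run into the byte limit.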
With provisioning for better performance from Amazon, our export streaming speed exceeded 28,000 rows per second. That’s still slower than VoltDB’s transactional capability, but as described below, the database buffers rows to disk and sends them to the Amazon consumer at its maximum consumption rate.
Behind the scenes in VoltDB, the streaming service is capable of buffering data on disk to match slower consumers without reducing primary database performance. In a VoltDB application, rows are inserted into the stream, just as you would insert data into a relational table. In the background the VoltDB export stream manager delivers the rows to the stream consumer or consumers at their maximum rate. This is handled with VoltDB reliability.
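The decoupling described above can be illustrated with a toy buffer in Java. This is not how VoltDB implements export: an in-memory queue stands in for the on-disk buffer, and the class ExportBufferSketch and its methods are hypothetical names. The point is only that inserts never wait on the consumer, while the consumer drains at whatever rate it can sustain.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Illustration only: an in-memory queue stands in for an on-disk export buffer.
final class ExportBufferSketch {
    private final ArrayDeque<String> pending = new ArrayDeque<>();

    // Producer side: always accepts the row, regardless of consumer speed.
    void insert(String row) {
        pending.addLast(row);
    }

    // Consumer side: hands over at most maxRows rows, oldest first.
    List<String> drain(int maxRows) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < maxRows && !pending.isEmpty()) {
            batch.add(pending.pollFirst());
        }
        return batch;
    }

    // Rows accepted but not yet delivered.
    int backlog() {
        return pending.size();
    }

    public static void main(String[] args) {
        ExportBufferSketch buffer = new ExportBufferSketch();
        // Each simulated second the application inserts 5 rows,
        // but the consumer can only take 2.
        for (int second = 1; second <= 3; second++) {
            for (int i = 0; i < 5; i++) {
                buffer.insert("row-" + second + "-" + i);
            }
            int sent = buffer.drain(2).size();
            System.out.println("second " + second + ": sent " + sent
                    + ", backlog " + buffer.backlog());
        }
    }
}
```

When the insert rate drops below the drain rate, the same drain loop works the backlog down again without any change on the producer side.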
Streaming to VoltDB from Amazon Kinesis Streams
In the second case, VoltDB is the data consumer, ingesting rows of data from Amazon Kinesis streams and inserting them into database tables transactionally. This data can be ingested using default VoltDB insert transactions, or the application can specify custom business logic via a transactional Stored Procedure, for example to handle validation, data cleansing, aggregation, or routing.
In this example use case, Amazon Kinesis streams consume click data from numerous geographically distributed web servers. VoltDB consumes this stream data, validating and “sessionizing” the data using stored procedures and inserting the processed rows into appropriate tables and materialized views for aggregation and real-time analysis. It’s easy to imagine export streaming here as well, since a downstream data warehouse is a likely next stop.
We specify the VoltDB consumer connection to the Amazon Kinesis stream in the “import” section of the database deployment file:
Note the “procedure” property in the deployment file. This is the stored procedure that processes the incoming data. VoltDB calls the procedure for each incoming row. The procedure can validate the data, query other tables to “sessionize,” and insert the row into a database table. This happens as a transaction, atomic by definition, so if there are problems or errors, all the processing will be rolled back.
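To make the per-row logic concrete, here is a minimal sketch of the validate-then-sessionize step in plain Java. It is not a VoltDB stored procedure and does not use the VoltDB API: the class ClickSessionizerSketch, the row fields (user ID, timestamp, URL), the session ID format, and the 30-minute inactivity gap are all assumptions chosen for illustration. Two maps stand in for the tables the real procedure would query and update.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-row processing; maps stand in for database tables.
final class ClickSessionizerSketch {
    // Assumption: 30 minutes of inactivity starts a new session.
    static final long SESSION_GAP_MILLIS = 30 * 60 * 1000L;

    private final Map<String, Long> lastSeen = new HashMap<>();
    private final Map<String, Integer> sessionNumber = new HashMap<>();

    // Validates one click row and returns its session ID, e.g. "u1#2".
    // Validation runs before any state changes, so a rejected row leaves
    // nothing behind (a stand-in for transactional rollback).
    String process(String userId, long timestampMillis, String url) {
        if (userId == null || userId.isEmpty()
                || url == null || url.isEmpty()
                || timestampMillis < 0) {
            throw new IllegalArgumentException("invalid click row");
        }
        Long previous = lastSeen.get(userId);
        int session = sessionNumber.getOrDefault(userId, 0);
        if (previous == null || timestampMillis - previous > SESSION_GAP_MILLIS) {
            session++; // first click, or the gap was too long: new session
        }
        lastSeen.put(userId, timestampMillis);
        sessionNumber.put(userId, session);
        return userId + "#" + session;
    }

    int knownUsers() {
        return lastSeen.size();
    }

    public static void main(String[] args) {
        ClickSessionizerSketch sessionizer = new ClickSessionizerSketch();
        System.out.println(sessionizer.process("u1", 0L, "/home"));
        System.out.println(sessionizer.process("u1", 60_000L, "/cart"));
        System.out.println(sessionizer.process("u1", 60_000L + SESSION_GAP_MILLIS + 1, "/home"));
    }
}
```

In the real procedure the database provides the atomicity; the sketch only mimics it by validating before touching any state.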
VoltDB automatically creates “default” procedures, including a simple INSERT. If the goal is to get the row into a table as fast as possible, it can be done without any coding at all.
We’ve seen how VoltDB can consume (import) streaming data from Amazon Kinesis, produce (export) streaming data to Amazon Kinesis, or do both if that’s the application design requirement.
It is easy to import or stream export data with no application coding at all. If complex processing and aggregation are required, that’s straightforward too, as the example in this post illustrates.
Give it a try today and let us know what you think. Download VoltDB here.
Amazon Kinesis, both Streams and Firehose: https://aws.amazon.com/kinesis/