Stream views — the untold story
Stream views are a recent addition to VoltDB. With stream views, we get the utility of the aggregations – real-time analytics – without the memory footprint of their source tables.
A stream view is a materialized view of data that’s inserted into your stream tables. With VoltDB stream views, the aggregations and other columns in the materialized view are updated with each SQL INSERT into the source tables, same as with any materialized view. But with streams, there’s no underlying, persistent set of rows, as there is with traditional views.
First, let’s explain streams with materialized views. While there are other uses of streams as connections to external data consumers, in this use case, inserting rows into the stream, as mentioned above, drives the aggregations and columns in the materialized view without any of the memory overhead of the source data. Streams are defined like any other table – each stream has a schema with column names and data types – and data is inserted by normal SQL statements or stored procedures. Insert a row into a VoltDB stream table and when the transaction is committed, the materialized view has been updated.
And consistent with VoltDB semantics, the stream manager keeps things reliable: if the consumer disappears due to error or link failure, the stream manager will resume with the first uncommitted – that is, undelivered – row when the consumer returns. If VoltDB stops for some reason, similarly, the stream manager will reconnect to the consumer and resume streaming.
Combine that stream with a materialized view and we have a stream view. A materialized view is typically an aggregation of data from one or more tables. Aggregate functions include minimum, maximum, counts, sums, averages – held in other tables. VoltDB provides the convenience of updating the aggregations as rows of data as the underlying sources are inserted, updated or deleted. Stream views maintain the same aggregations, but the underlying data source is ephemeral. So as noted above, we get the utility of the aggregations – real-time analytics – without the memory footprint of the source.
Putting the stream and the view together is quite simple. See the “auction” project in VoltDB documentation for a working example. We won’t dissect that entire project here but will look at the stream and its view. Here is the underlying stream table.
We specify the destination for the rows inserted into BID_STREAM by “TARGET”, in this case, “newfile”. That connection is completed in the VoltDB deployment file “export” section.
In this case, the data goes to a file in “csv” format. It’s important to understand that the data doesn’t have to “go” anywhere. If there’s not a “target” in the stream definition, or no export section in the deployment file, the stream materialized view still maintains its aggregate view of the data that’s been inserted into its source table or tables, if the populating SQL “joins” tables.
The final piece in the stream view is the definition of the materialized view itself:
The stream view maintains the auction item identifier, the number of bids for it and highest and lowest prices.
The stream view has one row that aggregates the “action” for each auction item, so we’re retaining the salient data for each item without saving all the bids in its source table. So a simple select shows the current status of all the auctions:
This stream view is pretty minimal. All the power of SQL is available so if we want the item name rather than its number, we create a join with the item table:
To recap, VoltDB stream views allow you to maintain real-time aggregations on data that has been streamed out of VoltDB. But with stream views, there’s no underlying, persistent set of rows, as there is with traditional views. This is a great way to maintain “historical” aggregations without the overhead of traditional materialized views and without having to reference an external system for that information.
Download VoltDB today, give stream views a try, and let us know what you think!