VoltDB is an in-memory database that excels at handling massive volumes of read and write operations in real-time.
However, performing high throughput database operations is often only one aspect of the larger business context where data needs to transition from system to system as part of an overall infrastructure. VoltDB provides powerful interoperability features that allow you to select, enrich and distribute data to downstream file systems and databases.
The target for exporting data from VoltDB may be another database, a repository (such as a sequential log file), or a process (such as a system monitor or accounting system). No matter what the target, VoltDB helps automate the process for you. This post explains how to plan for and implement the exporting of live data using VoltDB.
VoltDB lets you automate the export process by specifying certain tables in the schema as sources for export as part of the project definition file. At runtime, any data written to the specified tables is sent to the export connector, which manages the exchange of the updated information to a separate receiving application. The following diagram illustrates the basic structure of the export process, where Tables B and D are specified as export tables.
Note that you, as the application developer, do not need to modify the schema or the client application to turn exporting of live data on and off. The application’s stored procedures insert data into the export-only tables; but it is the deployment file that determines whether export actually occurs at runtime.
Exporting Live Data
When a stored procedure uses an SQL INSERT statement to write data into an export-only table, rather than storing that data in the database, it is handed off to the connector when the stored procedure successfully commits the transaction. Export-only tables have several important characteristics:
- Export-only tables let you limit the export to only the data that is required. For example, in the preceding example, Table B may contain a subset of columns from Table A. Whenever a new record is written to Table A, the corresponding columns can be written to Table B for export to the remote database.
- Export-only tables let you combine fields from several existing tables into a single exported table. This technique is particularly useful if your VoltDB database and the target of the export have different schemas. The export-only table can act as a transformation of VoltDB data to a representation of the target schema.
- Export-only tables let you control when data is exported. Again, in the previous example, Table D might be an export-only table that is an exact replica of Table C. However, the records in Table C are updated frequently. The client application can choose to copy records from Table C to Table D only when all of the updates are completed and the data is finalized, significantly reducing the amount of data that must pass through the connector.
Of course, there are restrictions to export-only tables. Since they have no storage associated with them, they are for INSERT only. Any attempt to SELECT, UPDATE, or DELETE export-only tables will result in an error when the project is compiled.
Planning your Export Strategy
The important point when planning to export data, is deciding:
- What data to export
- When to export the data
It is possible to export all of the data in a VoltDB database. You would do this by creating export-only replicas of all tables in the schema and writing to the export-only table whenever you insert into the normal table. However, this means the same number of transactions and volume of data that is being processed by VoltDB will be exported through the connector. There is a strong likelihood, given a high transaction volume, that the target database will not be able to keep up with the load VoltDB is handling. As a consequence you will usually want to be more selective about what data is exported when.
If you have an existing target database, the question of what data to export is likely decided for you (that is, you need to export the data matching the target’s schema). If you are defining both your VoltDB database and your target at the same time, you will need to think about what information is needed “downstream” and create the appropriate export-only tables within VoltDB.
The second consideration is when to export the data. For tables that are not updated frequently, inserting the data to a complementary export-only table whenever data is inserted into the real table is the easiest and most practical approach. For tables that are updated frequently (hundreds or thousands of times a second) you should consider writing a copy of the data to an export-only table at an appropriate milestone.
How Export Works
The export connector implements a loosely coupled approach to extracting export data from a running VoltDB database. When export is enabled at runtime:
- Insert operations to the database tables identified as export-only in the project definition file are queued to the export connector.
- An export receiver establishes a link to the connector through one of the standard TCP/IP ports (either the client or admin port). The receiver then issues POLL requests.
- The connector responds to the POLL requests with the next queued data block (or an empty block if the queue is empty).
- The receiver application is then responsible for receiving the data and writing it to the appropriate destination.
- Finally, the receiver sends an ACK message acknowledging completion of the export (at which point the connector can remove it from the queue) before polling for the next data block.
The following diagram shows the interaction between the VoltDB database, the connector, and the receiver application.
The export function queues and passes data to the connector automatically. You do not need to do anything explicitly to start the connector; it starts and stops when the database starts and stops. The connector and the receiver use a series of poll and ack requests to exchange the data over the TCP port.
The receiving application decides what is done with the data it receives from the connector. For example, one receiver might write the serialized data to a file, while another inserts it into an analytic database. Only one receiver can connect to the connector at a time. But it is possible to change the destination of the export by using different receivers.
It is also important to note that the receiver application must create connections to all nodes in the database cluster, since each node creates its own instance of the connector.
For the export process to work, it is important that the connector and receiver keep up with the queue of exported information. If too much data gets queued to the connector by the export function without being fetched by the receiver, the VoltDB server process consumes increasingly large amounts of memory.
If the receiver does not keep up with the connector and the data queue fills up, VoltDB starts writing overflow data in the export buffer to disk. This protects your database in several ways:
- If the receiver fails, writing to disk helps VoltDB avoid consuming too much memory while waiting for the receiver to restart.
- If the database is stopped, the export data is retained across sessions. When the database restarts and the receiver reconnects, the connector will retrieve the overflow data and reinsert it in the export queue.
It is important to note that VoltDB only uses the disk storage for overflow data. However, you can force VoltDB to write all queued export data to disk by either calling the @Quiesce system procedure or by requesting a blocking snapshot. This means it is possible to perform an orderly shutdown of a VoltDB database and ensure all data (including export data) is saved.
The Export-To-File Receiver is an application specific to the export destination, e.g. files on disk, a data warehouse, Hadoop, etc. When you start the export-to-file receiver, you specify one or more of the database cluster nodes for the receiver to query for information (using the –servers argument). The receiver queries these nodes, one at a time, until it receives a response. Part of the response it receives is a description of the cluster, including a list of nodes and available ports. The receiver then creates connections to every node in the cluster using a specified port (either the client port or the admin port).
Note that you don’t have to specify all of the nodes of the cluster on the command line. You only have to specify one. The receiver then discovers the cluster configuration from the first node it reaches. However, you can specify multiple nodes in case one or more of the nodes is unavailable when the receiver starts.
Once the receiver connects to the cluster, it starts to poll and ack for export data. The receiver “decodes” the export stream from its serialized form into the appropriate datatypes for each column in the table. It then writes the data out as either comma-separated or tab-separated files (one per table) “rolling” over to new files periodically.
If the receiver loses connection to any of the nodes in the cluster (either because of a node failure or a shutdown), it disconnects from the cluster and repeats the initial discovery process, using the information it collected from the original connection. In other words, it will query every node in the cluster, one at a time, until it determines the new configuration. If the receiver cannot reach any of the nodes (for example, if the cluster is temporarily shut down) it will periodically retry the discovery process until the database cluster comes back online.
Once the cluster comes back, the receiver resumes export operations, picking up with the last data packet it received prior to the interruption. This allows the export process to continue without operator intervention even across network disruptions, node failures, and database sessions.