Fast Streaming Data

The Challenge of Fast Streaming Data

Data is fast before it's big. Big data is data at rest; fast data is data in motion. Big data is the historical data businesses collect about their customers, operations, and events. Fast data is live streaming data that carries information about the person or process that generated it -- it comes from real-time customer interactions and operations.

When we talk about fast data, we're not measuring volume in the gigabytes, terabytes, and petabytes common to data warehouses. We're measuring volume as a rate: megabytes per second, gigabytes per hour, or terabytes per day.

Fast data means velocity as well as volume -- thousands of events per second, millions of transactions per hour -- which gets to the core of the difference between big data and fast data. The challenge for today's businesses is to capture intelligence from streaming data while it's still live, before it ages and flows into the big data "lake."

The Fast Data Difference

Fast data is different from big data and has different requirements. It's built on a different technology stack, one that can analyze incoming data, decide, act, and extract value -- recommendations, decisions, and actions -- as fast as the data arrives, typically in milliseconds.

Fast streaming data is generated by thousands of unique data sources (people, smartphones, sensors), contributing data at high velocity, in high volume. It contains valuable potential insights and can be augmented in real time with contextual information, but these insights are perishable and the opportunity to act on them is lost when the moment passes.

Today's applications need a fast data stack that isn't built just to capture and pipe streaming data, but also to enrich it, add context, personalize it, and act on it before it becomes data at rest. These high-velocity applications require the ability to analyze and transact on streaming data.

Batch vs. Continuous Processing: Which is Best?

Batch has been the prevailing approach to processing big data for years. It's an efficient way to process large volumes of data: you collect, then process, then report. But while batch has gotten faster, it's still not real time, and it falls short of what fast data applications need. If you want to take in real-time data and output recommendations, decisions, and analyses in milliseconds, you need a different approach.

Fast data applications continuously ingest, analyze and make decisions/take action on each event as it flows through the system. The benefit of this approach is that incoming data is processed in real time on a per-person or per-event basis, and applications can deliver richer, more individualized interactions. The data is eventually exported to a long-term data store.
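
To make the contrast concrete, here is a minimal sketch in Java of per-event continuous processing: each event is analyzed and acted on the moment it arrives, rather than collected for a later batch job. The Event shape, the running totals, and the flagging threshold are hypothetical placeholders, not a prescription.

    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;

    // Continuous processing: analyze and act on each event as it arrives,
    // instead of collecting events for a later batch job.
    public class ContinuousProcessor {
        record Event(String userId, double amount) {}

        private final BlockingQueue<Event> stream = new LinkedBlockingQueue<>();
        private final Map<String, Double> totals = new ConcurrentHashMap<>();

        // Feed events in from any source (socket, queue, sensor feed).
        public void submit(Event e) { stream.add(e); }

        public void run() throws InterruptedException {
            while (true) {
                Event e = stream.take();                 // ingest one event
                double total = totals.merge(e.userId(), e.amount(), Double::sum); // analyze in memory
                if (total > 10_000.0) {                  // decide and act immediately
                    System.out.println("Flagging user " + e.userId());
                }
            }
        }
    }

In a batch system, the same totals would be computed hours later from the data lake; here, the decision happens while the event -- and the customer -- are still live.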

Batch processing has its place, but for real-time analytics and action “in the moment,” continuous processing is often a superior approach.

Advances in in-memory operational databases and high-speed data ingestion/export technologies make it easier and more practical than ever to build fast data applications.

Operation            | Batch Processing                 | Continuous Processing
---------------------|----------------------------------|------------------------------------------
Type of Data         | Big Data (at rest)               | Fast Data (in motion)
Data Analyzed        | Large historical data sets       | Incremental analysis of new events
Latency              | Minutes or more                  | Milliseconds
Type of Applications | Reporting, Business Intelligence | Operational, Mission Critical, Real-Time

Understanding the Fast Data Stack

The most common use cases for fast data applications fall into four areas:

- Real-time recommendation engines and hyper-personalization applications that detect and act on individual customer needs in real time
- Real-time, "down to the last dollar" resource management applications, e.g., bid and order management
- Per-event analytics with automated decision making to enforce a policy or authorization level, e.g., detecting credit card fraud or managing API calls and authorizations in real time
- Sensor data management in Internet of Things (IoT) applications

These use cases require a fast data stack that performs four functions: ingest, analyze, decide/act, and export.

Ingest

Data ingestion is the first stage in the fast data stack. The job of ingestion is to interface with the streaming data sources and to accept, transform, or normalize incoming data. Ingestion marks the first point at which data can be transacted against -- the first opportunity to apply the functions and processes that produce value from the data: insight, intelligence, and action.
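
As an illustration, here is a sketch of an ingest stage built on the Kafka Java consumer: it accepts raw records from a stream and normalizes them before handing them to the analytics stage. The topic name ("clicks") and the normalize() logic are hypothetical.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class IngestStage {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "fast-data-ingest");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("clicks"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> r : records) {
                        String normalized = normalize(r.value());
                        // hand the normalized event to the analyze stage here
                    }
                }
            }
        }

        // Hypothetical normalization: trim and lower-case the raw payload.
        static String normalize(String raw) {
            return raw.trim().toLowerCase();
        }
    }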

Analyze

As data is ingested, it is used by one or more analytic and decision engines to accomplish specific tasks on the streaming data. The challenge for the analysis and decision making portion of the fast data stack is to keep pace with the velocity of the data stream.

The streaming analytics engine needs to consume high-velocity data while continuously maintaining real-time results in the form of counters, aggregations, and leaderboards.
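
Here is a minimal sketch of that analytics state in Java: per-key counters updated once per event, with a top-N leaderboard derived from the live counts. The keys are hypothetical, and a production system would also window and expire this state.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    public class StreamingCounters {
        private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

        // Called once per ingested event; must keep pace with stream velocity.
        public void onEvent(String key) {
            counts.computeIfAbsent(key, k -> new LongAdder()).increment();
        }

        // Top-N leaderboard computed from the live counters.
        public List<Map.Entry<String, Long>> top(int n) {
            return counts.entrySet().stream()
                    .map(e -> Map.entry(e.getKey(), e.getValue().sum()))
                    .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                    .limit(n)
                    .toList();
        }
    }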

Decide & Act

Real-time decisions are used to influence the next step of processing. Real-time decision engines do a lot of work: they keep pace with the velocity of the data stream while executing complex logic, all quickly enough to close the real-time decision feedback loop.
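
A small, hypothetical sketch of per-event decision logic in Java, using a card-swipe velocity check as the policy (the thresholds are invented for illustration; the swipe counter would come from the analyze stage's live state):

    public class DecisionEngine {
        public enum Action { APPROVE, DECLINE, REVIEW }

        // swipesLastMinute comes from the analyze stage's live counters.
        public Action decide(double amount, long swipesLastMinute) {
            if (swipesLastMinute > 10) return Action.DECLINE; // likely fraud
            if (amount > 5_000.0)      return Action.REVIEW;  // route to manual review
            return Action.APPROVE;                            // normal path
        }
    }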

When you look at these requirements, you're probably looking at a transaction processing (OLTP) problem. OLTP problems are often solved with databases, but streaming problems may seem ill-suited to such an approach because of the limitations of traditional database technologies.

Unlike traditional databases, new in-memory OLTP databases like VoltDB can process streams of data and produce analyses and decisions in milliseconds. As a single integrated platform, an in-memory OLTP database reduces the complexity of building fast data applications: it eliminates the need to stitch together streaming systems and non-relational data stores, and it provides a familiar, proven interaction model -- SQL -- that simplifies application development and captures real-time analytics with industry-standard SQL-based tools.
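
VoltDB stored procedures are written in Java around SQL statements. The sketch below shows what a per-event transaction might look like: read live state, apply the decision logic, and record the result, all in one ACID transaction. The card_activity table and its columns are hypothetical, not a VoltDB sample schema.

    import org.voltdb.SQLStmt;
    import org.voltdb.VoltProcedure;
    import org.voltdb.VoltTable;

    // One swipe in, one transactional decision out.
    public class CheckSwipe extends VoltProcedure {
        public final SQLStmt countRecent = new SQLStmt(
            "SELECT COUNT(*) FROM card_activity WHERE card_id = ? AND swipe_time > ?;");
        public final SQLStmt record = new SQLStmt(
            "INSERT INTO card_activity (card_id, swipe_time, amount) VALUES (?, ?, ?);");

        public long run(String cardId, long now, double amount) {
            // Analyze: how many swipes in the last 60 seconds?
            voltQueueSQL(countRecent, cardId, now - 60_000);
            VoltTable[] results = voltExecuteSQL();
            long recentSwipes = results[0].asScalarLong();
            if (recentSwipes > 10) {
                return 0; // decline: velocity check failed
            }
            // Act: record the approved swipe in the same transaction.
            voltQueueSQL(record, cardId, now, amount);
            voltExecuteSQL(true);
            return 1; // approve
        }
    }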

Data Export

Once fast data analytics are complete, the data moves through the pipeline for later processing; data ingestion and export must flow at the same rate. Streaming analytics usually rely on an ingestion queue, and similar queues serve the export stage. For real-time decisions, which process fast data in a continuous query mode, an export function is needed to transform and distribute data to the big data warehouse/storage (OLAP) engine(s) of choice.
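
VoltDB ships its own export connectors for this stage; purely as a generic illustration, the sketch below uses the Kafka Java producer to publish a processed, decision-annotated event onto an export queue feeding the warehouse. The topic name and payload are hypothetical.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ExportStage {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish a decision-annotated event to the export queue
                // feeding the big data warehouse ("warehouse-export" is a
                // hypothetical topic).
                producer.send(new ProducerRecord<>("warehouse-export",
                        "card-123", "{\"action\":\"APPROVE\",\"amount\":42.50}"));
            }
        }
    }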

Apache Storm, Apache Kafka, Apache Spark Streaming, Amazon Kinesis

Four popular streaming systems have emerged over the past few years: Apache Storm, Apache Kafka, Apache Spark Streaming, and Amazon Kinesis.

Originally developed at BackType and open-sourced by Twitter, Apache Storm can reliably process unbounded streams of data at rates of millions of messages per second. Apache Kafka, developed by the engineering team at LinkedIn, is a high-throughput distributed message queue. Both systems address the need to process fast data. Kafka, however, stands apart.

Kafka was designed to be a message queue and to solve the perceived problems of existing technologies. It's sort of an über-queue with unlimited scalability, distributed deployments, multitenancy, and strong persistence. An organization could deploy one Kafka cluster to satisfy all of its message queueing needs.
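
For a flavor of that "one cluster, many queues" model, here is a sketch that uses Kafka's Java AdminClient to provision separate topics for different applications, each scaled independently by partition count. The topic names and sizing are hypothetical.

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class TopicSetup {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                admin.createTopics(List.of(
                        new NewTopic("clicks", 32, (short) 3),           // high-volume ingest stream
                        new NewTopic("warehouse-export", 8, (short) 3))) // export queue
                    .all().get(); // block until the topics exist
            }
        }
    }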

Apache Spark is a popular data processing framework, commonly deployed alongside Hadoop. Apache Spark Streaming is a way of using Spark for streaming analytics against micro-batches of streaming data, and is sometimes used to batch, sessionize, or otherwise transform real-time data. At its core, though, it's still batch processing, albeit in micro-batches.
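
The micro-batch model is easy to see in code. This sketch, written against Spark Streaming's Java API, slices a hypothetical socket stream into one-second batches and runs a small Spark job over each batch; results arrive per batch, not per event.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class MicroBatchWordCount {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatch");
            // The stream is sliced into one-second micro-batches.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

            JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
            JavaDStream<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator());
            words.countByValue().print(); // per-batch counts, not per-event decisions

            jssc.start();
            jssc.awaitTermination();
        }
    }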

Amazon Kinesis is a managed service similar to the Apache Kafka message queue. It was created for rapid ingestion of streaming data, and it supports both data ingestion and export.

At their core, these four alternatives deliver messages, but they don't support transactions, stateful operations, or queries. If you want to run analytics, make decisions, and take actions, you'll need to integrate multiple additional components or use a fast operational database.

Putting Fast Streaming Data to Work

As the volume and velocity of data grow, so do the challenges of building fast data applications. The fast data stack is emerging across industries as the way to build applications that process high-velocity streams of data before they accumulate in a big data lake.

This new stack has a unique purpose: to grab real-time data and output recommendations, decisions, and analyses in milliseconds. Over the next several years, the fast data stack will gain prominence and serve as a starting point for developers writing applications for streaming data.

An ACID compliant operational database like VoltDB, in combination with message queues like Kafka or Kinesis for data ingestion and export, can process each incoming event or request as a discrete transaction for analytics, decisions and real-time action or interaction.
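
Wiring the two together is straightforward. The sketch below drains events from a Kafka topic and processes each one as a discrete VoltDB transaction through VoltDB's Java client. The "swipes" topic, the payload format, and the CheckSwipe procedure are the hypothetical examples used earlier in this piece.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.voltdb.client.Client;
    import org.voltdb.client.ClientFactory;

    public class PipelineGlue {
        public static void main(String[] args) throws Exception {
            Client volt = ClientFactory.createClient();
            volt.createConnection("localhost");

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "pipeline");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("swipes"));
                while (true) {
                    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(100))) {
                        double amount = Double.parseDouble(r.value());
                        // One event in, one ACID transaction out.
                        volt.callProcedure("CheckSwipe", r.key(), System.currentTimeMillis(), amount);
                    }
                }
            }
        }
    }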

How VoltDB Solves the Fast Data Challenge

How is it that a fast operational database like VoltDB is so well suited to the fast streaming data challenge? There’s really nothing quite like VoltDB's architecture or product.

VoltDB is a unique combination: a fast in-memory OLTP database with multi-source ingestion and multi-target export, able to perform analytics on incoming streams of data and manage large volumes of transactions on live data, all in real time.

VoltDB is the commercial implementation of H-Store, a fast, natively scalable, fault-tolerant, transactional database designed by Dr. Michael Stonebraker and a team of senior computer scientists from MIT, Yale University, and Brown University. Read the original H-Store paper here: http://hstore.cs.brown.edu/papers/hstore-demo.pdf


Try VoltDB:

It shouldn't take weeks to begin building blazing-fast apps with real-time personalization and fast transactions. Developers: download VoltDB and spin through our Quick Start Guide in less than 30 minutes.
