16 FAQs on Machine Learning with True Real-Time Decisioning
Recently, we held a webinar, hosted by one of our engineers, entitled “Discussion & Demo — Machine Learning with True Real-Time Decisioning”. In it, we explored how intelligent real-time decisioning technology in machine learning helps you operationalize your analytics and take the next critical step in realizing your machine learning goals. It also featured a live demonstration showcasing machine learning for fraud prevention using VoltDB’s real-time data platform.
We had a number of excellent questions during that session, which we thought might be helpful to share for those who didn’t attend (or for those who want a second look). These questions are reproduced and answered below. For those who missed the webinar, you may view it on demand.
Editor’s Note: Unless otherwise noted, the question is coming from a webinar attendee via our chat functionality during the session. Answers were provided by VoltDB’s Doug Jauregui, Senior Solutions Engineer. These answers have been edited for clarity and grammar.
Q: Does VoltDB integrate with APIs?
A: Yes. We have client SDKs in a range of languages, including .NET, Java, PHP, Python, and C++, that can integrate with any application you need to develop.
Q: How does the solution integrate with Hadoop and real-time data sources?
A: We have connectors that connect directly and natively to Hadoop and pull data out, depending on the data storage you have. That might mean taking serialized JSON data and converting it into rows and columns. If you're leveraging VoltDB as a document store, it becomes a lot easier, because at that point you are storing the same payload, the same serialized data, in VoltDB. So connectors are the way: we have a Hadoop connector that allows you to do this in a synchronous or asynchronous fashion.
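To make the "serialized JSON into rows and columns" step concrete, here is a minimal sketch in plain Python. It is not the VoltDB Hadoop connector; the function name `json_to_rows`, the column names, and the sample payload are all hypothetical and exist only for illustration.

```python
import json

def json_to_rows(payload, columns):
    """Parse a JSON array of objects and return one tuple per object.

    Values are ordered by `columns`; a key missing from an object
    becomes None, which a database client would send as SQL NULL.
    """
    records = json.loads(payload)
    return [tuple(record.get(col) for col in columns) for record in records]

# Hypothetical payload: two card transactions, the second has no merchant.
payload = (
    '[{"card_id": 17, "amount": 42.5, "merchant": "acme"},'
    ' {"card_id": 18, "amount": 9.99}]'
)

rows = json_to_rows(payload, ["card_id", "amount", "merchant"])
for row in rows:
    print(row)
```

Each tuple could then be handed to whatever insert call your client library provides. In the document-store case described above, this flattening step is skipped and the JSON string itself is stored.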
Q: Does PMML/JPMML support any model, and does VoltDB support all the models?
A: We support all the mathematical model types available within the PMML standard, and as long as the PMML file adheres to the defined standard, we are able to load and leverage those models. You can do it the way I showed you through code, but there is also a command line interface where you simply run a command to load the PMML file, and it does all the code steps that I showed you. At that point, you're able to use the model within your stored procedures.
Q: What are the requirements to run VoltDB?
A: The hardware requirements depend on the sizing of your project, but our base reference architecture is a three-node setup. That means three separate environments have to be created so that you get the best of high availability and persistence, using what we call a replica factor, or k-safety. In terms of specs, we recommend starting with at least a quad-core or eight-core machine for production. Again, depending on the throughput and latency targets you need to meet, the node count may increase.
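As a rough sizing aid, the arithmetic behind k-safety can be sketched as follows. This assumes the commonly documented rule that a k-safety value of K keeps K+1 copies of every partition, so the number of unique partitions is (nodes × sites per host) / (K+1). The function name and the example of 8 sites per host are assumptions for illustration, not figures from the webinar.

```python
def unique_partitions(nodes, sites_per_host, k_factor):
    """Return how many unique partitions a cluster holds.

    Assumes each partition is stored k_factor + 1 times, spread
    across different nodes.
    """
    copies = k_factor + 1
    if copies > nodes:
        raise ValueError("k-safety needs at least k_factor + 1 nodes")
    total_sites = nodes * sites_per_host
    if total_sites % copies != 0:
        raise ValueError("total sites must divide evenly by k_factor + 1")
    return total_sites // copies

# Three-node reference setup, assuming 8 sites per host and K=1:
# 24 sites in total, each partition stored twice.
print(unique_partitions(nodes=3, sites_per_host=8, k_factor=1))  # 12
```

The trade-off is visible in the numbers: raising K improves resilience to node loss but reduces the number of unique partitions available for parallel work on the same hardware.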
Q: Does VoltDB have any in-cloud/SaaS version?
A: Yes, we do. We have a partnership with Amazon, so we're on the Amazon Marketplace. You can either set it up through the marketplace and pay a transactional cost, or use BYOL (Bring Your Own License) and load it up on Amazon yourself.
Q: Can queries be written dynamically by the end user on the data?
A: Yes, absolutely. We have a couple of use cases in which customers who had learned Oracle PL/SQL wanted to leverage that knowledge and carry it over seamlessly. So, customizations were made that allowed us to execute those PL/SQL statements and convert them into VoltDB queries and VoltDB stored procedures that could then be registered. That way, you can dynamically update VoltDB without having to touch the system.
Q: How does VoltDB help speed up the time to deploy models to production?
A: One of VoltDB's features is the ability to take that PMML file, as we saw in the demonstration, load it side-by-side with a stored procedure, and do that without having to go offline at any point. There is no other platform for real-time data and real-time decisions that lets you do this seamlessly and repeat the process continuously, in a scalable fashion, without going down.
Q: Do we have documentation on generating JPMML and how to integrate it into stored procedures?
A: There are different ways to generate the PMML. You can leverage the Spark ML libraries through something like PySpark, so that you express your formula in Python, stage that formula into a pipeline, and then fit it. You saw the list of the different mathematical model types; you can use any one of those in PySpark or Python code, fit the staged pipeline, and export the fitted pipeline as an XML model. Once you have that file, you can load it into VoltDB, where a Java PMML engine that is part of VoltDB instantiates the model in memory so that it can be used as a user-defined function, for example.
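To show what "an XML model loaded into memory and used as a function" looks like in the simplest case, here is a toy sketch using only the Python standard library. It is not the Java PMML engine inside VoltDB and it handles only a linear `RegressionModel`; the field names (`amount`, `tx_last_hour`, `score`), the coefficients, and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

PMML_NS = {"p": "http://www.dmg.org/PMML-4_4"}

# A hand-written PMML file describing: score = -1.5 + 0.25*amount + 0.5*tx_last_hour
PMML_DOC = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="toy fraud score"/>
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="tx_last_hour" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="tx_last_hour"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="-1.5">
      <NumericPredictor name="amount" coefficient="0.25"/>
      <NumericPredictor name="tx_last_hour" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

def load_linear_model(pmml_text):
    """Read the intercept and per-field coefficients from a PMML regression table."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find(".//p:RegressionTable", PMML_NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        pred.get("name"): float(pred.get("coefficient"))
        for pred in table.findall("p:NumericPredictor", PMML_NS)
    }
    return intercept, coefficients

def score(model, features):
    """Evaluate the linear model: intercept + sum(coefficient * feature value)."""
    intercept, coefficients = model
    return intercept + sum(coefficients[name] * features[name] for name in coefficients)

model = load_linear_model(PMML_DOC)  # parsed once, then kept in memory
print(score(model, {"amount": 10.0, "tx_last_hour": 3.0}))  # 2.5
```

The point of the sketch is the separation of concerns: the model is trained and exported elsewhere, and the serving side only parses the file once and then evaluates it per request.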
Q: Does VoltDB provide support for data collections or data analysis?
A: We do have support for data analysis. Absolutely. Regarding data collection, I would need a little more clarification from the individual who asked the question, but we are able to integrate, for example, with BI tools over JDBC or ODBC, so you can do things like dashboarding or generate reports using those tools against VoltDB.
Q: Does VoltDB support JSON data format?
A: Yes, it does. If you want to leverage VoltDB as a caching solution, as an entry point, say for read-only use cases, then you can configure it so that VoltDB is a key-value store, and yet you have all of the great benefits of a relational database system. That means you have the querying capability, you have the ability to use Java stored procedures, and you have, if you need it, ACID capability: not just against one document, which is NoSQL's limitation, but against thousands of documents, with the strictest isolation and the strictest ACID compliance.
(Editor’s Note: If you’d like to see an example of using JSON natively in VoltDB, check out this example on GitHub).
Q: What database is VoltDB built on?
A: VoltDB is a custom database that was designed and founded by Michael Stonebraker. It is designed as an in-memory system that automatically shards, or distributes, your data, partitioning the tables to maximize the system's ability to parallelize queries against that data. So, it is a proprietary system, but we do provide an open source version of VoltDB.
Q: How is the model loaded by VoltDB? Does every site receive its own in-memory copy or is there one shared copy per node?
A: Great question. Today, you can deploy it whichever way you would like. You could store it as a single copy on a SAN, on a physical storage system, so that it is loaded one time. So yes, there would perhaps be a slight additional hit in loading it up that one time. Once it's in memory, it stays in memory for the entire session that the VoltDB cluster is running.
Alternatively, you can load it into a table as a key-value store, so your PMML is actually stored in the table, and yes, at that point it would be replicated to every node so that it's always available, since each node needs access to that PMML data. This matters especially if you are refreshing that PMML file on a continual basis. In financial services, for example, they may have hundreds of features defined and may be adjusting certain criteria of those features, or the data itself may be changing so rapidly that they need to deploy PMML files perhaps as often as hourly.
Q: Referring to the CAP theorem, where does VoltDB stand practically: CA, CP or AP?
A: Regarding the CAP theorem, in an intra-cluster scenario, meaning one cluster, VoltDB provides consistency and partition tolerance. If you want availability along with partition tolerance, this is where you can leverage cross data center replication. Like a lot of the NoSQL technologies, we can cover the full range of the CAP theorem. That means you have consistency and partition tolerance within one cluster, and availability across multiple clusters that can be deployed in different data centers. This gives you, yes, an eventually consistent model between two clusters, but with availability built in as well. So, you can fulfill all three capabilities of the CAP theorem, but by leveraging multiple clusters.
Q: How does VoltDB differentiate from other real-time database providers?
A: VoltDB is in-memory. We do persist: we have snapshots of our data that can be configured and generated. We can have immediate consistency every millisecond if need be. We have asynchronous and synchronous capabilities to retrieve and load data. We are a shared-nothing architecture. We do shard, which allows us to distribute a deployment across multiple nodes to handle the throughput.
But at the end of the day, we are designed to make true real-time decisions.
This means operational-level solutions in which you need speed, throughput, and compute that runs as quickly as possible, and then the ability to natively connect to a middleware stack of software like Kafka, Hadoop and so on, to move that information somewhere in real time. The last part of that is key. The old methodology of store-then-query is going to quickly go away for these real-time, or true real-time, use cases, where you actually need to stream, make a decision, and execute something on that decision. If I need to roll back, then I need to have that ability, right? That is the full ACID capability, but you then also need to move that information, that decision, forward somewhere. That's how VoltDB separates itself from everybody else in the pack: true real-time decisioning capabilities that can be configured in different ways, whether as a document store or a schema-bound relational model.
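The stream, decide, execute, roll back, and forward pattern described here can be sketched in a few lines. This example uses Python's built-in `sqlite3` module purely as a stand-in for a transactional store and a `deque` as a stand-in for a downstream topic such as Kafka; it is not VoltDB code, and the table, the `decide` function, and the approval rule are hypothetical.

```python
import sqlite3
from collections import deque

# isolation_level=None puts sqlite3 in autocommit mode, so the
# BEGIN / COMMIT / ROLLBACK statements below control the transaction.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE accounts (card_id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
db.execute("INSERT INTO accounts VALUES (17, 100.0)")

outbox = deque()  # stand-in for a downstream stream or topic

def decide(event):
    """Apply one incoming event inside a transaction and forward the decision."""
    db.execute("BEGIN")
    try:
        db.execute(
            "UPDATE accounts SET balance = balance - ? WHERE card_id = ?",
            (event["amount"], event["card_id"]),
        )
        row = db.execute(
            "SELECT balance FROM accounts WHERE card_id = ?", (event["card_id"],)
        ).fetchone()
        if row is None or row[0] < 0:
            raise ValueError("unknown card or insufficient funds")
        db.execute("COMMIT")
        decision = "approved"
    except ValueError:
        db.execute("ROLLBACK")  # undo the debit; stored state is unchanged
        decision = "declined"
    # Forward the decision regardless of outcome.
    outbox.append({"card_id": event["card_id"], "decision": decision})
    return decision

print(decide({"card_id": 17, "amount": 60.0}))  # approved
print(decide({"card_id": 17, "amount": 60.0}))  # declined
```

The second event is rolled back, so the balance stays at 40.0, yet both decisions still reach the outbox. In a real deployment the decision logic would run inside the database next to the data rather than in a client process.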
Q: Is there a functional difference between the community edition and enterprise edition?
A: There are some differences relating to cross data center replication and partitioning strategies, so you get much more robust capabilities with the enterprise edition than you do with the community edition. In addition, we are able to provide immediate support with the enterprise edition.
(Editor’s Note: VoltDB also offers support for our Open Source Community Edition)
Q: Is VoltDB a database on its own?
A: Yes, VoltDB is a database on its own.