Testing VoltDB Against PostgreSQL
A recent blog post described SQLCoverage, a VoltDB-developed Python tool that tests VoltDB by running millions of queries using randomized data and comparing the results with the those of identical queries run against HSQLDB, an open-source relational database. This blog post continues that discussion by describing recent extensions of SQLCoverage to compare VoltDB query results against PostgreSQL, including testing Geospatial query results against PostgreSQL’s PostGIS extension.
Testing VoltDB Against PostgreSQL
As the column types and query types supported by VoltDB expand, we need to be able to test SQL queries that are not supported by HSQLDB. A major example of this is the Geospatial features added in version 6.0. We chose PostgreSQL, aka Postgres, to compare against for its wide array of features. PostgreSQL is a widely-used open source relational database originally developed by Michael Stonebraker, co-founder of VoltDB.
In October 2015 I began modifying the SQLCoverage test program to allow it to compare VoltDB query results against PostgreSQL query results – while still retaining the option to compare against HSQLDB. The main challenge in this was that certain types of queries return slightly different results from VoltDB and PostgreSQL. One example is the treatment of NULL values in a query that uses an ORDER BY clause. VoltDB treats NULL values as if they were the smallest values, so they are always sorted first, unless the DESC keyword is used, in which case they are always sorted last. PostgreSQL, by default, does the opposite. Either approach is consistent with the SQL standard. For example, if you run this query (taken from the basic-ints test suite) in VoltDB:
SELECT ID, BIG B16 FROM P1 A ORDER BY ID, 2 DESC
and then you run it in PostgreSQL, you will get different results if any NULL values are present. In order to get the same result from PostgreSQL, we instead run this query:
SELECT ID, BIG B16 FROM P1 A ORDER BY ID NULLS FIRST, 2 DESC NULLS LAST
If ordering of NULL values had been the only difference between VoltDB and PostgreSQL, we could have used the normalizer, as described in the recent post on SQLCoverage with HSQLDB; however, since it was one of several issues, we treated it like the others.
In addition to ORDER BY, several other clauses found in queries need to be modified to get the same answer from PostgreSQL that you get from VoltDB. For timestamp columns, where VoltDB uses
PostgreSQL instead uses:
For integer columns (including types TINYINT, SMALLINT, INTEGER, and BIGINT), the AVG, CEILING, and FLOOR functions return an integer value in VoltDB, but in PostgreSQL they return a NUMERIC value. For string concatenation, VoltDB allows the use of either the || or the + operator, but PostgreSQL allows only the former. Hence, queries involving any of these things must be modified before running them in PostgreSQL.
In addition, there are differences in DDL definitions between VoltDB and PostgreSQL:
- PostgreSQL does not support a TINYINT (one-byte integer) type, so any TINYINT columns must be changed to SMALLINT (a two-byte integer).
- PostgreSQL does not support the syntax VARCHAR(n BYTES), where n is an integer specifying the number of bytes to be used for storing the column; only the more common syntax VARCHAR(m), where m is an integer specifying the number of characters to be stored, is supported (VoltDB supports both syntaxes).
- Since PostgreSQL does not use partitioned tables for fast performance (a feature unique to VoltDB), there is no ASSUMEUNIQUE keyword; the closest available keyword is simply UNIQUE.
- PostgreSQL has no exact equivalent for VoltDB’s VARBINARY column type; the closest available is the BYTEA type.
- PostgreSQL uses certain synonyms for some column types, e.g. int2 for SMALLINT, int4 for INTEGER, int8 for BIGINT, float8 for FLOAT, and text for VARCHAR. Since these column-type synonyms are sometimes returned by the JDBC driver’s getColumnTypeName method, we need to know how to interpret them.
To handle these differences between VoltDB and PostgreSQL, the PostgreSQLBackend.java class transforms SQL statements (including DDL) that work in VoltDB to modified statements that work in PostgreSQL. (It also creates and, when finished, closes a JDBC connection to a PostgreSQL database.) It does this by defining a regex pattern for each of the specific SQL statement types described above, and a QueryTransformer object that describes how that type of statement should be modified to work in PostgreSQL. The QueryTransformer class is defined in NonVoltDBBackend.java, the base class of PostgreSQLBackend.
Using these transformations, it is now possible to run the SQLCoverage program to test VoltDB against PostgreSQL (rather than HSQLDB). We now run such tests daily in Jenkins against the “master” branch.
Testing VoltDB’s Geospatial Functionality Against PostGIS
Since HSQLDB does not support Geospatial data, using SQLCoverage to test the correctness of the Geospatial features added in version 6.0 relies solely on testing against PostgreSQL, and in particular, its PostGIS extension. For this purpose, five new test suites were added which run solely against PostgreSQL/PostGIS. Four of these test suites are based on the template files discussed in the previous blog post, in the context of testing integer and string (VARCHAR) columns. Test suites geo-basic-point and geo-basic-polygon run the query templates found in basic-template.sql, just as the basic-ints and basic-strings test suites do; test suites geo-advanced-point and geo-advanced-polygon run the query templates found in advanced-template.sql, just as the advanced-ints and advanced-strings test suites do; but using columns of the new GEOGRAPHY_POINT and GEOGRAPHY types, respectively. Values are defined for those column types in configure-for-geo-point.sql and configure-for-geo-polygon.sql, respectively.
A fifth test suite, geo-basic, is not based on any previous template and focuses on tests of the new Geospatial functions, as well as creating unusual points and polygons, e.g. at the North or South pole, on the Antimeridian (roughly, the International Date Line), or polygons that are very large or very small.
In addition to adding new test suites, the SQLCoverage test program has been adapted to work with PostGIS in three ways. First, some queries need to be modified to run in PostGIS. For example, this query in VoltDB:
SELECT DISTANCE(PT1, PT2) FROM P1
(where PT1 and PT2 are GEOGRAPHY_POINT columns) needs to be changed to this for PostGIS:
SELECT ST_DISTANCE(PT1, PT2, FALSE) FROM P1
The extra “FALSE” argument tells PostGIS to use the optional spherical Earth model so results more closely match those of VoltDB, which always uses a spherical Earth model. Similar modifications need to be made to the following VoltDB Geospatial functions (some with additional arguments, as above):
- DISTANCE: ST_DISTANCE
- CONTAINS: ST_COVERS
- AREA: ST_AREA
- CENTROID: ST_CENTROID
- DWithin: ST_DWithin
- AsText: ST_AsText
- PointFromText: ST_GeographyFromText
- PolygonFromText: ST_GeographyFromText
- LONGITUDE: ST_X
- LATITUDE: ST_Y
- NumPoints: ST_NPoints
- NumInteriorRings: ST_NumInteriorRings
- IsValid: ST_IsValid
- IsInvalidReason: ST_IsValidReason
- CAST(G AS VARCHAR), where G is a GEOGRAPHY_POINT or GEOGRAPHY column: ST_AsText
Also, DDL modifications need to be made for GEOGRAPHY_POINT and GEOGRAPHY column types.
In addition, the normalizer was extended to round off result values (columns or expressions), not only of type FLOAT (note that many of the Geospatial functions mentioned above return a FLOAT value, namely DISTANCE, AREA, LONGITUDE, and LATITUDE), but also of type VARCHAR in those cases where the VARCHAR value was a representation of GEOGRAPHY_POINT or GEOGRAPHY type. For example, the AsText function returns a VARCHAR value, which may be something like ‘POINT (1.23456 -6.54321)’ or ‘POLYGON (0.0 0.0, 1.0 0.0, 0.0 1.0, 0.0 0.0)’. Numerical values within those strings need to be rounded off sufficiently to match.
In the case of Geospatial data, because of differences between VoltDB’s and PostGIS’s models of the Earth, this usually means rounding to eight significant digits. In certain cases, VoltDB rounds to four digits because PostGIS’s ST_CENTROID function, which is analogous to VoltDB’s CENTROID function, is geometric, not geographic, which means it assumes a flat Earth.
Also, for Geospatial types, the default result format in a SELECT statement is different between VoltDB and PostGIS. In VoltDB, if you SELECT a GEOGRAPHY_POINT or GEOGRAPHY column, its value will be returned in Well-known text (WKT) format; in PostGIS, its value will be returned in Well-known binary (WKB) format. Both formats are standardized representations of Geospatial data, but WKT is much easier for a human to understand. Consequently, in order for results to match between VoltDB and PostGIS, the SQLCoverage test program wraps a call to the AsText function around any GEOGRAPHY_POINT or GEOGRAPHY column in the SELECT list, assuming there is not already a function wrapped around it.
For example, the simple “B1” query template discussed in the recent blog post, e.g., in the basic-strings test suite:
SELECT VCHAR B1 FROM P1
becomes, in the geo-basic-point and geo-basic-polygon test suites, respectively:
SELECT AsText(PT1) B1 FROM P1
SELECT AsText(POLY1) B1 FROM P1
where PT1 is a GEOGRAPHY_POINT column and POLY1 is a GEOGRAPHY column. Note the addition of the AsText function, which was not present in earlier versions of this query. Transformed to run in PostGIS, these two queries become:
SELECT ST_AsText(PT1) B1 FROM P1
SELECT ST_AsText(POLY1) B1 FROM P1
The SQLCoverage test program is an important part of testing the correctness of the SQL results returned by VoltDB, which we take very seriously. The recent additions to SQLCoverage described in this blog post extend that testing in two important ways. First, they allow the comparison of VoltDB results against PostgreSQL, which is more widely used and respected than HSQLDB, and provides a more independent comparison given that VoltDB includes a modified version of HSQLDB’s parser. Second, using PostgreSQL’s PostGIS extension allows the comparison of SQL results using VoltDB’s new Geospatial data types.