Best Practices for Index Optimization in VoltDB

Indexes provide a classic “space for speed” trade-off. They add to the persistent memory footprint of your application data, but they make query filtering significantly faster. They also trade write performance for read performance, on the assumption that indexed data will be filtered by queries more frequently than it is modified. By following a few best practices when defining indexes, the VoltDB database designer can maximize query-filtering performance in return for a minimum investment of persistent memory and write-time computation.

Here are a few tips for defining effective indexes in VoltDB:

  • Avoid indexes that have a column list that is simply a prefix of another index’s column list. The index with the longer column list will usually serve the same queries as the shorter one. If the primary key of table X is (A, B, C), then an index on (A, B) is of little use. An index on (B, C) may be of use in this scenario or it may be more effective to define the primary key as (B, C, A) — if B is likely to be filtered in queries where A is not equality-filtered.
  • Avoid “low-cardinality” indexes — an index defined solely on a column that has only a few distinct values is usually not very effective. Because of its large number of duplicate values, it does little to narrow the set of rows to be sequentially filtered by other filters in the query. Such an index can sometimes cause the planner to pass over a more effective index for a query, or even a more efficient sequential scan. One way to increase the effectiveness of a low-cardinality index is to add other filtered columns to it, keeping in mind that the effectiveness of an index for a query “tops out” at the first column that has an inequality filter — or before the second column that has an IN filter.
  • When deciding how to order columns within an index (or primary key or unique constraint) definition, columns that are more likely to be filtered with an exact equality, e.g. (A = ?), should be listed before columns that tend to be range-filtered (B <= ?). Queries that are run the most often or that benefit the most from indexing (perhaps because they lack filters that can be covered by other indexes) should weigh more heavily in this decision.
  • In some cases, with a roughly equal mix between queries using forms like “WHERE A = ? AND B <= ?” and other queries using forms like “WHERE A > ? AND B = ?”, it may be worthwhile to define indexes on both permutations — on X(A, B …) and on X(B, A …). Otherwise, when two or more columns in an index tend to both get equality filtered in combination, it is generally better to list a column first if it also tends to be filtered (without the other) in other queries. A possible exception to this rule is when the column has low cardinality (to avoid the ineffective use of the index).
  • Placing the low-cardinality column later in the index’s list prevents the index from being applied as a low-cardinality indexed filter and favors the selection of a more effective index or sequential scan.
  • Any non-unique index that the catalog report lists as having no procedures using it is a candidate for elimination. But first, it may make sense to look for queries that would be expected to use the index and determine what they are using instead to scan the table. It may be that the index chosen by the planner is actually less effective, and that index may be the better candidate for elimination. Also, note that indexes that are only used to support recalculation of min and max values in materialized views may be erroneously reported as unused.
  • Index optimization is best accomplished iteratively, eliminating or tuning an index on a table and seeing its effect on statements before making other changes to other competing indexes.
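
The first tip can be sketched as follows, using the hypothetical table X from the tips above (the column types here are assumptions for illustration):

```sql
-- Hypothetical table: the primary key (A, B, C) already provides an
-- index whose column list begins with (A, B).
CREATE TABLE X (
    A INTEGER NOT NULL,
    B INTEGER NOT NULL,
    C INTEGER NOT NULL,
    PRIMARY KEY (A, B, C)
);

-- Redundant: (A, B) is simply a prefix of the primary key (A, B, C).
-- CREATE INDEX IDX_X_A_B ON X(A, B);

-- Potentially useful: serves queries that filter B (and C) without an
-- equality filter on A.
CREATE INDEX IDX_X_B_C ON X(B, C);
```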

VoltDB indexes provide multiple benefits. They help to guard against unintended duplication of data. They help to optimize recalculation of min and max values in materialized views. Tree indexes in particular can also be used to replace memory- and processor-intensive sorting operations in queries that have ORDER BY and GROUP BY clauses. This discussion focuses on the benefits of indexes in implementing SQL WHERE clauses that filter query results.

There are several methods for constructing indexes in SQL:

— the PRIMARY KEY column attribute

— the UNIQUE or ASSUME UNIQUE column attribute

— the PRIMARY KEY table constraint

— the UNIQUE or ASSUME UNIQUE table constraint

— the CREATE INDEX command.

Any of these can be used to define a “UNIQUE” index on a single column. The table constraints and CREATE INDEX statements can also define a “UNIQUE” index on multiple columns or on expressions that use one or more columns. The CREATE INDEX statement can be used to construct a non-UNIQUE index on one or more columns or expressions.
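
As a sketch, the five methods might look like this (table, column, and constraint names are hypothetical):

```sql
-- PRIMARY KEY and UNIQUE as column attributes:
CREATE TABLE T1 (
    ID INTEGER NOT NULL PRIMARY KEY,
    CODE VARCHAR(8) UNIQUE,
    NAME VARCHAR(32)
);

-- PRIMARY KEY and ASSUME UNIQUE as table constraints:
CREATE TABLE T2 (
    A INTEGER NOT NULL,
    B INTEGER NOT NULL,
    C INTEGER NOT NULL,
    CONSTRAINT PK_T2 PRIMARY KEY (A, B),
    CONSTRAINT UQ_T2_ABC ASSUME UNIQUE (A, B, C)
);

-- CREATE INDEX, UNIQUE or non-unique, on one or more columns:
CREATE UNIQUE INDEX IDX_T2_B_A ON T2(B, A);
CREATE INDEX IDX_T1_NAME ON T1(NAME);
```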

All examples in this best practices discussion describe indexes as if they were created by the CREATE INDEX command, but the discussion applies generally to indexes defined using any of these methods.

The goals of these best practices are:

— To eliminate unused indexes

— To minimize redundancy (memory use and overhead for writes) among overlapping indexes on a table

— To minimize sequential filtering of large numbers of rows.

Sequential filtering always occurs when rows are accessed without the benefit of an index. This is known as a sequential scan.  Sequential filtering can also occur on an indexed scan when there are more filters in the query than are covered by the index. The cost of sequential filtering is based on several factors. One factor is the number of filters being applied to each row. A major factor is the number of rows to which the filters must be applied.

The intent of an index is to use a more streamlined “lookup” algorithm to produce a small set of filtered rows, eliminating the need to sequentially apply (as many) filters to (as many) rows.

Since there are trade-offs and limitations involved in defining indexes, indexes may not provide complete coverage for all of the filters in a query. If any filters are not covered by an index, they must be sequentially applied. The cost of this action is typically reduced compared to a sequential scan of the data because the index reduces the two major contributing factors: the number of (remaining, uncovered) filters to apply, and the number of rows to which they must be applied.

The lookup algorithm used by an index is typically much more efficient than sequential application for the same set of filters and rows, but it does not have zero cost. It also slightly complicates the process of sequentially applying any remaining filters to any remaining rows. In fact, the worst-case scenario for query filter performance is when an index’s lookup algorithm is employed but fails to cover most of a query’s filters and fails to eliminate most of the table’s rows. This case can perform significantly worse than a sequential scan query that uses no index at all and applies all of its filters sequentially. This possibility calls for the elimination of ineffective indexes from a database.

An ideal set of index definitions minimizes the number of times that any filter is sequentially applied to any row in any query over the life of the database system. It accomplishes this with a minimum number of indexes, each of minimum complexity, to reduce persistent memory and data write computation costs.

A key input to defining indexes is knowing which of the filters in a query can be “covered” by a given index. Filters and combinations of filters qualify for coverage based on different criteria.

Each “scan” in a query, that is, each argument to a FROM clause that is not a subquery, can use up to one index defined on its table. When multiple indexes are defined on the same table, they compete in the query planner for control of each scan, in each query, that uses the table.

The query planner uses several criteria to evaluate which of the table’s indexes covering one or more filters in the query is likely to be the most efficient.

When indexing a single column, as in “CREATE INDEX INDEX_OF_X_A ON X(A);”, a covered filter can be:

— “A <op> <constant>”, where <op> can be any of “=, <, >, <=, or >=” or it can be

— “A BETWEEN <constant1> AND <constant2>” or it can be

— “A IN <constant-list>” or it can be

— A special case of “A LIKE <string-pattern>” where <string-pattern> contains a fixed prefix followed by a wild-card character.

Here, “<constant>”, “<constant1>”, and “<constant2>” can be actual literal constants like 1.0 or ‘ABC’ or they can be ? parameters which resolve to constants at runtime.

“<constant-list>” can be a list of literals, or of literals and parameters, like (‘ABC’, ‘ACB’, ‘BAC’, ‘BCA’, ‘CAB’, ‘CBA’) or (1, 2, 3, ?) or (?, ?, ?, ?, ?), or a single vector-valued ? parameter. Each of these “constants” can also be an expression of constants, such as ((1024*1024)-1).
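
Putting the forms above together, each of these queries has a filter that the single-column index can cover (the LIKE example assumes A is a string-typed column):

```sql
CREATE INDEX INDEX_OF_X_A ON X(A);

SELECT * FROM X WHERE A = 10;                   -- equality
SELECT * FROM X WHERE A >= ?;                   -- range, parameterized
SELECT * FROM X WHERE A BETWEEN 5 AND 15;       -- range
SELECT * FROM X WHERE A IN (1, 2, 3, ?);        -- IN list
SELECT * FROM X WHERE A = (1024 * 1024) - 1;    -- constant expression
SELECT * FROM X WHERE A LIKE 'ABC%';            -- fixed prefix + wildcard
```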

Depending on the order in which tables are scanned in a query, called the “join order”, a covered filter can also be “A <op> <column>” where <column> is a column from another table in the query or any expression of a column or columns from another table and possibly constants, like B or (B || C) or SUBSTR( B||C , 1 , 4 ).

The “join order” dependency works like this. If you had two tables indexed on column A and your query was:

SELECT * FROM X, Y WHERE X.A = Y.A and X.B = ?;

only one of the two scans could use an index. The first table to be scanned would have to use a sequential table scan, because the value it would need for an indexed lookup on A comes from the other, not-yet-scanned table.

If you also had an index on X.B, X could be index-scanned on B and Y could then be index-scanned on A, so a table scan would be avoided.

The availability of indexes that cover the scans of a query has a direct effect on the planner’s selection of the join order for a query. In this case, the planner would reject the option of scanning Y first, since that would mean one more sequential scan and one fewer index scan, and the planner prefers more index scans whenever possible on the assumption that index scans are more efficient.
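
The scenario above might be sketched like this (index names are hypothetical):

```sql
-- With only these two indexes, one scan must still be sequential:
-- the first table scanned has no usable value for its A lookup yet.
CREATE INDEX IDX_X_A ON X(A);
CREATE INDEX IDX_Y_A ON Y(A);

SELECT * FROM X, Y WHERE X.A = Y.A AND X.B = ?;

-- Adding an index on X.B lets both scans use indexes: X is
-- index-scanned on B, then Y is index-scanned on A = X.A.
CREATE INDEX IDX_X_B ON X(B);
```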

When creating an index containing multiple columns, as in “CREATE INDEX INDEX_OF_X_A_B ON X(A, B);”, a covered filter can be any of the forms listed above for coverage by a simpler index “ON X(A)”, regardless of the presence of a filter on B — this is used to advantage when columns are added to an index to lower its cardinality, as discussed below.

A multi-column index “ON X(A, B)” can be used more effectively in queries with a combination of filters that includes a filter on A and a filter on B. To enable the more effective filtering, the “first” filter or “prefix filter” on A must specifically have the form “A = …” or “A IN …” – possibly involving column(s) of other tables, depending on join order – while the filter on B can be any form from the longer list of covered filters, above.

A specific exception to this rule is that a filter of the form “B IN …” does not improve the effectiveness of a filter of the form “A IN …”, but that same filter “B IN …” can be used with a filter of the specific form “A = …”. In short, each index is restricted to applying to only one “IN” filter per query. So, when the index is covering “A IN …”, it will refuse to cover the “B IN …” filter.

This extends to indexes on greater numbers of columns, so an index “ON X(A, B, C)” can generally be used for all of the filters and filter combinations described above using A or using A and B. It can be used still more effectively on a combination of prefix filters like “A = … ” ( or “A IN …” ) AND “B = …” ( or “B IN …” ) with an additional filter on C — but again, only the first “IN” filter improves the index effectiveness, and other “IN” filters are not covered.

When determining whether a filter can be covered as the “first filter” of an index or a “prefix filter” (first or second filter of an index on three or more columns, etc.), the ordering of the filters always follows the ordering of the columns in the index definition. So, “CREATE INDEX INDEX_ON_X_A_B ON X(A, B)” is significantly different from “CREATE INDEX INDEX_ON_X_B_A ON X(B, A)”. In contrast, the orientation of the filters as expressed in each query does not matter at all, so “A = 1 and B > 10” has the same effect on indexing as “10 < B and A = 1” etc. The filter “A = 1” is considered the “first” filter in both cases when the index is “ON (A, B)” because A is first.
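
For example:

```sql
-- Significantly different indexes:
CREATE INDEX INDEX_ON_X_A_B ON X(A, B);
CREATE INDEX INDEX_ON_X_B_A ON X(B, A);

-- These two queries are equivalent for indexing purposes. For
-- INDEX_ON_X_A_B, "A = 1" is the first filter in both cases because
-- A is first in the index definition, regardless of how the query
-- text orders the filters.
SELECT * FROM X WHERE A = 1 AND B > 10;
SELECT * FROM X WHERE 10 < B AND A = 1;
```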

Also, other arbitrary filters can be combined in a query with “AND” without disqualifying the covered filters; these additional filters simply add (reduced) sequential filtering cost to the index scan.

But a top-level OR condition like “A = 0 OR A > 100” disqualifies all of the filters, so the scan will not use any index.

A general precondition for any of a query’s filters to be covered by a multi-column index is that the first column in the index must be filtered. So, if a query had no filter at all on A, it could not use any of the above indexes, regardless of the filters on B and/or on C. This is the condition that can cause table scans if there are not enough indexes, or if the indexes or queries are not carefully matched.
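
For example, with a three-column index:

```sql
CREATE INDEX IDX_X_A_B_C ON X(A, B, C);

-- Cannot use IDX_X_A_B_C: there is no filter on A, the index's
-- first column. Absent another index, this becomes a table scan.
SELECT * FROM X WHERE B = 5 AND C < 100;
```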

This implies that carelessly adding columns to the start of an already useful index’s list can make it less useful and applicable to fewer queries. Conversely, adding columns to the end of an already useful index (rather than to the beginning) is more likely to make the index just as applicable but more effective in eliminating sequential filtering. Adding to the middle of the list can cause an index to become either more or less effective for the queries to which it applies. Any such change should be tested by reviewing the catalog report and/or by benchmarking the affected queries. Optimal index use and query performance may be achieved either with the original definition of the index, with the changed definition, or by defining two indexes.

Note that the above discussion applies to VoltDB’s default “tree” indexes.

You can also define a more specialized hash index, which only covers queries that use equality or “IN” (limited to one “IN” filter per query, as above) to filter ALL of the keys in the index.

So, a hash index on column A requires a filter that looks like “A = …” or “A IN …”.

A hash index on (A, B) requires a filter that looks like “A = … AND B = …” or “A IN … AND B = …” or “A = … AND B IN …”, but it will be of no use to queries whose only filters take forms like “A = …” or “A = … AND B >= …” etc.
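
A sketch of which filters a two-key hash index can and cannot serve (historically, VoltDB selects a hash index when the index name contains the string “HASH”; treat that naming convention as an assumption to verify against the current documentation):

```sql
-- "HASH" in the name requests a hash index (assumed convention).
CREATE INDEX IDX_X_A_B_HASH ON X(A, B);

-- Usable with the hash index: ALL keys equality- or IN-filtered
-- (at most one IN filter, as above).
SELECT * FROM X WHERE A = ? AND B = ?;
SELECT * FROM X WHERE A IN (1, 2, 3) AND B = ?;

-- Not usable with the hash index: prefix-only or range filters.
SELECT * FROM X WHERE A = ?;
SELECT * FROM X WHERE A = ? AND B >= ?;
```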

The more flexible default tree indexes are recommended over hash indexes, except in special cases where equality-lookup performance on the complete set of keys is super-critical, and neither filtering on just a prefix of the keys nor range-filter performance on the full key set matters.

Even under these conditions, we recommend benchmarking hash indexes against tree indexes to verify their benefit for the scale of your data.

To recap, here are the best practices tips for defining indexes in VoltDB:

  • Avoid indexes that have a column list that is simply a prefix of another index’s column list. The index with the longer column list will usually serve the same queries as the shorter one. If the primary key of table X is (A, B, C), then an index on (A, B) is of little use. An index on (B, C) may be of use in this scenario or it may be more effective to define the primary key as (B, C, A) — if B is likely to be filtered in queries where A is not equality-filtered.
  • Avoid “low-cardinality” indexes — an index defined solely on a column that has only a few distinct values is usually not very effective. Because of its large number of duplicate values, it does little to narrow the set of rows to be sequentially filtered by other filters in the query. Such an index can sometimes cause the planner to pass over a more effective index for a query, or even a more efficient sequential scan. One way to increase the effectiveness of a low-cardinality index is to add other filtered columns to it, keeping in mind that the effectiveness of an index for a query “tops out” at the first column that has an inequality filter — or before the second column that has an IN filter.
  • When deciding how to order columns within an index (or primary key or unique constraint) definition, columns that are more likely to be filtered with an exact equality, e.g. (A = ?), should be listed before columns that tend to be range-filtered (B <= ?). Queries that are run the most often or that benefit the most from indexing (perhaps because they lack filters that can be covered by other indexes) should weigh more heavily in this decision.
  • In some cases, with a roughly equal mix between queries using forms like “WHERE A = ? AND B <= ?” and other queries using forms like “WHERE A > ? AND B = ?”, it may be worthwhile to define indexes on both permutations — on X(A, B …) and on X(B, A …). Otherwise, when two or more columns in an index tend to both get equality filtered in combination, it is generally better to list a column first if it also tends to be filtered (without the other) in other queries. A possible exception to this rule is when the column has low cardinality (to avoid the ineffective use of the index).
  • Placing the low-cardinality column later in the index’s list prevents the index from being applied as a low-cardinality indexed filter and favors the selection of a more effective index or sequential scan.
  • Any non-unique index that the catalog report lists as having no procedures using it is a candidate for elimination. But first, it may make sense to look for queries that would be expected to use the index and determine what they are using instead to scan the table. It may be that the index chosen by the planner is actually less effective, and that index may be the better candidate for elimination. Also, note that indexes that are only used to support recalculation of min and max values in materialized views may be erroneously reported as unused.
  • Index optimization is best accomplished iteratively, eliminating or tuning an index on a table and seeing its effect on statements before making other changes to other competing indexes.

Please let us know if this is useful, and share your indexing strategies via info@voltdb.com. Learn more when you visit the VoltDB Documentation site.
