Cost-based query optimizations in multithreaded environments

Query optimizations are a vast topic which contains plenty of subtopics varying from projection, selectivity, and join optimizations. Not so many database books out there mentions or explains query optimization phenomena. So, in here we are going to both talk about query optimizations and one concrete multi threaded deterministic solution to selectivity cost optimizations.

There are various ways to store data statistics in a database system. These are mainly:

Probabilistic Data Structures
Histograms
Zone Maps
other methods are not used extensively. We will talk about the downside of other methods in upcoming posts.

There are some constraints of aggregating query statistics and building optimizations on top of it:

Concurrent Reads and Writes
Query loops are freestanding re-entrant code
- It is not advised to use locks over shared memory in the query pipeline.
- We shouldn't lock or wait between multiple queries to submit/retrieve data statistics.
Since statistics data retrieval will be used in optimization step, data fetching/submission calls should be amortizable by the optimizer's performance.

With that in mind:

Probabilistic Data Structures

Satisfies most of the requirements mentioned above. Only downside is that concurrent sketches are really hard to implement and mostly non-existent in this context. From my point of view concurrent sketches can be considered as over-engineering.

Histograms

One of the most prominent ones. Still histograms need to support concurrent RW and registry of metrics throughout the time. This is not a hard necessity for query optimization. But making histograms generalized to accept other metrics can be also be used for query statistics.

Zone Maps

Oracle's solution for querying their IMCU(in-memory compression unit) ¹. If designed with concurrency in mind, will provide good data statistics that query optimizers can use for pruning unnecessary data for filter predicates. This shouldn't be confused with filter pushdown which is yet another optimization.

Zone maps are like synopses implemented by SAP HANA system. In a sense zone maps are more widely used alternatives than SAP HANA synopses. They are reflecting constraint data statistics where data heavily hit by a filter combinations. Zone maps heavily exploit its' design in mostly ordered data compared to other alternatives out there. Also in the all optimization constraint satisfaction, it is the most flexible one so far to implement highly concurrent or even type agnostic manner.

This approach might have make people think that it is useful for high volume of data, which is not the case in most of the approaches too. Amortization is achieved by frequency and sequential organization of the tuple constraints and their aggregations. This idea has been presented years ago, in 1998, this research has been already experimented and Small Materialized Aggregates (SMA) principle has been born ². For the most of the aggregates SMA can be built on top of zone implementation for zone maps. This flexibility can be kept in mind while designing zone maps/synopses. In Lever, zone maps kept flexible and inspired by SMA approach too.

Selectivity based optimizations are what makes query engines faster and one of the crucial optimizations out there. I am working on lever for more than a year now where I am working on implementing a concurrency toolkit for data intensive systems. One of my last work was implementing completely wait-free zone map for constraint based query optimization.

Meanwhile implementing this mechanism I've heavily drawn from HANA's synopsis consistency mechanism embedded inside the Lever's zone map approach in addition to SMA approach mentioned above.

Constraint data statistics are useful for couple of other things, like semi-join reduction optimization. Which is useful for partition-wise join operations. Imagine a situation where having a multi column data partitioned by day. In this case joining over data combined with filter expression exhausts zone map implementation for pulling selectivity.

A demonstrative example for usage of zone map can be a filter followed by an aggregation(like sum in our example). This will allow us to take the unnecessary burden of processing complementary set which is not needed for our filter rule.

With that zone map prunes all zones that are not satisfying the constraint and returns a zone range to do the filtering over. One can ask that what is the speedup this optimization brings, benchmark results showed that it bring more than 30Kx speedup for simple unidirectional scan followed by aggregation.

bench_unoptimized       time:   [9.1430 ms 9.4118 ms 9.6771 ms]                              

bench_zonemap_selected  time:   [274.16 ns 275.11 ns 276.11 ns]                                    
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

For the example benchmarks feel free to look into Lever's benchmarks.

In addition to the zone ranges, lever's zone maps also can give direct selectivity of total spanning zones. This allows caller to compute EXPLAIN implementations and any projection optimizations that happens throughout the execution of the query without doing this in physical plan execution phase. Example for caller side follows:

If zone map updates are combined with data loading/offloading threads, this can also prunes validity checking that zone map needs. Simple solution is that always store a zone map with your tables so zone maps can update themselves when new data arrives or departs. This approach resolves the problem of staleness tracking and all the other staleness related issues.

Moreover, in Lever, I have implemented eventually consistent internal zone map statistics too. That said, let's say, you've checked zone map for column A of a table 100 times, but column B is checked 50 times, since these statistics are also automatically updated within all the threads that are working on query processing simultaneously without contention or lock in their path, they also removes the need of having Least Recently Used (LRU) cache or any in-memory caching logic for data evictions based on usage that we are most probably holding in some side cache.

I suggest using Lever's zone map in your projects and experiment with it in your query pipeline. In future, I would like to create a wrapper over zone maps to incorporate SMA approach to prune aggregation calculation stage based on expressions too. Since that is old, but gold trick that most of the databases are using, it will bring query optimizations to next level for aggregations. Shaving off any computation through predominant materialization is the key solution for decreasing latency in query pipelines.

I hope you've enjoyed this post and learned different things and feel included to my research and experimentation. These are the topics that I experiment and find solutions. If you would like to help to my efforts use my projects, share this post with your circle, or help me through GitHub Sponsor's page that I have. Thanks for your support.

Resources: BibTeX

Ziauddin, M., Witkowski, A., Kim, Y. J., Potapov, D., Lahorani, J., & Krishna, M. (2017). Dimensions based data clustering and zone maps. Proceedings of the VLDB Endowment, 10(12), 1622–1633. https://doi.org/10.14778/3137765.3137769 ↩
Moerkotte, G. (1998). Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing. Proceedings of the 24rd International Conference on Very Large Data Bases, 476–487. ↩

Cost-based query optimizations in multithreaded environments

Theo Mahmut Bulut

Probabilistic Data Structures

Histograms

Zone Maps

Search

Topics