At the moment, we prune row groups by doing a linear scan over the row group statistics. For ~10-20K row groups, this takes ~20ms. Where possible, it'd be nice to do this pruning via a faster mechanism. That could make some parquet-backed tables almost competitive with SQLite itself for needle-in-a-haystack queries. E.g., I think this is possible:
- sqlite: 0.1ms
- parquet w/ fast row group pruning: 1-2ms
- parquet w/ linear row group pruning: 20ms
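For reference, the current linear pruning is roughly the following shape. This is only a sketch with illustrative struct and function names, not the extension's actual code; it just shows why the cost grows linearly with the row group count.

```c
#include <stddef.h>
#include <sqlite3.h>

/* Hypothetical per-row-group statistics for one indexed column. */
typedef struct { sqlite3_int64 lo, hi; } RowGroupStats;

/* Visit every row group whose [lo, hi] range could contain `needle`.
 * O(n) in the number of row groups: ~20ms at 10-20K groups. */
static void prune_linear(const RowGroupStats *stats, size_t n,
                         sqlite3_int64 needle,
                         void (*visit)(size_t rowgroup)) {
  for (size_t i = 0; i < n; i++)
    if (stats[i].lo <= needle && needle <= stats[i].hi)
      visit(i);
}
```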
Option 1: SQLite shadow table
Perhaps on table connect we could populate a shadow table with the statistics, then use that at query time to generate an array of row group IDs to traverse.
I'm not sure this is a slam dunk: SQLite won't use more than one index per table in a query, so it will often end up doing a table scan.
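A minimal sketch of what Option 1 might look like against the plain sqlite3 C API. The demo_rowgroups table name and schema are hypothetical, and loading the statistics out of the parquet footer is elided:

```c
#include <sqlite3.h>
#include <stdio.h>

/* At table-connect time: materialize per-row-group min/max stats for
 * one indexed column into a shadow table. */
static int create_shadow_table(sqlite3 *db) {
  const char *ddl =
    "CREATE TABLE IF NOT EXISTS demo_rowgroups("
    "  rowgroup INTEGER PRIMARY KEY,"
    "  min_val  INTEGER,"
    "  max_val  INTEGER);"
    "CREATE INDEX IF NOT EXISTS demo_rowgroups_min"
    "  ON demo_rowgroups(min_val);";
  return sqlite3_exec(db, ddl, 0, 0, 0);
}

/* At query time: let SQLite produce the row group IDs to traverse. */
static int matching_rowgroups(sqlite3 *db, sqlite3_int64 needle) {
  sqlite3_stmt *stmt;
  int rc = sqlite3_prepare_v2(db,
    "SELECT rowgroup FROM demo_rowgroups"
    " WHERE min_val <= ?1 AND max_val >= ?1",
    -1, &stmt, 0);
  if (rc != SQLITE_OK) return rc;
  sqlite3_bind_int64(stmt, 1, needle);
  while (sqlite3_step(stmt) == SQLITE_ROW)
    printf("scan row group %lld\n",
           (long long)sqlite3_column_int64(stmt, 0));
  return sqlite3_finalize(stmt);
}
```

Note that the pruning query constrains both min_val and max_val: SQLite can range-scan the min_val index, but every hit still has to be filtered on max_val, which is exactly the single-index limitation above.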
Option 2: our own indexing
We could do the indexing ourselves in C. I'm not enthused about this.
I guess we could sort the ranges by their lower bound into some array arr.
Then binary search arr for the last lower bound below our lowest value (order by lo desc limit 1) and the first lower bound above our highest value (order by lo asc limit 1), which gives us the bounds to search.
Actually, I guess we could translate that into a series of SQLite calls.
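A rough sketch of what the C version might look like, assuming the ranges have already been sorted by lower bound into arr (names are illustrative):

```c
#include <stddef.h>
#include <sqlite3.h>

typedef struct { sqlite3_int64 lo, hi; size_t rowgroup; } RowGroupRange;

/* Binary search: index of the first range whose lower bound exceeds v. */
static size_t first_lo_above(const RowGroupRange *arr, size_t n,
                             sqlite3_int64 v) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (arr[mid].lo <= v) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

/* Candidates for a query interval [qlo, qhi] are the ranges that start
 * at or below qhi; the binary search bounds the scan on the right. Each
 * candidate's upper bound still needs checking, because sorting by lo
 * says nothing about hi when ranges overlap. */
static void prune_sorted(const RowGroupRange *arr, size_t n,
                         sqlite3_int64 qlo, sqlite3_int64 qhi,
                         void (*visit)(size_t rowgroup)) {
  size_t end = first_lo_above(arr, n, qhi);
  for (size_t i = 0; i < end; i++)
    if (arr[i].hi >= qlo)
      visit(arr[i].rowgroup);
}
```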
Option 3: constrain the problem
We could support this only for monotonically increasing values -- then two binary searches determine the first and last row groups to search. We could either do the search in C, or delegate it to SQLite.
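Under that monotonicity assumption the row-group ranges are sorted and non-overlapping, so both bounds are exact and no per-group check is needed. A sketch, again with illustrative names:

```c
#include <stddef.h>
#include <sqlite3.h>

typedef struct { sqlite3_int64 lo, hi; } RowGroupStats;

/* First row group whose upper bound reaches qlo. Binary searching hi is
 * valid because hi is non-decreasing for a monotonic column. */
static size_t first_candidate(const RowGroupStats *s, size_t n,
                              sqlite3_int64 qlo) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (s[mid].hi < qlo) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

/* One past the last row group whose lower bound is <= qhi. */
static size_t last_candidate(const RowGroupStats *s, size_t n,
                             sqlite3_int64 qhi) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (s[mid].lo <= qhi) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

/* Row groups [first_candidate(..., qlo), last_candidate(..., qhi)) are
 * exactly those that can contain values in [qlo, qhi]. */
```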
See discussion in #34.