At the moment, we prune row groups by doing a linear scan over the row group statistics. For ~10-20K row groups, this takes ~20ms. Where possible, it'd be nice to do this pruning via a faster mechanism. That could make some parquet-backed tables almost competitive with SQLite itself for needle-in-a-haystack queries. E.g., I think this is possible:
- sqlite: 0.1ms
- parquet w/ fast row group pruning: 1-2ms
- parquet w/ linear row group pruning: 20ms
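For reference, the current linear pruning is roughly the following shape. This is only a sketch with illustrative struct and function names, not the extension's actual code; it just shows why the cost grows linearly with the row group count.

```c
#include <stddef.h>
#include <sqlite3.h>

/* Hypothetical per-row-group statistics for one indexed column. */
typedef struct { sqlite3_int64 lo, hi; } RowGroupStats;

/* Visit every row group whose [lo, hi] range could contain `needle`.
 * O(n) in the number of row groups: ~20ms at 10-20K groups. */
static void prune_linear(const RowGroupStats *stats, size_t n,
                         sqlite3_int64 needle,
                         void (*visit)(size_t rowgroup)) {
  for (size_t i = 0; i < n; i++)
    if (stats[i].lo <= needle && needle <= stats[i].hi)
      visit(i);
}
```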
Option 1: SQLite shadow table
Perhaps on table connect we could populate a shadow table with the statistics, then use that at query time to generate an array of row group IDs to traverse.
I'm not sure this is a slam dunk: SQLite won't use more than one index per table in a query, so it will often end up doing a table scan.
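A minimal sketch of what Option 1 might look like against the plain sqlite3 C API. The demo_rowgroups table name and schema are hypothetical, and loading the statistics out of the parquet footer is elided:

```c
#include <sqlite3.h>
#include <stdio.h>

/* At table-connect time: materialize per-row-group min/max stats for
 * one indexed column into a shadow table. */
static int create_shadow_table(sqlite3 *db) {
  const char *ddl =
    "CREATE TABLE IF NOT EXISTS demo_rowgroups("
    "  rowgroup INTEGER PRIMARY KEY,"
    "  min_val  INTEGER,"
    "  max_val  INTEGER);"
    "CREATE INDEX IF NOT EXISTS demo_rowgroups_min"
    "  ON demo_rowgroups(min_val);";
  return sqlite3_exec(db, ddl, 0, 0, 0);
}

/* At query time: let SQLite produce the row group IDs to traverse. */
static int matching_rowgroups(sqlite3 *db, sqlite3_int64 needle) {
  sqlite3_stmt *stmt;
  int rc = sqlite3_prepare_v2(db,
    "SELECT rowgroup FROM demo_rowgroups"
    " WHERE min_val <= ?1 AND max_val >= ?1",
    -1, &stmt, 0);
  if (rc != SQLITE_OK) return rc;
  sqlite3_bind_int64(stmt, 1, needle);
  while (sqlite3_step(stmt) == SQLITE_ROW)
    printf("scan row group %lld\n",
           (long long)sqlite3_column_int64(stmt, 0));
  return sqlite3_finalize(stmt);
}
```

Note that the pruning query constrains both min_val and max_val: SQLite can range-scan the min_val index, but every hit still has to be filtered on max_val, which is exactly the single-index limitation above.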
Option 2: our own indexing
We could do the indexing ourselves in C. I'm not enthused about this.
I guess we could sort the ranges by their lower bound into some array arr.
Then binary search arr for the last lower bound below our lowest value (order by lo desc limit 1) and the first lower bound above our highest value (order by lo asc limit 1), which gives us the bounds to search.
Actually, I guess we could translate that into a series of SQLite calls.
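A rough sketch of what the C version might look like, assuming the ranges have already been sorted by lower bound into arr (names are illustrative):

```c
#include <stddef.h>
#include <sqlite3.h>

typedef struct { sqlite3_int64 lo, hi; size_t rowgroup; } RowGroupRange;

/* Binary search: index of the first range whose lower bound exceeds v. */
static size_t first_lo_above(const RowGroupRange *arr, size_t n,
                             sqlite3_int64 v) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (arr[mid].lo <= v) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

/* Candidates for a query interval [qlo, qhi] are the ranges that start
 * at or below qhi; the binary search bounds the scan on the right. Each
 * candidate's upper bound still needs checking, because sorting by lo
 * says nothing about hi when ranges overlap. */
static void prune_sorted(const RowGroupRange *arr, size_t n,
                         sqlite3_int64 qlo, sqlite3_int64 qhi,
                         void (*visit)(size_t rowgroup)) {
  size_t end = first_lo_above(arr, n, qhi);
  for (size_t i = 0; i < end; i++)
    if (arr[i].hi >= qlo)
      visit(arr[i].rowgroup);
}
```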
Option 3: constrain the problem
We could support this only for monotonically increasing values -- then two binary searches determine the first and last row groups to search. We could either do the search in C, or delegate it to SQLite.
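Under that monotonicity assumption the row-group ranges are sorted and non-overlapping, so both bounds are exact and no per-group check is needed. A sketch, again with illustrative names:

```c
#include <stddef.h>
#include <sqlite3.h>

typedef struct { sqlite3_int64 lo, hi; } RowGroupStats;

/* First row group whose upper bound reaches qlo. Binary searching hi is
 * valid because hi is non-decreasing for a monotonic column. */
static size_t first_candidate(const RowGroupStats *s, size_t n,
                              sqlite3_int64 qlo) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (s[mid].hi < qlo) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

/* One past the last row group whose lower bound is <= qhi. */
static size_t last_candidate(const RowGroupStats *s, size_t n,
                             sqlite3_int64 qhi) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (s[mid].lo <= qhi) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

/* Row groups [first_candidate(..., qlo), last_candidate(..., qhi)) are
 * exactly those that can contain values in [qlo, qhi]. */
```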
See discussion in #34.