Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use a shadow table for faster rowgroup pruning #36

Open
cldellow opened this issue Jul 19, 2018 · 0 comments
Open

use a shadow table for faster rowgroup pruning #36

cldellow opened this issue Jul 19, 2018 · 0 comments

Comments

@cldellow
Copy link
Owner

See discussion in #34.

ATM we prune row groups by doing a linear scan over the row group statistics. For ~10-20K row groups, this takes ~20ms. Where possible, it'd be nice to do this pruning via a faster mechanism. This could make some parquet-backed tables almost competitive with SQLite itself for needle in a haystack queries. eg I think this is possible:

  • sqlite: 0.1ms
  • parquet w/fast rowgroup pruning: 1-2ms
  • parquet w/linear rowgroup pruning: 20ms

Option 1: SQLite shadow table

Perhaps on table connect we could populate a shadow table with the statistics, then use that at query time to generate an array of row group IDs to traverse.

I'm not as sure that this is a slamdunk. SQLite doesn't support using more than 1 index in a query, so it will often end up doing a table scan.

Option 2: our own indexing

We could do the indexing ourself in C. I'm not enthused about this.

I guess we could sort the ranges by their lower bound into some array arr.

Then binary search arr for lower bound < our lowest value (order by lo desc limit 1) and lower bound > our highest value (order by low asc limit 1), which gives us the bounds to search.

Actually, I guess we could translate that into a series of SQLite calls.

Option 3: constrain the problem

We could support it only for monotonically increasing values -- then we could do 2 binary searches to determine the lower and upper bounds of the row groups to search. We could either do the search in C, or delegate it to SQLite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant