How does Lance compare to Vortex? #3130

philippemnoel · 2024-11-15T12:50:01Z

Hey everyone! Phil here from @paradedb. We're pretty interested in a Parquet/Arrow successor and are considering both Lance and Vortex for the fast random access read. Could you please share how Lance compares to Vortex in your own words? When should one consider one vs the other?

westonpace · 2024-11-15T15:09:44Z

Maintainer

I think there are a few points where they differ today. Lance has both a table format and a file format, while Vortex is focused on the file format at the moment, so I'll limit the comparison to the file formats. I'm not really an expert in Vortex, so this will mostly be about what we've been focused on.

I think it is safe to say the Vortex team has put more effort into compressive encodings. This will probably remain true for a while. Compression hasn't been all that vital to Lance, as most of our customers are doing vector search and 90% or more of their data is pre-compressed anyway (e.g. vector embeddings, images, etc.). That being said, we're making sure we have a good story for string compression in 2.1, as large-string datasets (e.g. web crawlers, NLP datasets, etc.) are a key use case for us. Performance-wise, the difference in compression will be most noticeable when doing OLAP-style queries.
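To illustrate why compression matters less for pre-compressed data, here is a minimal sketch using only the Python standard library. It assumes `zlib` as a stand-in general-purpose compressor (it is not the encoding used by Lance or Vortex), seeded random bytes as a stand-in for high-entropy payloads such as embeddings or image bytes, and a made-up four-value status column as a stand-in for typical scalar OLAP data.

```python
import random
import zlib


def compression_ratio(payload: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(payload, level=6)) / len(payload)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: random bytes approximate already-compressed / high-entropy data.
high_entropy = bytes(rng.getrandbits(8) for _ in range(100_000))

# Assumption: a low-cardinality string column, typical of scalar OLAP tables.
categories = [b"pending", b"shipped", b"delivered", b"returned"]
low_cardinality = b"\n".join(rng.choice(categories) for _ in range(20_000))

ratio_random = compression_ratio(high_entropy)
ratio_scalar = compression_ratio(low_cardinality)

print(f"high-entropy ratio:    {ratio_random:.2f}")
print(f"low-cardinality ratio: {ratio_scalar:.2f}")
```

The high-entropy payload should come out at roughly its original size, while the low-cardinality column shrinks substantially, which is why compressive encodings pay off mostly on scalar, OLAP-style data.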

Lately, the Lance file format has been more focused on structural encodings (list / struct), I/O scheduling, backpressure, combining columns, and large-ish objects (e.g. those over 1KiB).

If your goal is to do OLAP with scalar data in memory or on NVMe, then my guess is Vortex would give you better speed.

If you've got deeply nested data, very large objects, etc. then Lance will give you a fast and robust solution (I can't say that Vortex won't because I really don't know).

"a Parquet/Arrow successor"

Parquet has a few selling points that aren't going to fade anytime soon:

Universal tooling support
Extensive fuzz testing (the Lance reader will likely panic, or worse, given untrusted data)
Pretty much infinite backwards compatibility (we'd like to keep Lance moving forwards and I can't guarantee we will have the bandwidth to ensure we always support older encodings)

"When should one consider one vs the other?"

Ideally, you should make the storage format an abstraction that is completely hidden from your users and occasionally test which works best for your use cases and adapt as needed.
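One way to picture that abstraction is the sketch below. Everything in it is hypothetical and for illustration only: the `TableStore` interface, the `CsvStore` and `JsonLinesStore` backends (standard-library stand-ins, not Lance, Vortex, or Parquet readers), and the `fetch_names` caller. The point is that application code depends only on the interface, so backends can be swapped and compared.

```python
import csv
import json
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Protocol

Row = dict[str, str]


class TableStore(Protocol):
    """Hypothetical interface: the only surface application code sees."""

    def write(self, rows: list[Row]) -> None: ...
    def take(self, indices: list[int]) -> list[Row]: ...


class CsvStore:
    """Toy backend that stores rows as CSV."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def write(self, rows: list[Row]) -> None:
        with self.path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

    def take(self, indices: list[int]) -> list[Row]:
        with self.path.open(newline="") as f:
            rows = list(csv.DictReader(f))
        return [rows[i] for i in indices]


class JsonLinesStore:
    """Toy backend that stores one JSON object per line."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def write(self, rows: list[Row]) -> None:
        with self.path.open("w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    def take(self, indices: list[int]) -> list[Row]:
        with self.path.open() as f:
            rows = [json.loads(line) for line in f]
        return [rows[i] for i in indices]


def fetch_names(store: TableStore, indices: list[int]) -> list[str]:
    """Application code: knows nothing about the on-disk format."""
    return [row["name"] for row in store.take(indices)]


ROWS: list[Row] = [
    {"id": "0", "name": "ada"},
    {"id": "1", "name": "grace"},
    {"id": "2", "name": "edsger"},
]


def run_with_each_backend() -> dict[str, list[str]]:
    """Run the same workload against every backend and collect the results."""
    results: dict[str, list[str]] = {}
    with TemporaryDirectory() as tmp:
        backends = (CsvStore(Path(tmp) / "t.csv"), JsonLinesStore(Path(tmp) / "t.jsonl"))
        for store in backends:
            store.write(ROWS)
            results[type(store).__name__] = fetch_names(store, [2, 0])
    return results


print(run_with_each_backend())
```

In a real system the backends would wrap actual format readers, and the same harness could time each one against a representative workload.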

As for today, I'd say Lance 2.0 is pretty solid and well tested for search solutions, with good enough OLAP to beat out any row-based alternative. You'd probably still want Parquet for pure OLAP. Lance 2.X and Vortex will be better than Lance 2.0 and, at some point, also stable and robust.

2 replies

philippemnoel Nov 15, 2024
Author

Thank you, this is incredibly useful! Congrats on your work, it's extremely impressive :)

rkunnamp Nov 16, 2024

@westonpace I have been looking to replace delta-rs with lancedb for a medium-scale analytics use case.
In my local tests, lancedb is always faster than delta-rs, so I was increasingly hopeful of replacing delta-rs.
To be honest, I always thought lancedb is an incredible piece of engineering, something that is not getting the kind of attention it truly deserves.

But I just panicked after reading "(the Lance reader will likely panic, or worse, given untrusted data)".

Do you mean to say that lancedb is not a good fit for medium-scale OLAP use cases (cases where there is a large number of tables, but the number of rows in a table is less than, say, 100 million)?

Could you explain a use case in which lancedb would be dealing with untrusted data?

eddyxu · 2024-11-15T16:58:31Z

Maintainer

Hi, @philippemnoel. This is Lei from LanceDB.

The Lance format today is a combination of a data format (the layer comparable to Parquet/ORC/Vortex), a table format (schema, versioning, globally unique IDs, etc.), and secondary indices. Our primary optimization target is data serving (LanceDB) while maintaining a decent columnar scan speed for aggregation (at least not slower than Parquet).

Because the Lance format is used in LanceDB, many IO optimizations are done to improve query plans for search and reduce tail latency. We've seen amazing performance in the field for high-traffic search systems, i.e., <10ms p50 for full-text, vector, and metadata searches. Because the table format, secondary index, and data format are designed end-to-end, we can move faster across the stack.
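For readers less familiar with the p50 and tail-latency terminology, here is a minimal sketch of how such percentiles can be computed from raw request timings with Python's standard library. The `latency_percentiles` helper is hypothetical and the sample latencies are invented for illustration; they are not LanceDB measurements.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return the median (p50) and the 99th percentile (p99) of the samples."""
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


# Invented data: most requests are fast, a handful are slow outliers.
samples = [4.0] * 95 + [6.0] * 3 + [80.0, 120.0]

stats = latency_percentiles(samples)
print(f"p50 = {stats['p50']:.1f} ms, p99 = {stats['p99']:.1f} ms")
```

The median stays low even when a few requests are very slow, which is why tail percentiles such as p99 are tracked separately from p50.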

However, as mentioned by Weston already, we don't mean to make this a pure OLAP engine / data warehouse.

You can also watch our Ray summit talk https://www.youtube.com/watch?v=xmTFEzAh8ho&list=PLzTswPQNepXntmT8jr9WaNfqQ60QwW7-U&index=28

0 replies
