Fix Duplicated filters within (filter(TableScan)) plan for unparser #13422

Open
wants to merge 2 commits into
base: main

Sevenannn
Contributor

Which issue does this PR close?

N/A

Rationale for this change

For the query

        select
            c_custkey,
            count(o_orderkey)
        from
            customer left outer join orders on
                        c_custkey = o_custkey
                    and o_comment not like '%special%requests%'
        group by
            c_custkey

The logical plan is

+---------------+---------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------+
| logical_plan  | BytesProcessedNode                                                                                |
|               |   Federated                                                                                       |
|               |  Projection: customer.c_custkey, count(orders.o_orderkey)                                         |
|               |   Aggregate: groupBy=[[customer.c_custkey]], aggr=[[count(orders.o_orderkey)]]                    |
|               |     Left Join:  Filter: customer.c_custkey = orders.o_custkey                                     |
|               |       TableScan: customer                                                                         |
|               |       Filter: orders.o_comment NOT LIKE Utf8("%special%requests%")                                |
|               |         TableScan: orders, partial_filters=[orders.o_comment NOT LIKE Utf8("%special%requests%")] |
+---------------+---------------------------------------------------------------------------------------------------+

The rewritten query will be:

        SELECT customer.c_custkey, count(orders.o_orderkey)
        FROM customer
        LEFT JOIN orders ON ((customer.c_custkey = orders.o_custkey) AND (orders.o_comment NOT LIKE '%special%requests%' AND orders.o_comment NOT LIKE '%special%requests%'))
        GROUP BY customer.c_custkey

Under the current approach, the filter orders.o_comment NOT LIKE Utf8("%special%requests%") appears twice in the final query. This does not affect the correctness of the query result, but the duplicated condition adds performance overhead.

What changes are included in this PR?

  • Use a Vec to store the filters, which preserves their order
  • Check whether a filter already exists in the Vec before adding it
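
As a rough illustration of the two points above, here is a minimal sketch of order-preserving deduplication with a Vec. It is not the code from this PR: predicates are modeled as plain strings rather than DataFusion `Expr` values, and the names `push_unique` and `conjunction` are hypothetical.

```rust
/// Push `predicate` only if an equal one has not been collected yet.
/// Assumption: predicates are plain strings standing in for expression values.
fn push_unique(filters: &mut Vec<String>, predicate: &str) {
    // Linear scan over the predicates collected so far.
    if !filters.iter().any(|f| f == predicate) {
        filters.push(predicate.to_string());
    }
}

/// Join the collected predicates into a single condition, in insertion order.
fn conjunction(filters: &[String]) -> String {
    filters.join(" AND ")
}
```

Pushing the same NOT LIKE predicate twice (once from the Filter node, once from the scan's partial filters) then leaves a single copy in the joined condition, and the remaining predicates keep the order in which they were first seen.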

Are these changes tested?

Yes

Are there any user-facing changes?

No

* Eliminate duplicated filter within (filter(TableScan)) plan

* Updates

* fix

* add test

* fix
@github-actions github-actions bot added the sql SQL Planner label Nov 14, 2024
@Sevenannn Sevenannn changed the title Qianqian/filter fix Fix Duplicated filters within (filter(TableScan)) plan Nov 14, 2024
@@ -318,7 +318,9 @@ pub(crate) fn try_transform_to_simple_table_scan_with_filters(
                 plan_stack.push(alias.input.as_ref());
             }
             LogicalPlan::Filter(filter) => {
-                filters.push(filter.predicate.clone());
+                if !filters.contains(&filter.predicate) {
Contributor

@jayzhan211 jayzhan211 Nov 14, 2024

Can we change them to a HashSet, including table_scan_filters?

contains is an O(n) operation.

Contributor Author

Yeah, my previous implementation actually used a HashSet. However, it doesn't preserve the filter order, so I didn't end up using it.

Contributor Author

Another thing I could do is maintain a temporary HashSet to keep track of the existing filters, while the result is still built with a Vec. Every filter would be checked against the HashSet before being pushed to the Vec. Would this approach be better than just using Vec::contains?
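
A minimal sketch of that idea, assuming predicates can be hashed and compared for equality. Plain strings stand in for the real expression type, and `OrderedFilters` is a hypothetical name, not something from the DataFusion codebase:

```rust
use std::collections::HashSet;

/// Hypothetical helper: a HashSet answers "seen before?" in expected O(1),
/// while a Vec records first-seen order for the final output.
struct OrderedFilters {
    seen: HashSet<String>,
    ordered: Vec<String>,
}

impl OrderedFilters {
    fn new() -> Self {
        OrderedFilters {
            seen: HashSet::new(),
            ordered: Vec::new(),
        }
    }

    /// Returns true if the predicate was newly added, false if it was a duplicate.
    fn push(&mut self, predicate: &str) -> bool {
        // `HashSet::insert` returns false when the value is already present.
        if self.seen.insert(predicate.to_string()) {
            self.ordered.push(predicate.to_string());
            true
        } else {
            false
        }
    }

    /// Consume the helper and return the predicates in first-seen order.
    fn into_vec(self) -> Vec<String> {
        self.ordered
    }
}
```

The cost of this layout is that every predicate is stored twice, once in each collection, in exchange for avoiding the linear scan on each insertion.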

Contributor

@jayzhan211 jayzhan211 Nov 15, 2024

How about IndexSet?

Contributor

IndexSet would be better so that the order of the filters is preserved

Contributor

@alamb alamb left a comment

Thank you @Sevenannn -- this is a great idea. Thank you @jayzhan211 for the review

@alamb alamb changed the title Fix Duplicated filters within (filter(TableScan)) plan Fix Duplicated filters within (filter(TableScan)) plan for unparser Nov 15, 2024
Labels
sql SQL Planner

3 participants