Fix Duplicated filters within (filter(TableScan)) plan for unparser #13422

Sevenannn · 2024-11-14T19:52:20Z

Which issue does this PR close?

N/A

Rationale for this change

This PR is a minor performance improvement on top of the changes in Improve TableScan with filters pushdown unparsing (joins) #13132.
It applies when rewriting plans that have aggregates whose lhs / rhs contains a filter and a scan carrying the same filter.

For query

        select
            c_custkey,
            count(o_orderkey)
        from
            customer left outer join orders on
                        c_custkey = o_custkey
                    and o_comment not like '%special%requests%'
        group by
            c_custkey

The logical plan is

        BytesProcessedNode
          Federated
         Projection: customer.c_custkey, count(orders.o_orderkey)
          Aggregate: groupBy=[[customer.c_custkey]], aggr=[[count(orders.o_orderkey)]]
            Left Join:  Filter: customer.c_custkey = orders.o_custkey
              TableScan: customer
              Filter: orders.o_comment NOT LIKE Utf8("%special%requests%")
                TableScan: orders, partial_filters=[orders.o_comment NOT LIKE Utf8("%special%requests%")]

The rewritten query will be:

        SELECT customer.c_custkey, count(orders.o_orderkey)
        FROM customer LEFT JOIN orders ON ((customer.c_custkey = orders.o_custkey) AND (orders.o_comment NOT LIKE '%special%requests%' AND orders.o_comment NOT LIKE '%special%requests%'))
        GROUP BY customer.c_custkey

Under the current approach, the filter orders.o_comment NOT LIKE Utf8("%special%requests%") occurs twice in the final query. This does not affect the correctness of the query result, but the duplicated condition adds performance overhead.

What changes are included in this PR?

Use a Vec to store the filters and preserve their ordering
Check whether a filter already exists in the Vec before adding it
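
The two changes above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `push_unique` and `collect_filters` are hypothetical helper names, and plain strings stand in for DataFusion's `Expr` predicates.

```rust
/// Push `predicate` onto `filters` unless an equal one is already present.
/// Hypothetical helper: any `PartialEq` type works; the PR applies the same
/// `contains` check to filter predicates.
fn push_unique<T: PartialEq>(filters: &mut Vec<T>, predicate: T) {
    // `Vec::contains` is a linear scan, so each call is O(n).
    if !filters.contains(&predicate) {
        filters.push(predicate);
    }
}

/// Collect predicates from several plan nodes (for example a Filter node and
/// the TableScan below it), keeping the order in which they were first seen.
/// Strings are an assumption made for this sketch.
fn collect_filters(sources: &[Vec<&str>]) -> Vec<String> {
    let mut filters = Vec::new();
    for source in sources {
        for predicate in source {
            push_unique(&mut filters, predicate.to_string());
        }
    }
    filters
}
```

With a Filter node and a TableScan that both carry the same `NOT LIKE` predicate, the collected list contains that predicate once, so the unparsed join condition no longer repeats it.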

Are these changes tested?

Yes

Are there any user-facing changes?

No

jayzhan211 · 2024-11-14T23:38:43Z

datafusion/sql/src/unparser/utils.rs

@@ -318,7 +318,9 @@ pub(crate) fn try_transform_to_simple_table_scan_with_filters(
                plan_stack.push(alias.input.as_ref());
            }
            LogicalPlan::Filter(filter) => {
-                filters.push(filter.predicate.clone());
+                if !filters.contains(&filter.predicate) {


Can we change them to HashSet? including table_scan_filters.

contains is an O(n) operation

Yeah, my previous implementation actually used a HashSet. However, it doesn't preserve the filter order, so I didn't end up using it.

Another thing I could do is maintain a temporary HashSet to keep track of the existing filters, while the result is still constructed with the Vec. Every filter would be checked against the HashSet before being pushed to the Vec. Would this approach be better than just using Vec::contains?
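
A minimal sketch of the approach described in this comment, using only the standard library. `UniqueFilters` is a hypothetical name for illustration, and it assumes the predicate type implements `Clone`, `Eq` and `Hash`.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical helper: a Vec that keeps insertion order, paired with a
/// HashSet that makes the "already present?" check O(1) on average.
struct UniqueFilters<T> {
    ordered: Vec<T>,
    seen: HashSet<T>,
}

impl<T: Clone + Eq + Hash> UniqueFilters<T> {
    fn new() -> Self {
        Self {
            ordered: Vec::new(),
            seen: HashSet::new(),
        }
    }

    /// Add `predicate` if it has not been seen before.
    /// Returns `true` when the predicate was newly added.
    fn push(&mut self, predicate: T) -> bool {
        // `HashSet::insert` returns false if the value was already present.
        if self.seen.insert(predicate.clone()) {
            self.ordered.push(predicate);
            true
        } else {
            false
        }
    }

    /// Consume the helper and return the filters in first-seen order.
    fn into_vec(self) -> Vec<T> {
        self.ordered
    }
}
```

The trade-off is one extra clone per distinct predicate and a second container, in exchange for avoiding the linear scan on every insertion.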

How about IndexSet?

IndexSet would be better so that the order of the filters is preserved

alamb

Thank you @Sevenannn -- this is a great idea. Thank you @jayzhan211 for the review

Sevenannn added 2 commits November 14, 2024 11:37

Eliminate duplicated filter within (filter(TableScan)) plan (#51)

b909c5f

* Eliminate duplicated filter within (filter(TableScan)) plan * Updates * fix * add test * fix

Preserve the filter order when eliminating duplicated filter #56

a3b5253

github-actions bot added the sql SQL Planner label Nov 14, 2024

Sevenannn changed the title ~~Qianqian/filter fix~~ Fix Duplicated filters within (filter(TableScan)) plan Nov 14, 2024

jayzhan211 reviewed Nov 14, 2024

alamb reviewed Nov 15, 2024

alamb changed the title ~~Fix Duplicated filters within (filter(TableScan)) plan~~ Fix Duplicated filters within (filter(TableScan)) plan for unparser Nov 15, 2024
