Description
What happens?
I'm testing DuckDB's performance on TPC-H at large scale factors (SF > 100). On Q21, DuckDB runs very slowly and consumes a lot of memory. I believe the reason is that the current Semi-Join/Anti-Join implementation does not support inequality conditions.
To Reproduce
The TPC-H schema is used here; alternatively, you can create a simplified table like this:
create table lineitem (l_orderkey int, l_suppkey int);
insert into lineitem values (1,1),(1,2),(3,3),(4,5),(5,5),(6,5);
Consider an oversimplified version of Q21:
select * from lineitem l1 where exists (
select * from lineitem l2
where
l2.l_orderkey = l1.l_orderkey
);
DuckDB generates a query plan like this:
That's cool and everything works well.
However, if I add an inequality condition (i.e. l2.l_suppkey <> l1.l_suppkey), as in TPC-H Q21:
select * from lineitem l1 where exists (
select * from lineitem l2
where
l2.l_orderkey = l1.l_orderkey
and l2.l_suppkey <> l1.l_suppkey
);
DuckDB generates a terrible query plan:
As you can see, the plan contains an INNER JOIN between two copies of the lineitem table! This makes the query run slowly and consume a lot of memory.
It seems feasible to extend the existing Semi-Join/Anti-Join implementation to also support inequality conditions: for every candidate match produced by the equality condition, additionally evaluate the given inequality condition before deciding whether the row qualifies.
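To make the idea concrete, here is a minimal Python sketch of a hash semi/anti join that checks a residual inequality predicate on each candidate found through the equality key. It is only an illustration on the sample data above, not DuckDB's actual implementation; the names `hash_semi_join`, `residual`, and `different_supplier` are hypothetical.

```python
from collections import defaultdict

def hash_semi_join(probe_rows, build_rows, residual, anti=False):
    """Semi (or anti) join on l_orderkey with an extra residual predicate.

    Rows are (l_orderkey, l_suppkey) tuples. Hypothetical helper for
    illustration only; this is not how DuckDB is implemented.
    """
    # Build phase: group build-side rows by the equality key (l_orderkey).
    table = defaultdict(list)
    for orderkey, suppkey in build_rows:
        table[orderkey].append((orderkey, suppkey))

    out = []
    for row in probe_rows:
        # Probe phase: look up candidates by key, then test the residual
        # predicate; any() stops at the first candidate that satisfies it,
        # so no joined pairs are ever materialized.
        matched = any(residual(row, cand) for cand in table.get(row[0], ()))
        # Semi join keeps matched rows; anti join keeps unmatched rows.
        if matched != anti:
            out.append(row)
    return out

lineitem = [(1, 1), (1, 2), (3, 3), (4, 5), (5, 5), (6, 5)]

# Residual condition from the query: l2.l_suppkey <> l1.l_suppkey
different_supplier = lambda l1, l2: l2[1] != l1[1]

semi = hash_semi_join(lineitem, lineitem, different_supplier)
anti = hash_semi_join(lineitem, lineitem, different_supplier, anti=True)
print(semi)  # [(1, 1), (1, 2)]
print(anti)  # [(3, 3), (4, 5), (5, 5), (6, 5)]
```

Each probe row is emitted at most once, so the output never exceeds the size of the probe side, whereas an inner join followed by deduplication can produce many pairs per order key first.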
OS:
Linux
DuckDB Version:
0.5.1
DuckDB Client:
Shell
Full Name:
Aqua
Affiliation:
CAS
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree