加班遇到一个SQL问题,本想把别人的SQL改下成SparkSQL,在YARN上运行,然而数据一直对不上。
原SQL
⚠️说明:a.id,b.id没有空的,数据1:1,b.name可能存在空的
select a.id,b.id,b.name
from table_a a 
left join table_b b on a.id = b.id and b.is_delete = 0 and b.name is not null
where 
exists(select 1 from  table_b c where a.id = c.id and c.is_delete = 0
)改后的
想法是:既然exists过滤了,为什么不直接inner join呢,于是乎:
select a.id,b.id,b.name
from table_a a 
inner join table_b b on a.id = b.id and b.is_delete = 0  求助群友
问了下群友,区别是我一直以为left join 后 b.name is not null并没什么用,就没有带,我问群友,下面这两个SQL有什么区别:
select *
from table_a a 
left join table_b b on a.id = b.id and b.is_delete = 0 
where 
exists(select 1 from  table_b c where a.id = c.id and c.is_delete = 0
)select *
from table_a a 
inner join table_b b on a.id = b.id and b.is_delete = 0  群友问AI的结果

这样的回答,不太对
自悟
然后仔细去品味这两个SQL有什么不同
-- (1)
select a.id,b.id,b.name
from table_a a 
left join table_b b on a.id = b.id and b.is_delete = 0 and b.name is not null
where 
exists(select 1 from  table_b c where a.id = c.id and c.is_delete = 0
)-- (2)select a.id,b.id,b.name
from table_a a 
inner join table_b b on a.id = b.id and b.is_delete = 0  数据table_a
| id | 
| 1 | 
| 2 | 
| 3 | 
数据table_b
| id | name | id_delete | 
| 1 | aa | 0 | 
| 2 | bb | 0 | 
| 3 | NULL | 0 | 
结论:
(1)sql计算后的会剔除掉 table_a 不符合 a.id = c.id and c.is_delete = 0 条件的数据。加上 name is not null。最后的数据会出现这两类情况:
a.id,null,null   # name为null,b表全部为空
a.id,b.id,b.name # 全部有值| a.id | b.id | b.name | 
| 1 | 1 | aa | 
| 2 | 2 | bb | 
| 3 | NULL | NULL | 
(2)sql计算后中则会出现这两类情况:
a.id,b.id,null   
a.id,b.id,b.name # 全部有值| a.id | b.id | b.name | 
| 1 | 1 | aa | 
| 2 | 2 | bb | 
| 3 | 3 | NULL | 
所以,count的时候是没有问题的,两个都可以,如果是取具体的值有所区别。
拓展
如果a.id b.id 是1:n 呢?
数据table_a
| id | 
| 1 | 
| 2 | 
| 3 | 
| 10 | 
数据table_b
| id | name | id_delete | 
| 1 | aa | 0 | 
| 1 | NULL | 0 | 
| 2 | bb | 0 | 
| 3 | NULL | 0 | 
-- (1)
select a.id,b.id,b.name
from table_a a 
left join table_b b on a.id = b.id and b.is_delete = 0 and b.name is not null
where 
exists(select 1 from  table_b c where a.id = c.id and c.is_delete = 0
)| a.id | b.id | b.name | 
| 1 | 1 | aa | 
| 2 | 2 | bb | 
| 3 | NULL | NULL | 
-- (2)select a.id,b.id,b.name
from table_a a 
inner join table_b b on a.id = b.id and b.is_delete = 0 | a.id | b.id | b.name | 
| 1 | 1 | aa | 
| 1 | 1 | NULL | 
| 2 | 2 | bb | 
| 3 | 3 | NULL | 
所以,如果count的时候,1对n,相对于1:1是有区别的。
最后
(1)菜就要学,就要钻研。
(2)加班使人头疼,头脑不灵光。
(3)具体问题,具体分析。