关于 HiveSQL 常见的 Left Join 误区，你知道吗-编程学习网

关于LEFT JOIN

外连接是我们书写SQL时经常使用的多表连接方式，使用起来也是十分的简单。值得注意的是，越是简单的东西，越是容易被忽略细节。通常我们都是这样理解LEFT JOIN的：

语义是满足Join on条件的直接返回，但不满足情况下，需要返回Left Outer Join的left 表所有列，同时右表的列全部填null

上述对于LEFT JOIN的理解是没有任何问题的，但是里面有一个误区：谓词下推。具体看下面的实例：

假设有如下的三张表：

--建表
create table t1(id int, value int) partitioned by (ds string);
create table t2(id int, value int) partitioned by (ds string);
create table t3(c1 int, c2 int, c3 int);
--数据装载，t1表
insert overwrite table t1 partition(ds='20220120') select '1','2022';
insert overwrite table t1 partition(ds='20220121') select '2','2022';
insert overwrite table t1 partition(ds='20220122') select '2','2022';

--数据装载，t2表
insert overwrite table t2 partition(ds='20220120') select '1','120';

当我们执行如下的SQL查询时，会返回什么数据呢?

SELECT  *
FROM    t1
LEFT JOIN t2
ON      t1.id = t2.id
AND     t1.ds = '20220120'
;

结果1：

1 2022 20220120 1 120 20220120

结果2：

1 2022 20220120 1 120 20220120
2 2022 20220121 NULL NULL NULL
1 2022 20220122 NULL NULL NULL

相信对于很多初学者，甚至是一个有开发经验的人来说，会认为结果1是正确的返回结果。其实结果1的并不是正确的结果，真正的返回值是结果2.

是不是跟预期的结果不一致呢?很多初学者会认为上述查询SQL中AND t1.ds = '20220120'会进行谓词下推，从而得到结果2。其实，SQL本身的语义不是这样的，如果需要获取结果1的数据，正确的查询方式是下面这样：

--方式1:
SELECT  *
FROM    t1
LEFT OUTER JOIN t2
ON      t1.id = t2.id
WHERE   t1.ds = '20220120'
;
--方式2:

SELECT  *
FROM    (
            SELECT  *
            FROM    t1
            WHERE   ds = '20220120'
        ) t1
LEFT OUTER JOIN t2
ON      t1.id = t2.id
;

细心的你看出差异了吗?重点是在WHERE t1.ds = '20220120'过滤条件上，最上面的查询方式是ON t1.ds = '20220120'，所以按照LEFT JOIN的语义，如果没有过滤条件，那么左表的数据应该全部返回，右表匹配不上则补null。

执行计划

我们先来看看没有谓词下推的查询SQL的执行计划

正常LEFT JOIN

查看执行计划

EXPLAIN
SELECT  *
FROM    t1
LEFT JOIN t2
ON      t1.id = t2.id
AND     t1.ds = '20220120'
;

执行计划结果

hive> EXPLAIN
    > SELECT  *
    > FROM    t1
    > LEFT JOIN t2
    > ON      t1.id = t2.id
    > AND     t1.ds = '20220120'
    > ;
OK
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_1:t2
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:t2
          TableScan
            alias: t2
            Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), value (type: int), ds (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                filter predicates:
                  0 {(_col2 = '20220120')}
                  1
                keys:
                  0 _col0 (type: int)
                  1 _col0 (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), value (type: int), ds (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                filter predicates:
                  0 {(_col2 = '20220120')}
                  1
                keys:
                  0 _col0 (type: int)
                  1 _col0 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                Statistics: Num rows: 3 Data size: 19 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 3 Data size: 19 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

从上面的执行计划可以看出：总共有3个stage，

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3

其中stage4是map任务读取t2表，将t2表加载成HashTable，用于map端join。t2表数据量为1行。

Select Operator expressions: id (type: int), value (type: int), ds (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator

stage3是map任务读取t1表数据并执行map端join。t1表数量为3行，可见并没有进行过滤操作。

 Map Operator Tree:
      TableScan
        alias: t1
        Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
        Select Operator
          expressions: id (type: int), value (type: int), ds (type: string)
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE

Stage-0进行结果输出，最终并未执行过滤操作。

Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink

谓词下推的LEFT JOIN

查看执行计划

EXPLAIN
SELECT  *
FROM    t1
LEFT OUTER JOIN t2
ON      t1.id = t2.id
WHERE   t1.ds = '20220120'
;

执行计划结果

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_1:t2
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:t2
          TableScan
            alias: t2
            Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), value (type: int), ds (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col0 (type: int)
                  1 _col0 (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t1
            Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), value (type: int)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col0 (type: int)
                  1 _col0 (type: int)
                outputColumnNames: _col0, _col1, _col3, _col4, _col5
                Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), '20220120' (type: string), _col3 (type: int), _col4 (type: int), _col5 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                  Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

从上面的执行计划可以看出：总共有3个stage，

STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3

其中stage4是map任务读取t2表，将t2表加载成HashTable，用于map端join。t2表数据量为1行。

 TableScan
        alias: t2
        Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
        Select Operator
          expressions: id (type: int), value (type: int), ds (type: string)
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
          HashTable Sink Operator

stage3是map任务读取t1表数据并执行map端join。t1表数量为1行，执行了过滤操作。

TableScan
            alias: t1
            Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), value (type: int)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col0 (type: int)
                  1 _col0 (type: int)
                outputColumnNames: _col0, _col1, _col3, _col4, _col5
                Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE

Stage-0进行结果输出，最终并未执行过操作。

Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink

总结本文主要结合具体的使用示例，对HiveSQL的LEFT JOIN操作进行了详细解释。主要包括两种比较常见的LEFT JOIN方式，一种是正常的LEFT JOIN，也就是只包含ON条件，这种情况没有过滤操作，即左表的数据会全部返回。另一种方式是有谓词下推，即关联的时候使用了WHERE条件，这个时候会会对数据进行过滤。所以在写SQL的时候，尤其需要注意这些细节问题，以免出现意想不到的错误结果。

文章详情

关于 HiveSQL 常见的 Left Join 误区，你知道吗

关于LEFT JOIN

执行计划

正常LEFT JOIN

谓词下推的LEFT JOIN

软考中级精品资料免费领

相关文章

猜你喜欢

关于 HiveSQL 常见的 Left Join 误区，你知道吗

关于云计算和区块链的常见误解，你知道多少？

关于5G的误解，你知道吗？

中小型组织常见的网络安全误区，你知道了吗？

那些关于DOM的常见Hook封装，你知道几个？

MySQL 那些常见的错误设计规范,你都知道吗

关于区块链技术与比特币的关系，你知道吗？

你知道常见的 java spring 注解错误有哪些吗？(常见的java spring注解错误有哪些)

你知道吗？在ASP和JavaScript中，函数和数组有哪些常见的错误使用方式？