数据库性能优化之IN子查询优化-编程学习网

问题定义

为了获取最近一年内有订单的用户信息，可以使用以下的三种写法去实现，它们在语义上是等价的。那它们的性能如何，适用场景是什么？这是本文讨论的主题。

Query1 - IN子查询(= ANY)

select * from customer where c_custkey in (select o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year)

Query2 - EXISTS子查询

select * from customer where exists (select * from orders where c_custkey = o_custkey and O_ORDERDATE>=current_date - interval 1 year)

Query3 - JOIN方式

select c.* from customer c join (select distinct o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year) o where o.o_custkey = c.custkey

IN子查询

IN子查询并不一定是非相关子查询，但是为了讨论方便，本文所述的IN子查询为非相关子查询。

Query1 - IN子查询(= ANY)

select * from customer where c_custkey in (select o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year)

IN子查询的伪代码实现逻辑：

执行子查询语句，并得到结果集并去重，并将结果集存储在临时表中。
将主查询中的值逐一与子查询结果集中的值进行比较，如果匹配成功，则返回该行数据。
在第二步的比较时。

可以将子查询的结果集转化为一个哈希表，然后对于主查询中的每一行，都在哈希表中查找该行的值是否存在。
可以在上面建立一个唯一性索引，通过此索引和外表进行关联。不论适用哪一种方式，它的实际复杂度都为O(1)

时间复杂度

它的时间复杂度为O(max(m,n)) + nlogn，其中，m是外表的记录数，n为子查询的记录数。

可以看到，如果子查询的记录数比较大时，其时间复杂度较大，性能较差。

EXISTS子查询

Query2 - EXISTS子查询

select * from customer where exists (select * from orders where c_custkey = o_custkey and O_ORDERDATE>=current_date - interval 1 year)

实现逻辑如下：

对于主查询中的每一行，都执行一次子查询。
如果子查询返回的结果集不为空，则保留该行数据。

时间复杂度

因此它的时间复杂度为O(m*n), 其中m为外表的记录数，n为子查询的访问的记录数。

如果子查询中的orders没有索引，则n为orders表的行数。
如果orders上有筛选率比较大的索引，则n为索引所筛选出的记录数。

可以看出，如果EXISTS的子查询中有筛选率非常高的索引，使用EXISTS子查询的性能比较好。

Join方式

为了保证语义一致性，使用join方式需要先进行去重操作。

Query3 - JOIN方式：

select c.* from customer c join (select distinct o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year) o where o.o_custkey = c.custkey

对比IN子查询的执行计划，可以看到Join方式就是IN子查询的执行计划的SQL化表达。

如果如果子查询中的查询列是唯一的，那么可以将其转换为内连接，从而获得更好的性能。

数据库中的IN子查询优化

事实上，MySQL和PostgreSQL都可以对IN和EXISTS采取最优的执行计划。

如果没有O_ORDERDATE上的索引，Query1和Query2在MySQL上的执行计划都是采用IN子查询的伪代码实现逻辑：

-> Nested loop inner join  (cost=19847117.66 rows=198449671)
   -> Table scan on customer  (cost=1155.80 rows=9948)
   -> Single-row index lookup on  using  (o_custkey=customer.C_CUSTKEY)
       -> Materialize with deduplication  (cost=22471.48..22471.48 rows=19949)
           -> Filter: (orders.O_ORDERDATE = ((curdate() - interval 1 year)))  (cost=20476.61 rows=19949)
               -> Table scan on orders  (cost=20476.61 rows=199487)

如果在O_ORDERDATE建立一个索引，那么它们的执行计划都是采用EXISTS子查询的伪代码实现逻辑：

-> Nested loop semijoin  (cost=22777.29 rows=5705)
   -> Table scan on customer  (cost=1155.80 rows=9948)
   -> Filter: (orders.O_ORDERDATE = ((curdate() - interval 1 year)))  (cost=0.92 rows=1)
       -> Index lookup on orders using o_idx_key (O_CUSTKEY=customer.C_CUSTKEY)  (cost=0.92 rows=6)

如果子查询中的查询列是唯一的，那么数据库会将其转换为内连接。

譬如对于下面的SQL。

select * from orders where o_custkey in (select c_custkey from customer where c_phone like '139%')

MySQL的执行计划是这样的（PostgreSQL也是类似的）：

-> Nested loop inner join  (cost=3541.61 rows=6313)
   -> Filter: (customer.C_PHONE like '139%')  (cost=1148.89 rows=1099)
       -> Table scan on customer  (cost=1148.89 rows=9888)
   -> Index lookup on orders using idx_orders_ckey (O_CUSTKEY=customer.C_CUSTKEY)  (cost=1.60 rows=6)

可以看出，在MySQL和PostgreSQL数据库中，使用IN或是EXISTS的写法是等价的，数据库总是可以根据索引和统计信息采用最优的执行计划。

PawSQL中的IN子查询优化

PawSQL中会将IN子查询重写为EXISTS子查询或是内连接查询，从而帮助索引推荐引擎推荐合适的索引，促使优化器采用最优的执行计划。

IN子查询转换为EXISTS

原SQL：

select *
from tpch.customer
where customer.c_custkey in (
          select orders.o_custkey
          from tpch.orders
          where orders.O_ORDERDATE >= current_date - interval '1' YEAR)

应用重写优化，转换为：

select  *
from tpch.customer
where exists (select  orders.o_custkey
            from tpch.orders
            where orders.O_ORDERDATE >= current_date - interval '1' YEAR
        and orders.o_custkey = customer.c_custkey)

基于转换后的SQL，推荐索引：

CREATE INDEX PAW_IDX1072908633 ON tpch.ORDERS(O_ORDERDATE,O_CUSTKEY);
-- 当QB_2中引用的表ORDERS作为驱动表时, 索引PAW_IDX1072908633可以被用来进行索引范围查找,过滤条件为(orders.O_ORDERDATE >= current_date - interval '1' YEAR); 该索引是个覆盖索引，可以避免回表.

性能验证：

执行计划（优化前）

-> Nested loop inner join  (cost=65987720.69 rows=659855821)
 -> Table scan on customer  (cost=1149.80 rows=9888)
 -> Single-row index lookup on  using  (o_custkey=customer.C_CUSTKEY)
     -> Materialize with deduplication  (cost=13874.51..13874.51 rows=66733)
         -> Filter: (orders.O_ORDERDATE >= ((curdate() - interval '1' year)))  (cost=7201.21 rows=66733)
             -> Table scan on orders  (cost=7201.21 rows=200219)

执行计划（优化后）

-> Nested loop inner join  (cost=3771444.20 rows=37693056)
 -> Table scan on customer  (cost=1149.80 rows=9888)
 -> Single-row index lookup on  using  (o_custkey=customer.C_CUSTKEY)
     -> Materialize with deduplication  (cost=1150.65..1150.65 rows=3812)
         -> Filter: (orders.O_ORDERDATE >= ((curdate() - interval '1' year)))  (cost=769.45 rows=3812)
             -> Covering index range scan on orders using PAW_IDX1072908633 over ('2022-03-28' <= O_ORDERDATE)  (cost=769.45 rows=3812)

本次优化实施后，预计本SQL的性能将提升 1648.67%。

IN子查询转换为内连接

原SQL，c_custkey是customer表的主键。

select *
  from tpch.orders
  where orders.o_custkey in (
             select customer.c_custkey
             from tpch.customer)

应用重写优化，转化为内连接。

select orders.*
  from tpch.orders, tpch.customer
  where customer.c_custkey = orders.o_custkey

基于转换后的SQL，推荐索引。

CREATE INDEX PAW_IDX0455857015 ON tpch.ORDERS(O_CUSTKEY,O_CLERK);
  -- 当ORDERS作为被驱动表时, 索引PAW_IDX0455857015可以被用来进行索引查找, 过滤条件为(customer.c_custkey = orders.o_custkey).

性能验证。

执行计划（优化前）

-> Nested loop inner join  (cost=240790.71 rows=200219)
 -> Table scan on orders  (cost=20549.81 rows=200219)
 -> Single-row covering index lookup on customer using key_idx (C_CUSTKEY=orders.O_CUSTKEY)  (cost=1.00 rows=1)

执行计划（优化后）

-> Nested loop inner join  (cost=21289.23 rows=53135)
 -> Index scan on customer using key_idx  (cost=1149.80 rows=9888)
 -> Index lookup on orders using PAW_IDX0455857015 (O_CUSTKEY=customer.C_CUSTKEY)  (cost=1.50 rows=5)

本次优化实施后，预计本SQL的性能将提升 1064.60%

关于PawSQL

PawSQL专注数据库性能优化的自动化和智能化，支持MySQL，PostgreSQL，Opengauss等，提供的SQL优化产品包括

PawSQL Cloud，在线自动化SQL优化工具，支持SQL审查，智能查询重写、基于代价的索引推荐，适用于数据库管理员及数据应用开发人员。
PawSQL Advisor，IntelliJ 插件, 适用于数据应用开发人员，可以IDEA/DataGrip应用市场通过名称搜索“PawSQL Advisor”安装。
PawSQL Engine, 是PawSQL系列产品的后端优化引擎，可以独立安装部署，并通过http/json的接口提供SQL优化服务。PawSQL Engine以docker镜像的方式提供部署安装。

文章详情

数据库性能优化之IN子查询优化

问题定义

IN子查询

IN子查询的伪代码实现逻辑：

时间复杂度

EXISTS子查询

实现逻辑如下：

时间复杂度

Join方式

数据库中的IN子查询优化

PawSQL中的IN子查询优化

IN子查询转换为EXISTS

IN子查询转换为内连接

关于PawSQL

软考中级精品资料免费领

相关文章

猜你喜欢

数据库性能优化之IN子查询优化

数据库查询优化之子查询优化的示例分析

Mysql查询优化之IN子查询优化方法详解

PHP接口性能优化之数据库查询优化策略（如何优化PHP接口中的数据库查询以提升性能？）

数据库查询性能优化指南

Golang 技术性能优化中如何优化数据库查询？

MySQL数据库查询性能优化策略

MySQL之select in 子查询优化的实现

怎么优化NoSQL数据库的查询性能

如何优化PHP中的数据库查询性能

如何通过查询优化MySQL数据库性能

MySQL性能调优之查询优化的方法

mysql-查询性能优化

如何通过php函数优化数据库查询性能？

如何优化 PHP 函数中数据库查询的性能？

MySQL优化 - 性能分析与查询优化

PHP 性能优化：数据库优化指南

MySQL 大数据in查询该怎么优化

SQL优化教程之in与range查询

SQL查询优化-MySQL 性能调优