ClickHouse 的 SQL 优化规则是基于RBO(Rule Based Optimization),下面是一些优化规则
1)上传官方的数据集
将visits_v1.tar和hits_v1.tar上传到虚拟机,解压到clickhouse数据路径下
// 解压到clickhouse数据路径
- sudo tar -xvf hits_v1.tar -C /var/lib/clickhouse
- sudo tar -xvf visits_v1.tar -C /var/lib/clickhouse
-
- //修改所属用户
- sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/data/datasets
- sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata/datasets
2)重启clickhouse-server
sudo clickhouse restart
3)执行查询
clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"
clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"
注意:官方的tar包,包含了建库、建表语句、数据内容,这种方式不需要手动建库、建表,最方便。
hits_v1表有130多个字段,880多万条数据
visits_v1表有180多个字段,160多万条数据
在调用 count 函数时,如果使用的是 count() 或者 count(*),且没有 where 条件,则会直接使用 system.tables 的 total_rows,例如:
- EXPLAIN SELECT count()FROM datasets.hits_v1;
-
- Union
- Expression (Projection)
- Expression (Before ORDER BY and SELECT)
- MergingAggregated
- ReadNothing (Optimized trivial count)
注意 Optimized trivial count ,这是对 count 的优化。
如果 count 具体的列字段,则不会使用此项优化:
- EXPLAIN SELECT count(CounterID) FROM datasets.hits_v1;
-
- Union
- Expression (Projection)
- Expression (Before ORDER BY and SELECT)
- Aggregating
- Expression (Before GROUP BY)
- ReadFromStorage (Read from MergeTree)
下面语句子查询中有两个重复的 id 字段,会被去重:
- EXPLAIN SYNTAX SELECT
- a.UserID,
- b.VisitID,
- a.URL,
- b.UserID
- FROM
- hits_v1 AS a
- LEFT JOIN (
- SELECT
- UserID,
- UserID as HaHa,
- VisitID
- FROM visits_v1) AS b
- USING (UserID)
- limit 3;
-
- //返回优化语句:
- SELECT
- UserID,
- VisitID,
- URL,
- b.UserID
- FROM hits_v1 AS a
- ALL LEFT JOIN
- (
- SELECT
- UserID,
- VisitID
- FROM visits_v1
- ) AS b USING (UserID)
- LIMIT 3
当group by有having子句,但是没有with cube、with rollup 或者with totals修饰的时候,having过滤会下推到where提前过滤。例如下面的查询,HAVING name变成了WHERE name,在group by之前过滤:
- EXPLAIN SYNTAX SELECT UserID FROM hits_v1 GROUP BY UserID HAVING UserID = '8585742290196126178';
-
- //返回优化语句
- SELECT UserID
- FROM hits_v1
- WHERE UserID = \'8585742290196126178\'
- GROUP BY UserID
子查询也支持谓词下推:
- EXPLAIN SYNTAX
- SELECT *
- FROM
- (
- SELECT UserID
- FROM visits_v1
- )
- WHERE UserID = '8585742290196126178'
-
- //返回优化后的语句
- SELECT UserID
- FROM
- (
- SELECT UserID
- FROM visits_v1
- WHERE UserID = \'8585742290196126178\'
- )
- WHERE UserID = \'8585742290196126178\'
再来一个复杂例子:
- //返回优化后的语句
- SELECT UserID
- FROM
- (
- SELECT UserID
- FROM
- (
- SELECT UserID
- FROM visits_v1
- WHERE UserID = \'8585742290196126178\'
- )
- WHERE UserID = \'8585742290196126178\'
- UNION ALL
- SELECT UserID
- FROM
- (
- SELECT UserID
- FROM visits_v1
- WHERE UserID = \'8585742290196126178\'
- )
- WHERE UserID = \'8585742290196126178\'
- )
- WHERE UserID = \'8585742290196126178\'
聚合函数内的计算,会外推,例如:
- EXPLAIN SYNTAX
- SELECT sum(UserID * 2)
- FROM visits_v1
-
- //返回优化后的语句
- SELECT sum(UserID) * 2
- FROM visits_v1
如果对聚合键,也就是 group by key 使用 min、max、any 聚合函数,则将函数消除,例如:
- EXPLAIN SYNTAX
- SELECT
- sum(UserID * 2),
- max(VisitID),
- max(UserID)
- FROM visits_v1
- GROUP BY UserID
-
- //返回优化后的语句
- SELECT
- sum(UserID) * 2,
- max(VisitID),
- UserID
- FROM visits_v1
- GROUP BY UserID
例如下面的语句,重复的聚合键 id 字段会被去重:
- EXPLAIN SYNTAX
- SELECT *
- FROM visits_v1
- ORDER BY
- UserID ASC,
- UserID ASC,
- VisitID ASC,
- VisitID ASC
-
-
- //返回优化后的语句:
- select
- ……
- FROM visits_v1
- ORDER BY
- UserID ASC,
- VisitID ASC
例如下面的语句,重复声明的 name 字段会被去重:
- EXPLAIN SYNTAX
- SELECT *
- FROM visits_v1
- LIMIT 3 BY
- VisitID,
- VisitID
- LIMIT 10
-
-
- //返回优化后的语句:
- select
- ……
- FROM visits_v1
- LIMIT 3 BY VisitID
- LIMIT 10
例如下面的语句,重复的关联键 id 字段会被去重:
- EXPLAIN SYNTAX
- SELECT
- a.UserID,
- a.UserID,
- b.VisitID,
- a.URL,
- b.UserID
- FROM hits_v1 AS a
- LEFT JOIN visits_v1 AS b USING (UserID, UserID)
-
-
- //返回优化后的语句:
- SELECT
- UserID,
- UserID,
- VisitID,
- URL,
- b.UserID
- FROM hits_v1 AS a
- ALL LEFT JOIN visits_v1 AS b USING (UserID)
如果子查询只返回一行数据,在被引用的时候用标量替换,例如下面语句中的 total_disk_usage 字段:
- EXPLAIN SYNTAX
- WITH
- (
- SELECT sum(bytes)
- FROM system.parts
- WHERE active
- ) AS total_disk_usage
- SELECT
- (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
- table
- FROM system.parts
- GROUP BY table
- ORDER BY table_disk_usage DESC
- LIMIT 10;
-
-
- //返回优化后的语句:
- WITH CAST(0, \'UInt64\') AS total_disk_usage
- SELECT
- (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
- table
- FROM system.parts
- GROUP BY table
- ORDER BY table_disk_usage DESC
- LIMIT 10
如果开启了 optimize_if_chain_to_multiif 参数,三元运算符会被替换成 multiIf 函数,例如:
- EXPLAIN SYNTAX
- SELECT number = 1 ? 'hello' : (number = 2 ? 'world' : 'atguigu')
- FROM numbers(10)
- settings optimize_if_chain_to_multiif = 1;
-
- // 返回优化后的语句:
- SELECT multiIf(number = 1, \'hello\', number = 2, \'world\', \'atguigu\')
- FROM numbers(10)
- SETTINGS optimize_if_chain_to_multiif = 1