以下是同事人员发过来的sql,让我帮他优化一下,clickhouse的语句,需要替换成在自己的数据库上执行,dth_diagnosis_all 这个表的数据量90217505 ,千万级的数据量,说是内存不足,sql执行不了。
-
- select sum(case when re.diagnosis_code = 'E10' then 1 else 0 end) as E10Count,
- sum(case when re.diagnosis_code = 'E11' then 1 else 0 end) as E11Count,
- sum(case when re.diagnosis_code != 'E10' and re.diagnosis_code != 'E11' then 1 else 0 end) as otherCount
- from (select t1.id_no as id_no, left(t1.diagnosis_code,3) as diagnosis_code
- from (select t2.id_no, t2.org_code, t2.event_type_code, t2.org_treatment_no,
- max(t2.register_date) as register_date,t2.diagnosis_code
- from dth.dth_diagnosis_all t2
- where t2.register_date <= #{initYear}
- and left(t2.diagnosis_code,3) in('E10','E11','E12','E13','E14')
- group by t2.id_no,t2.org_code,t2.event_type_code,t2.org_treatment_no,t2.diagnosis_code)t3,
- dth.dth_diagnosis_all t1
- where t1.id_no = t3.id_no
- and t1.org_code = t3.org_code and t1.diagnosis_code = t3.diagnosis_code
- and t1.event_type_code = t3.event_type_code
- and t1.org_treatment_no = t3.org_treatment_no
- and t1.register_date = t3.register_date
- and t1.org_province = #{province}
- and t1.org_city = #{city}
- )re left join (select * from dth_ehr_pir_all
- where birth_date <= #{initYear}
- and register_province = #{province}
- and register_city = #{city}
- and diabetes_flag != '0'
- and (death_date = ''
- or death_date > #{initYear})) t
- on t.id_no = re.id_no
- where t.birth_date <= #{initYear}
- and t.register_province = #{province}
- and t.register_city = #{city}
- and t.diabetes_flag != '0'
- and (t.death_date = '' or t.death_date > #{initYear})
- 参数:initYear(查询年份12月31,如2029-12-31), #{province},查询省份:(山东省), #{city} :查询城市
两个子查询的数据量
以下为优化后的sql,看这个sql子查询t1和t3,其实是同一张表,然后对同一张表做连接查询,对t3表做了where条件过滤(这个不用改),之后连接查询之后又对t1表进行了where条件过滤,如下:
t1.org_province = #{province}
and t1.org_city = #{city}
为啥不提前提到t1子查询里面去呢,
再看一下join条件中
and t1.diagnosis_code = t3.diagnosis_code
and t1.register_date = t3.register_date,
说明t3和t1连接条件是一样的,又是同一张表,t3中也对这两个列做了where限制,那么子查询t1中也可以加入这两个条件,于是把一下四个条件提到t1的子查询中去,大大降低了临时表占用的内存空间,
t1.org_province = #{province}
and t1.org_city = #{city}
and register_date <= #{initYear}
and left(diagnosis_code,3) in('E10','E11','E12','E13','E14')
- (
- select *
- from dth.dth_diagnosis_all
- where
- org_province = #{province} and org_city = #{city}
- and register_date <= #{initYear}
- and left(diagnosis_code,3) in('E10','E11','E12','E13','E14')
- ) t1
然后就是针对 t 表做优化,子查询中已经对 t 表做了where条件过滤,连接查询后面又对 t 表进行单独的过滤,这个完全没必要再加同样的条件了。
之后再看最外层的
sum(case when re.diagnosis_code = 'E10' then 1 else 0 end) as E10Count,
sum(case when re.diagnosis_code = 'E11' then 1 else 0 end) as E11Count,
sum(case when re.diagnosis_code != 'E10' and re.diagnosis_code != 'E11' then 1 else 0 end) as otherCount
这个一看就是对条件进行统计个数,可以改成效率更高的count
改下如下:
count(case when re.diagnosis_code = 'E10' then '1' else null end) as E10Count,
count(case when re.diagnosis_code = 'E11' then '1' else null end) as E11Count,
count(case when re.diagnosis_code != 'E10' and re.diagnosis_code != 'E11' then '1' else null end) as otherCount
至此改完之后的结果
- select
- count(case when re.diagnosis_code = 'E10' then '1' else null end) as E10Count,
- count(case when re.diagnosis_code = 'E11' then '1' else null end) as E11Count,
- count(case when re.diagnosis_code != 'E10' and re.diagnosis_code != 'E11' then '1' else null end) as otherCount
- from
- (
- select
- t1.id_no as id_no,
- left(t1.diagnosis_code,3) as diagnosis_code
- from
- (
- select
- t2.id_no, t2.org_code, t2.event_type_code, t2.org_treatment_no,
- max(t2.register_date) as register_date,
- t2.diagnosis_code
- from dth.dth_diagnosis_all t2
- where
- t2.register_date <= #{initYear}
- and left(t2.diagnosis_code,3) in('E10','E11','E12','E13','E14')
- group by
- t2.id_no,t2.org_code,t2.event_type_code,t2.org_treatment_no,t2.diagnosis_code
- ) t3
- inner join
- (
- select *
- from dth.dth_diagnosis_all
- where
- org_province = #{province} and org_city = #{city}
- and register_date <= #{initYear}
- and left(diagnosis_code,3) in('E10','E11','E12','E13','E14')
- ) t1
- on
- t1.id_no = t3.id_no
- and t1.org_code = t3.org_code
- and t1.diagnosis_code = t3.diagnosis_code
- and t1.event_type_code = t3.event_type_code
- and t1.org_treatment_no = t3.org_treatment_no
- and t1.register_date = t3.register_date
- ) re
- left join
- (
- select *
- from dth_ehr_pir_all
- where
- birth_date <= #{initYear}
- and register_province = #{province}
- and register_city = #{city}
- and diabetes_flag != '0'
- and (death_date = '' or death_date > #{initYear})
- ) t
- on t.id_no = re.id_no;
sql能跑出来了,不会报出内存不足了