• GPDB vs CK - NYC taxi data 简单测试对比


    今天开始,我想做一个 GreenplumClickHouse 的一系列简单测试对比,使用CK官网提供的数据集,主要目的有2个:

    1. 简单体验一下 CK 在单表测试上的性能比 GP 好多少;
    2. 踩踩坑,积累一些二者对比测试经验。

    特别说明:由于二者的特点各不相同,该测试仅供大家简单参考,不作为任何评判标准。

    测试环境

    • 个人 MacBook 笔记本
    • 内存 32 GB
    • CPU 6 cores
    • 本地 SSD
    • GPDB 分配 5个 Segment Instance
    • CK 单 Server

    NYC taxi data 简单测试对比

    首先我们今天测试的参考内容源自这里 -> https://clickhouse.com/docs/en/tutorial

    1. CK 数据加载

    首先我们根据官网指导,创建表:

    CREATE TABLE trips
    (
        `trip_id` UInt32,
        `vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
        `pickup_date` Date,
        `pickup_datetime` DateTime,
        `dropoff_date` Date,
        `dropoff_datetime` DateTime,
        `store_and_fwd_flag` UInt8,
        `rate_code_id` UInt8,
        `pickup_longitude` Float64,
        `pickup_latitude` Float64,
        `dropoff_longitude` Float64,
        `dropoff_latitude` Float64,
        `passenger_count` UInt8,
        `trip_distance` Float64,
        `fare_amount` Float32,
        `extra` Float32,
        `mta_tax` Float32,
        `tip_amount` Float32,
        `tolls_amount` Float32,
        `ehail_fee` Float32,
        `improvement_surcharge` Float32,
        `total_amount` Float32,
        `payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
        `trip_type` UInt8,
        `pickup` FixedString(25),
        `dropoff` FixedString(25),
        `cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
        `pickup_nyct2010_gid` Int8,
        `pickup_ctlabel` Float32,
        `pickup_borocode` Int8,
        `pickup_ct2010` String,
        `pickup_boroct2010` String,
        `pickup_cdeligibil` String,
        `pickup_ntacode` FixedString(4),
        `pickup_ntaname` String,
        `pickup_puma` UInt16,
        `dropoff_nyct2010_gid` UInt8,
        `dropoff_ctlabel` Float32,
        `dropoff_borocode` UInt8,
        `dropoff_ct2010` String,
        `dropoff_boroct2010` String,
        `dropoff_cdeligibil` String,
        `dropoff_ntacode` FixedString(4),
        `dropoff_ntaname` String,
        `dropoff_puma` UInt16
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(pickup_date)
    ORDER BY pickup_datetime;
    

    接下来,我们加载数据,CK 提供 S3 上的数据操作函数,可以一条命令下载并插入数据到表中,还是比较方便的:

    INSERT INTO trips
        SELECT * FROM s3(
            'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
            'TabSeparatedWithNames'
        )
    

    我家里的网络条件一般般,最后下载入库花了大概 4 分钟:

    Query id: 62eeb743-7f34-441a-8e8b-3bd7f82163db
    
    Ok.
    
    0 rows in set. Elapsed: 221.145 sec. Processed 2.00 million rows, 974.44 MB (9.04 thousand rows/s., 4.41 MB/s.)
    
    ClickHouse.localdomain :) select count(*) from trips;
    
    SELECT count(*)
    FROM trips
    
    Query id: 7c0d770f-a161-4ad5-a59b-b9320a192d90
    
    ┌─count()─┐
    │ 1999657 │
    └─────────┘
    
    1 row in set. Elapsed: 0.002 sec.
    
    ClickHouse.localdomain :)
    

    如果你家里的网络条件不好,也可以选择将文件下载到本地再入库,方法如下(上面如果已经将数据入库,请忽略这部分内容,直接看第2部分):

    先下载两个文件,然后再解压缩,导入 CK:

    wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz
    gunzip trips_1.gz
    gunzip trips_2.gz
    clickhouse-client --query "INSERT INTO trips FORMAT TabSeparatedWithNames" --max_insert_block_size=100000 --password < trips_1
    clickhouse-client --query "INSERT INTO trips FORMAT TabSeparatedWithNames" --max_insert_block_size=100000 --password < trips_2
    

    2. GP 数据加载

    同样,在 GP 里也创建表,因为 GP 中默认没有 ENUM 类型,所以这里索性直接使用 int 替代 CK 表结构中的 ENUM,其他数据类型也都做了简单的对应转换:

    CREATE TABLE trips
    (
        trip_id int,
        vendor_id text,
        pickup_date Date,
        pickup_datetime timestamp(0),
        dropoff_date Date,
        dropoff_datetime timestamp(0),
        store_and_fwd_flag smallint,
        rate_code_id smallint,
        pickup_longitude numeric,
        pickup_latitude numeric,
        dropoff_longitude numeric,
        dropoff_latitude numeric,
        passenger_count smallint,
        trip_distance numeric,
        fare_amount numeric,
        extra numeric,
        mta_tax numeric,
        tip_amount numeric,
        tolls_amount numeric,
        ehail_fee numeric,
        improvement_surcharge numeric,
        total_amount numeric,
        payment_type text,
        trip_type smallint,
        pickup char(25),
        dropoff char(25),
        cab_type text,
        pickup_nyct2010_gid smallint,
        pickup_ctlabel numeric,
        pickup_borocode smallint,
        pickup_ct2010 text,
        pickup_boroct2010 text,
        pickup_cdeligibil text,
        pickup_ntacode char(4),
        pickup_ntaname text,
        pickup_puma smallint,
        dropoff_nyct2010_gid smallint,
        dropoff_ctlabel numeric,
        dropoff_borocode smallint,
        dropoff_ct2010 text,
        dropoff_boroct2010 text,
        dropoff_cdeligibil text,
        dropoff_ntacode char(4),
        dropoff_ntaname text,
        dropoff_puma smallint,
        primary key(trip_id,pickup_datetime)
    );
    

    接下来我们将数据装载到 GP 中,因为数据量不大,为了简化步骤就不使用gpload或者gpfdist这种入库方式了,直接用最简单的 COPY 入库:

    下载数据并解压:

    wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz
    gunzip trips_1.gz
    gunzip trips_2.gz
    

    这里注意,我看了一下文件的编码格式,是 latin1的,所以需要先转换一下,改为 UTF8:
    在这里插入图片描述

    iconv -f latin1 -t utf8 -o trips_1.tsv trips_1
    iconv -f latin1 -t utf8 -o trips_2.tsv trips_2
    

    然后进入 psql 使用 COPY 加载数据:

    postgres=# select count(*) from trips;
     count
    -------
         0
    (1 row)
    
    postgres=# copy trips from '/home/odb1/trips_1' with header;
    ERROR:  invalid byte sequence for encoding "UTF8": 0x95
    HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
    CONTEXT:  COPY trips, line 4
    postgres=# copy trips from '/home/odb1/trips_1.tsv' with header;
    COPY 999832
    postgres=# copy trips from '/home/odb1/trips_2.tsv' with header;
    COPY 999825
    postgres=# select count(*) from trips;
      count
    ---------
     1999657
    (1 row)
    

    3. 对比测试及分析

    如果大家感觉到 SQL 对比执行过程较为繁琐,可以直接跳转到第4部分查看对比结果。

    3.1 count

    ClickHouse.localdomain :) SELECT count() FROM trips;
    
    SELECT count()
    FROM trips
    
    Query id: 8ac52d6d-3515-47ed-9afb-14efedaa1cb9
    
    ┌─count()─┐
    │ 1999657 │
    └─────────┘
    
    1 row in set. Elapsed: 0.002 sec.
    
    postgres=# SELECT count(*) FROM trips;
      count
    ---------
     1999657
    (1 row)
    
    Time: 75.702 ms
    

    分析:CK维护了 count 索引,count 简直是 PostgreSQL 生态数据库的痛啊,性能差了37倍。

    3.2 DISTINCT

    SELECT DISTINCT(pickup_ntaname) FROM trips;
    190 rows in set. Elapsed: 0.033 sec. Processed 2.00 million rows, 60.32 MB (61.25 million rows/s., 1.85 GB/s.)
    
    SELECT DISTINCT(pickup_ntaname) FROM trips;
    Time: 161.792 ms
    

    分析:GP 慢了将近5倍

    3.3 avg

    ClickHouse.localdomain :) SELECT avg(tip_amount) FROM trips
    
    SELECT avg(tip_amount)
    FROM trips
    
    Query id: 5d316394-9263-45c6-a8a3-d135701f71c3
    
    ┌────avg(tip_amount)─┐
    │ 1.6847585845085191 │
    └────────────────────┘
    
    1 row in set. Elapsed: 0.010 sec. Processed 2.00 million rows, 8.00 MB (193.95 million rows/s., 775.81 MB/s.)
    
    postgres=# SELECT avg(tip_amount) FROM trips;
            avg
    --------------------
     1.6847585810966581
    (1 row)
    
    Time: 114.005 ms
    

    分析:GP 慢了11倍

    3.4 group by + avg

    ClickHouse.localdomain :) SELECT
                                  passenger_count,
                                  ceil(avg(total_amount),2) AS average_total_amount
                              FROM trips
                              GROUP BY passenger_count;
    
    SELECT
        passenger_count,
        ceil(avg(total_amount), 2) AS average_total_amount
    FROM trips
    GROUP BY passenger_count
    
    Query id: 1c875b66-88bf-466c-aa66-bf474333dd03
    
    ┌─passenger_count─┬─average_total_amount─┐
    │               022.69 │
    │               115.97 │
    │               217.15 │
    │               316.76 │
    │               417.33 │
    │               516.35 │
    │               616.04 │
    │               759.8 │
    │               836.41 │
    │               99.81 │
    └─────────────────┴──────────────────────┘
    
    10 rows in set. Elapsed: 0.029 sec. Processed 2.00 million rows, 10.00 MB (68.40 million rows/s., 342.01 MB/s.)
    
    postgres=# SELECT
        passenger_count,
        round(avg(total_amount),2) AS average_total_amount
    FROM trips
    GROUP BY passenger_count;
     passenger_count | average_total_amount
    -----------------+----------------------
                   7 |                59.79
                   6 |                16.03
                   4 |                17.32
                   3 |                16.75
                   5 |                16.35
                   9 |                 9.80
                   8 |                36.40
                   1 |                15.97
                   2 |                17.14
                   0 |                22.68
    (10 rows)
    
    Time: 169.630 ms
    

    分析:GP 慢了将近6倍

    3.5 group by + order by

    SELECT
        pickup_date,
        pickup_ntaname,
        SUM(1) AS number_of_trips
    FROM trips
    GROUP BY pickup_date, pickup_ntaname
    ORDER BY pickup_date ASC;
    
    8751 rows in set. Elapsed: 0.032 sec. Processed 2.00 million rows, 64.32 MB (61.54 million rows/s., 1.98 GB/s.)
    
    SELECT
        pickup_date,
        pickup_ntaname,
        SUM(1) AS number_of_trips
    FROM trips
    GROUP BY pickup_date, pickup_ntaname
    ORDER BY pickup_date ASC;
    
    Time: 201.707 ms
    

    分析:GP 慢了6倍多

    3.6 filter + group by + order by + avg + count

    ClickHouse.localdomain :) SELECT
                                  avg(tip_amount) AS avg_tip,
                                  avg(fare_amount) AS avg_fare,
                                  avg(passenger_count) AS avg_passenger,
                                  count() AS count,
                                  truncate(date_diff('second', pickup_datetime, dropoff_datetime)/3600) as trip_minutes
                              FROM trips
                              WHERE trip_minutes > 0
                              GROUP BY trip_minutes
                              ORDER BY trip_minutes DESC
    
    SELECT
        avg(tip_amount) AS avg_tip,
        avg(fare_amount) AS avg_fare,
        avg(passenger_count) AS avg_passenger,
        count() AS count,
        truncate(dateDiff('second', pickup_datetime, dropoff_datetime) / 3600) AS trip_minutes
    FROM trips
    WHERE trip_minutes > 0
    GROUP BY trip_minutes
    ORDER BY trip_minutes DESC
    
    Query id: a9375802-23cf-408f-9173-faf08b70a23b
    
    ┌────────────avg_tip─┬───────────avg_fare─┬──────avg_passenger─┬─count─┬─trip_minutes─┐
    │ 0.9800000190734863101.52458 │
    │ 1.182367894984158714.4933779285902972.060200668896321149523 │
    │ 2.115957474454920623.228723404255322.46808510638297874722 │
    │ 1.121818152340975613.6818181818181821.90909090909090921121 │
    │ 0.321818183768879318.0454545454545472.36363636363636381120 │
    │ 2.149000000953674517.551.51019 │
    │  4.537058907396653371.76470588235294111718 │
    │  1.42500000618971330.388461773212141.42307692307692312617 │
    │  1.77049998641014120.052.22016 │
    │ 1.490869576516358722.7826086956521762.1304347826086962315 │
    │   1.4415789560267823.2099998875668162.31578947368421061914 │
    │  7.70846149554619427.5769230769230771.76923076923076921313 │
    │  2.81444440616501723.0722221798366971.94444444444444441812 │
    │  2.03133335113525417.7333333333333341.93333333333333331511 │
    │ 1.735999967157840822.2752.32010 │
    │ 1.960555563370386832.4861111111111142.3055555555555554369 │
    │ 1.963478246460790325.6086956521739141.9130434782608696238 │
    │ 2.581249999503294526.0416666666666681.5833333333333333247 │
    │ 1.550645155291403622.8387096774193562.225806451612903316 │
    │ 1.668125000142026743.468752.25325 │
    │ 0.589516128263165919.2525806042455870.6451612903225806624 │
    │ 1.077631584907832844.263157894736841.763157894736842383 │
    │ 5.287600017189979561.8355000000167651.8752002 │
    │  6.03933384444284454.977343044373471.8110571598943692117391 │
    └────────────────────┴────────────────────┴────────────────────┴───────┴──────────────┘
    
    24 rows in set. Elapsed: 0.034 sec. Processed 2.00 million rows, 33.99 MB (58.78 million rows/s., 999.18 MB/s.)
    
    ClickHouse.localdomain :)
    
    postgres=# select
      avg(tip_amount) AS avg_tip,
      avg(fare_amount) AS avg_fare,
      avg(passenger_count) AS avg_passenger,
      count(*) AS count,
      trip_minutes
    from
    (
          SELECT
            tip_amount,
            fare_amount,
            passenger_count,
            trunc(
                (date_part('hour', dropoff_datetime-pickup_datetime)*3600
                + date_part('minute', dropoff_datetime-pickup_datetime)*60
                + date_part('second', dropoff_datetime-pickup_datetime)
                )/3600
                ) as trip_minutes
          FROM trips
    )foo
    WHERE trip_minutes > 0
    group by trip_minutes
    ORDER BY trip_minutes DESC;
            avg_tip         |      avg_fare       |   avg_passenger   | count | trip_minutes
    ------------------------+---------------------+-------------------+-------+--------------
         1.1823678929765886 | 14.4933779264214047 |  2.06020066889632 |  1495 |           23
         2.1159574468085106 | 23.2287234042553191 |  2.46808510638298 |    47 |           22
         1.1218181818181818 | 13.6818181818181818 |  1.90909090909091 |    11 |           21
     0.32181818181818181818 | 18.0454545454545455 |  2.36363636363636 |    11 |           20
         2.1490000000000000 | 17.5500000000000000 |               1.5 |    10 |           19
         4.5370588235294118 | 37.0000000000000000 |  1.76470588235294 |    17 |           18
         1.4250000000000000 | 30.3884615384615385 |  1.42307692307692 |    26 |           17
         1.7705000000000000 | 20.0500000000000000 |               2.2 |    20 |           16
         1.4908695652173913 | 22.7826086956521739 |   2.1304347826087 |    23 |           15
         1.4415789473684211 | 23.2100000000000000 |  2.31578947368421 |    19 |           14
         7.7084615384615385 | 27.5769230769230769 |  1.76923076923077 |    13 |           13
         2.8144444444444444 | 23.0722222222222222 |  1.94444444444444 |    18 |           12
         2.0313333333333333 | 17.7333333333333333 |  1.93333333333333 |    15 |           11
         1.7360000000000000 | 22.2750000000000000 |               2.3 |    20 |           10
         1.9605555555555556 | 32.4861111111111111 |  2.30555555555556 |    36 |            9
         1.9634782608695652 | 25.6086956521739130 |  1.91304347826087 |    23 |            8
         2.5812500000000000 | 26.0416666666666667 |  1.58333333333333 |    24 |            7
         1.5506451612903226 | 22.8387096774193548 |   2.2258064516129 |    31 |            6
         1.6681250000000000 | 43.4687500000000000 |              2.25 |    32 |            5
     0.58951612903225806452 | 19.2525806451612903 | 0.645161290322581 |    62 |            4
         1.0776315789473684 | 44.2631578947368421 |  1.76315789473684 |    38 |            3
         5.2449504950495050 | 61.3222772277227723 |  1.87128712871287 |   202 |            2
         6.0393338444501235 | 54.9773430445523469 |  1.81105715989437 | 11739 |            1
    (23 rows)
    
    Time: 306.016 ms
    postgres=#
    

    分析:GP 慢了9倍

    3.7 group by + order by

    SELECT
        pickup_ntaname,
        toHour(pickup_datetime) as pickup_hour,
        SUM(1) AS pickups
    FROM trips
    WHERE pickup_ntaname != ''
    GROUP BY pickup_ntaname, pickup_hour
    ORDER BY pickup_ntaname, pickup_hour
    
    3120 rows in set. Elapsed: 0.045 sec. Processed 2.00 million rows, 68.32 MB (44.22 million rows/s., 1.51 GB/s.)
    
    SELECT
        pickup_ntaname,
        date_part('hour',pickup_datetime) as pickup_hour,
        SUM(1) AS pickups
    FROM trips
    WHERE pickup_ntaname != ''
    GROUP BY pickup_ntaname, pickup_hour
    ORDER BY pickup_ntaname, pickup_hour;
    Time: 298.298 ms
    

    分析:GP 慢了6倍多

    3.8 LaGuardia or JFK 机场数据计算

    SELECT
        pickup_datetime,
        dropoff_datetime,
        total_amount,
        pickup_nyct2010_gid,
        dropoff_nyct2010_gid,
        CASE
            WHEN dropoff_nyct2010_gid = 138 THEN 'LGA'
            WHEN dropoff_nyct2010_gid = 132 THEN 'JFK'
        END AS airport_code,
        EXTRACT(YEAR FROM pickup_datetime) AS year,
        EXTRACT(DAY FROM pickup_datetime) AS day,
        EXTRACT(HOUR FROM pickup_datetime) AS hour
    FROM trips
    WHERE dropoff_nyct2010_gid IN (132, 138)
    ORDER BY pickup_datetime;
    
    45299 rows in set. Elapsed: 0.034 sec. Processed 2.00 million rows, 27.99 MB (59.16 million rows/s., 828.16 MB/s.)
    
    SELECT
        pickup_datetime,
        dropoff_datetime,
        total_amount,
        pickup_nyct2010_gid,
        dropoff_nyct2010_gid,
        CASE
            WHEN dropoff_nyct2010_gid = 138 THEN 'LGA'
            WHEN dropoff_nyct2010_gid = 132 THEN 'JFK'
        END AS airport_code,
        EXTRACT(YEAR FROM pickup_datetime) AS year,
        EXTRACT(DAY FROM pickup_datetime) AS day,
        EXTRACT(HOUR FROM pickup_datetime) AS hour
    FROM trips
    WHERE dropoff_nyct2010_gid IN (132, 138)
    ORDER BY pickup_datetime;
    
    Time: 208.689 ms
    

    分析:GP 慢了6倍

    4.总结

    下面给大家总结了一张图,整体来看差距还是比较大的,CK 由于其单 server 单表的一些优秀特性,比如内存中索引的维护、稀疏索引、SIMD等,执行速度特别的快。
    在这里插入图片描述
    当然最后再强调一下,没有最好的数据库,只有最适合的数据库,数据库选型还是需要根据各自的场景去选,CK 的高性能很让人心动,但是对 ACID 支持的不好、多表关联存在先天的缺陷等问题,也会阻碍很多场景下的应用。

    如果对您有用,麻烦动动小手【关注】、【点赞】、【收藏】,谢谢~

    End~

  • 相关阅读:
    微信小程序:两层循环的练习,两层循环显示循环图片大图(大图显示、多层循环)
    安全员c证考试科目有几门----唐山海德教育安全员报考
    [附源码]计算机毕业设计基于JEE平台springboot技术的订餐系统
    USB Type-C数据线美国新标准UL9990报告检测项目
    javaWeb项目-房屋房租租赁系统功能介绍
    永恒之蓝漏洞复现
    Web3.0与机器学习
    Webapck 解决:[webpack-cli] Error: Cannot find module ‘vue-loader/lib/plugin‘ 的问题
    音质和音色一样吗?
    基于SSM实现高校应届生就业管理系统
  • 原文地址:https://blog.csdn.net/chrisy521/article/details/127062298