ClickHouse Database Deployment and Python3 Stress-Testing in Practice

    1. ClickHouse Database Deployment
    • Version: yandex/clickhouse-server:latest

    • Deployment method: Docker

    • Contents (docker-compose.yml):

      version: "3"

      services:
        clickhouse:
          image: yandex/clickhouse-server:latest
          container_name: clickhouse
          ports:
            - "8123:8123"   # HTTP interface
            - "9000:9000"   # native TCP interface
            - "9009:9009"   # interserver (replication) port
            - "9004:9004"   # MySQL wire-protocol port
          volumes:
            - ./data/config:/var/lib/clickhouse   # persist data on the host
          ulimits:
            nproc: 65535
            nofile:
              soft: 262144
              hard: 262144
          healthcheck:
            test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
            interval: 30s
            timeout: 5s
            retries: 3
          deploy:              # resource limits; honored under swarm or
            resources:         # `docker-compose --compatibility`
              limits:
                cpus: '4'
                memory: 4096M
              reservations:
                memory: 4096M
      
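    To confirm the container is actually serving traffic, the same /ping endpoint the healthcheck probes can be hit over HTTP. A minimal sketch in Python, assuming the default port mapping to localhost from the compose file above:

      import urllib.request

      # A healthy server answers HTTP 200 with the body "Ok."
      with urllib.request.urlopen("http://localhost:8123/ping", timeout=5) as resp:
          print(resp.status, resp.read().decode().strip())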
    • Table DDL:

      -- MergeTree requires a sorting key, hence the ORDER BY clause.
      CREATE TABLE ck_table (
          id Int32,
          feild1 String,  feild2 String,  feild3 String,
          feild4 String,  feild5 String,  feild6 String,
          feild7 String,  feild8 String,  feild9 String,
          feild10 String, feild11 String, feild12 String,
          feild13 String, feild14 String, feild15 String,
          feild16 String, feild17 String, feild18 String,
          feild19 String, feild20 String
      ) ENGINE = MergeTree ORDER BY id;
      
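    The DDL can also be issued from Python via clickhouse_driver. A minimal sketch, assuming the server is reachable on localhost; the 20 String columns are generated in a loop rather than written out by hand:

      from clickhouse_driver import Client

      client = Client(host='localhost')  # native protocol, port 9000

      # Build the repetitive column list programmatically.
      columns = ", ".join("feild%d String" % n for n in range(1, 21))
      client.execute(
          "CREATE TABLE IF NOT EXISTS ck_table (id Int32, %s) "
          "ENGINE = MergeTree ORDER BY id" % columns
      )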
    2. Python3 Insert Stress Test
    • Key libraries: clickhouse_driver, concurrent.futures

    • Code:

      import random
      import time
      from clickhouse_driver import Client
      from concurrent.futures import ThreadPoolExecutor


      # Use several connections so no single connection gets overwhelmed.
      clients = [
          Client(host='ip'),
          Client(host='ip'),
          Client(host='ip'),
          Client(host='ip')
      ]


      # Insert in batches: testing showed that single-row concurrent inserts
      # perform poorly, sustaining only 2-5 INSERTs per second.
      def task(i):
          sql = ("INSERT INTO ck_table (id, feild1, feild2, feild3, feild4, feild5, "
                 "feild6, feild7, feild8, feild9, feild10, feild11, feild12, feild13, "
                 "feild14, feild15, feild16, feild17, feild18, feild19, feild20) VALUES")
          values = []
          for n in range(1000):  # 1,000 rows per batch
              row = [random.randint(1, 10000000),
                     "feild1-" + str(random.randint(1, 10000000))]
              row += ["feild%d-%d" % (k, n) for k in range(2, 21)]
              values.append(tuple(row))
          clid = random.randint(0, len(clients) - 1)  # pick a connection at random
          clients[clid].execute(sql, values)
          return "client %d inserted batch %d successfully" % (clid, i)


      if __name__ == '__main__':
          print("program started")
          executor = ThreadPoolExecutor(max_workers=2)
          start_time = time.perf_counter()
          for j in range(4000000):  # total number of batches to submit
              executor.submit(task, j)
          executor.shutdown(wait=True)  # wait for all inserts to finish
          print("elapsed", time.perf_counter() - start_time, "s")
      
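    One caveat in the script above: the submit loop queues all 4,000,000 futures in the executor at once, which itself consumes memory. A sketch of one way to cap the number of in-flight batches, reusing task() from above (max_in_flight is an illustrative knob, not a tested value):

      from concurrent.futures import ThreadPoolExecutor, as_completed

      def run_bounded(total_batches, max_in_flight=8):
          with ThreadPoolExecutor(max_workers=2) as pool:
              pending = set()
              for j in range(total_batches):
                  pending.add(pool.submit(task, j))
                  if len(pending) >= max_in_flight:
                      # Block until at least one batch finishes before submitting more.
                      done = next(as_completed(pending))
                      pending.discard(done)
              for fut in as_completed(pending):
                  fut.result()  # re-raise any insert errors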
    3. Python3 Query Test
    • Key libraries: clickhouse_driver, concurrent.futures

    • Code:

      import time
      from concurrent.futures import ThreadPoolExecutor, as_completed
      from clickhouse_driver import Client

      client = Client(host='10.10.16.110')

      # Example point query used in the latency tests; not run in the loop below.
      query_sql = """select * from ck_table where feild2='feild2-1009' """


      def new_task(i):
          count_sql = """select count(*) from ck_table"""
          time.sleep(1)  # throttle to roughly one query per second
          return "task %d result: %s" % (i, client.execute(count_sql))


      if __name__ == '__main__':
          print("program started")
          executor = ThreadPoolExecutor(max_workers=1)
          ress = []
          start_time = time.perf_counter()
          for j in range(1000):
              ress.append(executor.submit(new_task, j))
          for fut in as_completed(ress):
              print("status:", fut.result())
          print("elapsed", time.perf_counter() - start_time, "s")
      
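    The loop above only reports total elapsed time. Since the next section quotes per-query latencies, it can help to time each call individually; a short sketch, assuming the same client and table as above:

      import time

      def timed_queries(n=100):
          latencies = []
          for _ in range(n):
              t0 = time.perf_counter()
              client.execute("SELECT count(*) FROM ck_table")
              latencies.append(time.perf_counter() - t0)
          print("avg %.3fs, max %.3fs" % (sum(latencies) / n, max(latencies)))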
    4. Test Conclusions

    ClickHouse insert/query test on a 21-column table: with up to 2 million rows, CPU usage stays above 100%, peaking at 133.6% and averaging roughly 110%.

    • 1. Frequent inserts are not supported (generally only 1-2 per second), otherwise errors such as dropped connections occur; only batch inserts are viable (the script ran cleanly with 2 worker threads at 1,000 rows per batch, while higher concurrency produced disconnection errors).

    • 2. High-frequency queries are not supported; the official recommendation is to keep QPS under 100, otherwise CPU usage climbs and drives up server load.

    • 3. Query latency:

      • 1-condition WHERE query (Memory engine): 600k rows, 0.33s

      • 5-condition WHERE query (Memory engine): 800k rows, 0.57s

      • 5-condition WHERE query (Memory engine): 1.00M rows, 0.54s

      • 5-condition WHERE query (Memory engine): 1.12M rows, 0.56s

      • 5-condition WHERE query (Memory engine): 2.00M rows, 0.565s

      • 5-condition WHERE query (Memory engine): 5.00M rows, 1.2s (with inserts stopped)

      • 5-condition WHERE query (Memory engine): 5.60M rows, 1.97s (with inserts stopped)

      • 5-condition WHERE query (TinyLog engine): 70M rows, 1m47s

      • 2-condition WHERE query (TinyLog engine): 104.6M rows, 89s

      • 5-condition WHERE query (TinyLog engine): 104.6M rows, 84s

      • 10-condition WHERE query (TinyLog engine): 104.6M rows, 87s

    Note: beyond 4.5M rows, the insert thread and the query thread can no longer run at the same time; slow queries consume a great deal of memory, and 16 GB is not enough. A 5-condition WHERE query still completes, taking 1-2s.

    • (1) Server status at 5M rows (CPU averaged around 320%; of 16 GB RAM, only 500-800 MB stayed free; after stopping writes/queries, CPU returned to normal and free memory settled around 800 MB):

      total   used   free   shared   buff/cache   available
      15G     5.9G   519M   9.2M     9.1G         9.2G

      %CPU    %MEM
      429.5   26.0

    • (2) Server status at 100M rows (38% of the 1 TB disk used in total, of which this dataset is estimated to account for about 6%):

      total   used   free   shared   buff/cache   available
      15G     2.7G   181M   9.2M     12G          12G

      %CPU    %MEM
      103.7   3.6

    Summary:

    • 1. Concurrent high-frequency single-row inserts are not supported; they produce errors and dropped connections that can lose data.
    • 2. High-concurrency queries are not supported; the official recommendation is QPS <= 100, beyond which server load rises and CPU/memory consumption spikes.
    • 3. Hardware requirements are high: at the 100-million-row scale, 16+ CPU cores and 64+ GB of RAM are generally recommended.
    • 4. The upside is fast queries and efficient batch inserts; low-frequency, large-batch inserts are the recommended pattern (see the sketch below).
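    A buffered writer is one way to apply recommendation 4: accumulate rows in memory and flush them to the server as one large INSERT. A minimal sketch with illustrative names and batch size, not a tested implementation:

      from clickhouse_driver import Client

      class BatchWriter:
          """Collects rows and flushes them to ClickHouse in large batches."""

          def __init__(self, host, insert_sql, batch_size=10000):
              self.client = Client(host=host)
              self.insert_sql = insert_sql  # e.g. "INSERT INTO ck_table (...) VALUES"
              self.batch_size = batch_size
              self.buffer = []

          def add(self, row):
              self.buffer.append(row)
              if len(self.buffer) >= self.batch_size:
                  self.flush()

          def flush(self):
              if self.buffer:
                  self.client.execute(self.insert_sql, self.buffer)
                  self.buffer = []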
  • Original article: https://blog.csdn.net/weixin_43563169/article/details/134058827