Advanced Impala

1. Impala load balancing

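Impala has three main components: statestore, catalog, and impalad. Every impalad node can accept client queries, and the impalad a client connects to also acts as the Coordinator for that query, which costs it extra memory and CPU. To keep resource usage even across nodes, the impalad nodes in the cluster should sit behind a load balancer. There are two common options: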

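HAProxy, the proxy that Cloudera officially recommends
DNS-based load balancing
DNS-based load balancing is the simplest option, but it performs only moderately well, so here we follow the official recommendation and use HAProxy.
In production, install HAProxy on a node that does not run impalad.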

HAProxy setup

(1) Install HAProxy

yum install haproxy -y

(2) Edit the configuration file

vim /etc/haproxy/haproxy.cfg

(3) Configuration contents


#---------------------------------------------------------------------
# Example configuration for a possible web application. See the
# full configuration options online.
#
# http://haproxy.1wt.eu/download/1.4/doc/configuration.txt
#
#---------------------------------------------------------------------
 
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
     log 127.0.0.1 local2
     chroot /var/lib/haproxy
     pidfile /var/run/haproxy.pid
     maxconn 4000
     user haproxy
     group haproxy
     daemon
 
    # turn on stats unix socket
     stats socket /var/lib/haproxy/stats
 
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
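     mode http           # mode { tcp|http|health }: tcp is layer 4, http is layer 7, health only answers health checks
     log                 global
     option              httplog
     option              dontlognull
     #option             http-server-close
     #option             forwardfor except 127.0.0.0/8
     #option             abortonclose   # drop queued requests whose client has already disconnected
     option              redispatch     # if a server fails, redispatch the session to another server
     retries             3              # retry a failed connection to a server up to 3 times
     timeout             http-request 10s
     timeout  queue      1m
     #timeout connect    10s
     #timeout client     1m
     #timeout server     1m
     timeout connect     1d   # connection timeout; important, it lets long-running queries still return their results
     timeout client      1d   # same as above
     timeout server      1d   # same as above
     timeout http-keep-alive 10s
     timeout check 10s   # health check timeout
     maxconn 3000   # maximum number of connections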
 
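listen status # admin (stats) page
     bind 0.0.0.0:1080 # IP and port of the admin page
     mode http # protocol used by the admin page
     option httplog
     maxconn 5000 # maximum number of connections
     stats refresh 30s # auto-refresh every 30 seconds
     stats uri /stats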
 
listen impalashell
     bind 0.0.0.0:25003 # IP and port HAProxy binds to as the proxy
     mode tcp # proxy at layer 4; important
     option tcplog
     balance roundrobin  # scheduling algorithm: 'leastconn' picks the server with the fewest connections, 'roundrobin' takes the servers in turn
     server impalashell_1 linux121:21000 check
     server impalashell_2 linux122:21000 check
     server impalashell_3 linux123:21000 check
 
listen impalajdbc
     bind 0.0.0.0:25004 # IP and port HAProxy binds to as the proxy
     mode tcp # proxy at layer 4; important
     option tcplog
     balance roundrobin # scheduling algorithm: 'leastconn' picks the server with the fewest connections, 'roundrobin' takes the servers in turn
     server impalajdbc_1 linux121:21050 check
     server impalajdbc_2 linux122:21050 check
     server impalajdbc_3 linux123:21050 check
 
#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
frontend main *:5000
     acl url_static path_beg -i /static /images /javascript /stylesheets
     acl url_static path_end -i .jpg .gif .png .css .js
     use_backend static if url_static
     default_backend app
 
#---------------------------------------------------------------------
# static backend for serving up images, stylesheets and such
#---------------------------------------------------------------------
backend static
     balance roundrobin
     server static 127.0.0.1:4331 check
 
#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
backend app
     balance roundrobin
     server app1 127.0.0.1:5001 check
     server app2 127.0.0.1:5002 check
     server app3 127.0.0.1:5003 check
     server app4 127.0.0.1:5004 check
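
Before starting the service it can help to sanity-check the file. The sketch below is only one way to do that: `check_haproxy_cfg` and `list_servers` are hypothetical helper names, and the default path is the one used in this article. `haproxy -c -f` parses the configuration without starting the proxy, and the `awk` helper prints every `server` line next to its section so that a mistyped or duplicated address stands out.

```shell
#!/usr/bin/env bash

# Syntax-check an HAProxy configuration file without starting the proxy.
check_haproxy_cfg() {
    local cfg="${1:-/etc/haproxy/haproxy.cfg}"
    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg" >&2
        return 2
    fi
    # -c: check mode (parse only), -f: configuration file to load
    haproxy -c -f "$cfg"
}

# Print "section server address" for every server line in the file,
# so each listen/backend block can be reviewed at a glance.
list_servers() {
    awk '
        $1 == "listen" || $1 == "backend" { section = $2 }
        $1 == "server"                    { print section, $2, $3 }
    ' "${1:-/etc/haproxy/haproxy.cfg}"
}
```

Typical use would be `check_haproxy_cfg && list_servers`, run on the HAProxy node after every edit of the file.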

(4) Start and stop

Start: service haproxy start

Stop: service haproxy stop

Restart: service haproxy restart
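
After a start or restart, a quick check that the proxy is really listening can save some head-scratching. The following is a minimal sketch, assuming HAProxy runs on linux123 and uses the three ports from the configuration above (1080 for the admin page, 25003 and 25004 for the two proxies). `port_open` and `check_proxy_ports` are hypothetical helper names; the check relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command.

```shell
#!/usr/bin/env bash

# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
port_open() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Report the state of the admin page port and the two proxy ports.
check_proxy_ports() {
    local host="${1:-linux123}" port
    for port in 1080 25003 25004; do
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port closed"
        fi
    done
}
```

Running `check_proxy_ports linux123` prints one line per port. A closed port usually means the service did not start or the `bind` line differs from what is assumed here.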

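(5) Usage

Access from impala-shell:

impala-shell -i linux123:25003

This is very convenient: compared with connecting to an impalad directly, only the host and port change, and everything else stays the same.

Access over JDBC:

jdbc:hive2://linux123:25004/default;auth=noSasl

Give the Impala cluster as much memory as you can. If memory falls short of what a query needs, Impala is likely to fail with an error.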
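
To see whether connections are actually being spread across the impalad nodes, the admin page defined in the `listen status` block can be read as CSV by appending `;csv` to the stats URI. The sketch below assumes HAProxy runs on linux123 with the stats page on port 1080 as configured above; `fetch_stats` and `summarize_stats` are hypothetical helper names. Columns are looked up by header name rather than by position, so the script does not depend on the exact column order of a given HAProxy version.

```shell
#!/usr/bin/env bash

# Download the HAProxy statistics in CSV form.
fetch_stats() {
    local host="${1:-linux123}"
    curl -s "http://$host:1080/stats;csv"
}

# Read stats CSV on stdin and print, for every Impala backend server:
#   proxy name, server name, health status, total sessions handled
summarize_stats() {
    awk -F, '
        NR == 1 {
            sub(/^# /, "", $1)          # the header line starts with "# pxname"
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["pxname"]) ~ /^impala/ && $(col["svname"]) !~ /^(FRONTEND|BACKEND)$/ {
            print $(col["pxname"]), $(col["svname"]), $(col["status"]), $(col["stot"])
        }
    '
}
```

With `fetch_stats | summarize_stats`, a healthy round-robin setup should show every server as UP with session totals that stay close to each other.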

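2. Impala optimization

File format: for large data volumes, Parquet is the best choice.
Avoid small files: INSERT ... VALUES produces many small files, so avoid it.
Sensible partition granularity: partitioning lets a query skip irrelevant data and run faster. Keeping the number of partitions below 30,000 is usually recommended, since too many partitions also slow down metadata management.
Prefer integer types for partition columns: a partition column may be a string, since its values end up as HDFS directory names either way, but an integer partition column uses less memory.
Collect table statistics: when performance matters or a query touches a lot of data, first collect statistics for the tables involved (for example, run COMPUTE STATS).
Reduce the amount of data sent to the client:
- Aggregate (for example count, sum, max).
- Filter (for example WHERE).
- Use LIMIT to cap the number of rows returned.
- Do not pretty-print results (when showing results through impala-shell, add the options -B and --output_delimiter).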
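
The points about returning less data can be combined in a single impala-shell call. This is a minimal sketch, assuming the proxy address linux123:25003 from the first part of this article. `export_query` is a hypothetical helper, the `sales` table in the usage line is made up, and the `IMPALA_SHELL` variable is an assumption added only so the command can be swapped out for a dry run.

```shell
#!/usr/bin/env bash

# Run one query through the proxy and write plain delimited output to a file.
#   $1 = SQL text, $2 = output file, $3 = proxy address (optional)
export_query() {
    local query="$1" outfile="$2" proxy="${3:-linux123:25003}"
    local bin="${IMPALA_SHELL:-impala-shell}"   # hypothetical override hook
    # -B: delimited output instead of the pretty-printed table
    # --output_delimiter: field separator used together with -B
    # -q: query to run, -o: file that receives the result
    "$bin" -i "$proxy" -B --output_delimiter=',' -q "$query" -o "$outfile"
}
```

A call such as `export_query "SELECT region, count(*) FROM sales WHERE year = 2022 GROUP BY region LIMIT 100" /tmp/sales.csv` aggregates and filters on the server, caps the row count, and skips the table formatting on the client.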
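Run EXPLAIN before executing a query to view the plan and analyze how the query will run.
Impala optimizes joins automatically using the statistics that COMPUTE STATS collects for every table in the join. It then chooses the plan based on table sizes, the number of distinct values per column, and so on. To keep those statistics accurate, it is best to run COMPUTE STATS again whenever a table's data changes (for example after INSERT, ADD PARTITION, or DROP PARTITION).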
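
Several of the tips in this section can be strung together into one routine: load data with a bulk statement into a Parquet table, refresh the statistics, then inspect the plan. The sketch below assumes the proxy address linux123:25003. `run_sql` and `tune_table` are hypothetical helpers, the table names in the usage line are made up, and `IMPALA_SHELL` and `PROXY` are assumptions added so the commands can be redirected for a dry run. Since the impala-shell documentation describes -q as taking a single statement, each statement gets its own call.

```shell
#!/usr/bin/env bash

IMPALA_SHELL="${IMPALA_SHELL:-impala-shell}"   # hypothetical override hook
PROXY="${PROXY:-linux123:25003}"               # HAProxy address assumed from this article

# Run exactly one SQL statement through the proxy.
run_sql() {
    "$IMPALA_SHELL" -i "$PROXY" -q "$1"
}

# $1 = source table, $2 = new Parquet table, $3 = query to inspect
tune_table() {
    local src="$1" dst="$2" query="$3"
    # Bulk copy into Parquet instead of row-by-row INSERT ... VALUES
    run_sql "CREATE TABLE $dst STORED AS PARQUET AS SELECT * FROM $src" &&
    # Collect table and column statistics for the planner
    run_sql "COMPUTE STATS $dst" &&
    # Show the collected row counts and sizes
    run_sql "SHOW TABLE STATS $dst" &&
    # Print the plan without running the query
    run_sql "EXPLAIN $query"
}
```

For example, `tune_table raw_events events_parquet "SELECT count(*) FROM events_parquet"` stops at the first failing step, so a missing source table never reaches the COMPUTE STATS call.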

Original article: https://blog.csdn.net/weixin_52851967/article/details/127545927