Integrating Prometheus with AlertManager for Alerting

Prometheus Server Configuration

Write an alerting rule file in YAML format (it is loaded below as accountcenter.yml).

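groups:
- name: Account Center
  rules:
  # Certificate status alert
  - alert: AccountCenterMetricStatusAlert
    expr: ssl_expire_days == 0
    for: 0s
    labels:
      severity: "1"
    annotations:
      instance: "Account Center instance {{$labels.instance}} metric alert"
      description: "Account Center instance {{$labels.instance}}: domain certificate days remaining: {{$value}}"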

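This YAML file defines the alerting rule that Prometheus evaluates. Because for is 0s, the alert fires on the first evaluation in which ssl_expire_days equals 0.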

Edit prometheus.yml to set the AlertManager address and the alerting rule file.

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "accountcenter.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=` to any timeseries scraped from this config.
  - job_name: "nodeExporter"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["192.168.240.130:9100"] # port exposed on the monitored host
  - job_name: "springboot"
    scrape_interval: 3s                    # how often to scrape
    scrape_timeout: 3s                     # timeout for each scrape
    metrics_path: '/actuator/prometheus'   # path to scrape
    static_configs:                        # address of the server running the Spring Boot application
      - targets: ["192.168.1.103:8188"]


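Reload the Prometheus configuration file. The /-/reload endpoint is only enabled when Prometheus is started with the --web.enable-lifecycle flag.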

curl -X POST http://127.0.0.1:9090/-/reload
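
If the reload has to be triggered from application code rather than from a shell, the same request can be sketched with the JDK's built-in HTTP client (Java 11+). The class name PrometheusReload and its method names are made up for this illustration; the base URL is the one from the curl command above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of Prometheus or any library.
class PrometheusReload {

    // Builds the equivalent of: curl -X POST <baseUrl>/-/reload
    static HttpRequest buildReloadRequest(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/-/reload"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Sends the request and returns the HTTP status code.
    static int reload(String baseUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildReloadRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

Calling PrometheusReload.reload("http://127.0.0.1:9090") performs the reload; any status code other than 200 suggests that the configuration was rejected or the lifecycle endpoint is disabled.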

SpringBoot - WebHook

Write a Spring Boot controller that AlertManager calls back when an alert fires.

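import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/alertmanager")
public class AlertManagerWebHooks {

    // AlertManager delivers the alert as a JSON body in an HTTP POST request
    @PostMapping("/hook")
    public Object hook(@RequestBody String body){
        System.out.println("Received alert: " + body);
        System.out.println("Sending alert to the database...");
        return "success";
    }

}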
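
The controller above receives the payload as a raw string. As a rough sketch of how it might be handled in typed form, the classes below model a subset of the JSON fields that AlertManager's webhook sends (status, receiver, groupLabels, alerts with their labels, annotations, startsAt and endsAt). The class names WebhookAlert and WebhookPayload and the summarize() helper are invented for this example. Binding them with @RequestBody assumes Spring Boot's default Jackson setup, which ignores the JSON fields not declared here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// One entry of the "alerts" array in the webhook payload (hypothetical class name).
class WebhookAlert {
    public String status;                   // "firing" or "resolved"
    public Map<String, String> labels;      // includes "alertname" and rule labels such as "severity"
    public Map<String, String> annotations; // the annotations defined in the rule file
    public String startsAt;                 // RFC 3339 timestamp
    public String endsAt;                   // RFC 3339 timestamp
}

// Top-level webhook payload; only a subset of the fields is modelled.
class WebhookPayload {
    public String status;
    public String receiver;
    public Map<String, String> groupLabels;
    public List<WebhookAlert> alerts;

    // Builds one line per alert, for example to log it or store it in a database.
    public List<String> summarize() {
        List<String> lines = new ArrayList<>();
        if (alerts == null) {
            return lines;
        }
        for (WebhookAlert alert : alerts) {
            String name = alert.labels == null ? "" : alert.labels.getOrDefault("alertname", "");
            String description = alert.annotations == null
                    ? "" : alert.annotations.getOrDefault("description", "");
            lines.add("[" + alert.status + "] " + name + ": " + description);
        }
        return lines;
    }
}
```

With these classes, the hook method could declare its parameter as @RequestBody WebhookPayload payload and iterate over payload.summarize() instead of printing the raw body.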

AlertManager Configuration

Edit alertmanager.yml and add the webhook configuration.

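# Global configuration: the resolve timeout, SMTP settings, API addresses of the notification channels, and so on.
global:
  # How long to wait before an alert that is no longer being updated is declared resolved
  resolve_timeout: 5m
# Routing configuration: defines how alerts are dispatched. It is a tree that is matched depth-first, from left to right.
route:
  # Labels used to group incoming alerts.
  # Alerts that share the same values for the labels listed in group_by are merged into a single notification to the receiver.
  group_by: ['alertname']
  # How long to wait before sending the first notification for a new group
  group_wait: 1s
  # How long to wait before notifying about new alerts added to a group that has already been notified
  group_interval: 1s
  # How long to wait before re-sending a notification that was already sent; usually about 3 hours or more (5s here only suits a demo)
  repeat_interval: 5s
  # Receiver name
  receiver: 'web.hook'
# Receivers: where notifications are delivered, for example email, wechat, slack, or webhook
receivers:
  # Receiver name
  - name: 'web.hook'
    # webhook URL
    webhook_configs:
      - url: 'http://192.168.1.103:8188/alertmanager/hook'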
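
To make the group_by setting in the route above more concrete, here is a simplified model of the grouping step: alerts whose group_by labels have the same values end up in the same group, and each group results in one notification. This is only an illustration of the idea, not AlertManager's actual implementation; the class AlertGrouping and its method are hypothetical, and each alert is reduced to its label map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of grouping alerts by the labels named in group_by.
class AlertGrouping {

    // Returns the alerts (as label maps) bucketed by the values of the groupBy labels.
    static Map<List<String>, List<Map<String, String>>> group(
            List<Map<String, String>> alerts, List<String> groupBy) {
        Map<List<String>, List<Map<String, String>>> groups = new LinkedHashMap<>();
        for (Map<String, String> labels : alerts) {
            // The group key is the list of values of the group_by labels;
            // a label that is missing on the alert counts as an empty value.
            List<String> key = new ArrayList<>();
            for (String name : groupBy) {
                key.add(labels.getOrDefault(name, ""));
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(labels);
        }
        return groups;
    }
}
```

With group_by: ['alertname'], two alerts that have the same alertname but different instance labels fall into one group and are delivered to the webhook in a single request; adding instance to group_by would split them into separate notifications.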


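The receivers section configures where alerts are delivered: email, WeCom (企业微信), or custom webhooks. A webhook is simply an HTTP endpoint; when AlertManager dispatches an alert, it automatically calls the configured URL.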

Alerting Flow

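Prometheus periodically evaluates the configured alerting rules. When a rule's PromQL expression is satisfied, the alert waits for the duration given by the rule's for setting; if the condition still holds after that time, the alert fires and Prometheus sends it to AlertManager.
AlertManager then groups, inhibits, and deduplicates the incoming alerts.
Finally, it notifies the configured receiver, for example WebHooks, WeCom, or DingTalk.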
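
The waiting behaviour controlled by for can be pictured as a small state machine. The sketch below is a simplified model, not Prometheus source code: an alert is inactive while the expression is false, pending once it becomes true, and firing once it has stayed true for at least the for duration. The class AlertRuleState and its members are names chosen for this example.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical, simplified model of one alert's lifecycle under a `for` clause.
class AlertRuleState {

    enum State { INACTIVE, PENDING, FIRING }

    private final Duration forDuration;
    private Instant activeSince;          // null while the condition is false
    private State state = State.INACTIVE;

    AlertRuleState(Duration forDuration) {
        this.forDuration = forDuration;
    }

    // Called once per evaluation_interval with the result of the rule expression.
    State evaluate(boolean conditionHolds, Instant now) {
        if (!conditionHolds) {
            // The condition stopped holding: reset, so the wait starts over next time.
            activeSince = null;
            state = State.INACTIVE;
            return state;
        }
        if (activeSince == null) {
            activeSince = now;
        }
        boolean heldLongEnough =
                Duration.between(activeSince, now).compareTo(forDuration) >= 0;
        state = heldLongEnough ? State.FIRING : State.PENDING;
        return state;
    }
}
```

With for: 0s, as in the rule file in this article, the first evaluation in which the expression is true already yields FIRING. With a longer value such as 30s and a 15s evaluation interval, the alert would stay PENDING for two evaluations and fire on the third.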

Original article: https://blog.csdn.net/qq_43750656/article/details/133271399