• Monitoring Pods with missing-container-metrics in Prometheus


    I. Introduction

    By default, Kubernetes uses cAdvisor to collect container metrics. This covers most needs, but a few gaps remain. For example, the following are not collected:

    • OOM kills

    • Container restart counts

    • Container exit codes

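    The missing-container-metrics project fills these gaps in cAdvisor by adding the metrics above, which cluster administrators can use to pinpoint certain failures quickly. For example, suppose a container runs several child processes and one of them is OOM-killed while the container itself keeps running. Without OOM kill monitoring, the administrator will have a hard time locating the fault.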

    II. Installation

    The project provides a Helm chart for installation. First, add the Helm repository:

    helm repo add missing-container-metrics https://draganm.github.io/missing-container-metrics
    

    Download the chart locally, because we need to modify its values.yaml file:

    [root@master-01 addons]# helm pull missing-container-metrics/missing-container-metrics
    [root@master-01 addons]# ls
    blackbox dingtalk harbor_exporter mysql-exporter prometheusalert rules servicemonitor victoriametrics
    blackbox-probe etcd missing-container-metrics-0.1.1.tgz process-exporter redis-exporter scheduler-controller-svc.yaml ssl-exporter
    [root@master-01 addons]# tar xf missing-container-metrics-0.1.1.tgz

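    Configurable options

    Parameter          | Description                                                                        | Default
    -------------------|------------------------------------------------------------------------------------|-------------------------------------------------------------
    image.repository   | Image name                                                                         | dmilhdef/missing-container-metrics
    image.pullPolicy   | Image pull policy                                                                  | IfNotPresent
    image.tag          | Image tag                                                                          | v0.21.0
    imagePullSecrets   | Secrets used to pull the image                                                     | []
    nameOverride       | Overrides the generated chart name. Defaults to .Chart.Name.                       |
    fullnameOverride   | Overrides the generated release name. Defaults to .Release.Name.                   |
    podAnnotations     | Annotations for the Pod                                                            | {"prometheus.io/scrape": "true", "prometheus.io/port": "3001"}
    podSecurityContext | Security context for the pod                                                       |
    securityContext    | Security context for the containers in the pod                                     |
    resources          | CPU/memory resource requests/limits                                                | {}
    useDocker          | Get container information from Docker; set to true if the runtime is docker        | false
    useContainerd      | Get container information from Containerd; set to true if the runtime is containerd | true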

    Here we set useDocker to true in missing-container-metrics/values.yaml, then install the chart:

    [root@master-01 addons]# kubectl create namespace missing-container-metrics
    namespace/missing-container-metrics created
    [root@master-01 addons]# helm install missing-container-metrics missing-container-metrics -n missing-container-metrics
    NAME: missing-container-metrics
    LAST DEPLOYED: Tue Jul 6 10:47:35 2021
    NAMESPACE: missing-container-metrics
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    [root@master-01 addons]# helm -n missing-container-metrics list
    NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
    missing-container-metrics missing-container-metrics 1 2021-07-06 10:47:35.261058822 +0800 CST deployed missing-container-metrics-0.1.1 0.21.0
    ## This cluster has only one node, so the DaemonSet runs just one pod
    [root@master-01 addons]# kubectl get pod -n missing-container-metrics
    NAME READY STATUS RESTARTS AGE
    missing-container-metrics-s9cgk 1/1 Running 0 115s

    The metrics can be viewed on port 3001 of the service, for example:

    [root@master-01 addons]# curl 100.67.79.150:3001/metrics
    # HELP container_last_exit_code Last exit code of the container
    # TYPE container_last_exit_code gauge
    container_last_exit_code{container_id="docker://0133fb5d739ba98b3985bdc7766fa200334bbbf29de9a61f98a463ec00de53de",container_short_id="0133fb5d739b",docker_container_id="0133fb5d739ba98b3985bdc7766fa200334bbbf29de9a61f98a463ec00de53de",image_id="docker-pullable://k8s.gcr.io/pause:3.2",name="k8s_POD_dns-autoscaler-565bf94d6c-dc6v4_kube-system_96437fe8-200c-4845-a7cc-a27790c6c5a7_0",namespace="kube-system",pod="dns-autoscaler-565bf94d6c-dc6v4"} 0
    container_last_exit_code{container_id="docker://0388ba15b0181fead17cfc3606a57aeef0a9b8b73cf3f97eb901565c8aa1702c",container_short_id="0388ba15b018",docker_container_id="0388ba15b0181fead17cfc3606a57aeef0a9b8b73cf3f97eb901565c8aa1702c",image_id="docker-pullable://sha256:e20d2ec0d0ed8ffd693b435af9f2943095a608440e3b845331d6d00344025455",name="k8s_victoriametrics_victoriametrics-0_kube-system_7b381d2c-791b-4e38-8cbb-43485afcb285_0",namespace="kube-system",pod="victoriametrics-0"} 0
    container_last_exit_code{container_id="docker://0400f7e29dab47304f97669cb52b5c7c9310fbb5c156c07d0dc9bfca6b8ee14d",container_short_id="0400f7e29dab",docker_container_id="0400f7e29dab47304f97669cb52b5c7c9310fbb5c156c07d0dc9bfca6b8ee14d",image_id="docker-pullable://k8s.gcr.io/pause:3.2",name="k8s_POD_csi-resizer-f6d66495f-s4vkv_longhorn-system_282278da-2638-4e26-8411-802bf57c1ed8_0",namespace="longhorn-system",pod="csi-resizer-f6d66495f-s4vkv"} 0
    container_last_exit_code{container_id="docker://04e2c60777ce277c62c7137f1d7b40d9c1523bb3edf9127efd357590f39ba79c",container_short_id="04e2c60777ce",docker_container_id="04e2c60777ce277c62c7137f1d7b40d9c1523bb3edf9127efd357590f39ba79c",image_id="docker-pullable://k8s.gcr.io/pause:3.2",name="k8s_POD_kube-state-metrics-859b6bf99-q8tdf_monitoring_529aa188-f7a0-4b5c-9608-cd8fc473ac8c_2",namespace="monitoring",pod="kube-state-metrics-859b6bf99-q8tdf"} 0
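Output in this text format is easy to post-process. As a minimal sketch, the Python snippet below scans such output for containers whose last exit code is non-zero. The function names and the shortened sample (including the `demo` namespace, `demo-pod`, and the exit code 137) are made up for illustration, and the regular expressions only cover simple labelled samples like the ones shown above, not the full exposition format.

```python
import re

# One labelled sample line: metric_name{label="value",...} number
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
# One label pair inside the braces: key="value"
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric, labels, value) for each labelled sample, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        yield name, dict(LABEL_RE.findall(raw_labels)), float(value)


def failed_containers(text):
    """Return (namespace, pod, exit_code) for samples with a non-zero last exit code."""
    return [
        (labels.get("namespace", ""), labels.get("pod", ""), int(value))
        for name, labels, value in parse_samples(text)
        if name == "container_last_exit_code" and value != 0
    ]


# Hypothetical, shortened sample in the same shape as the curl output above.
EXAMPLE = '''
# HELP container_last_exit_code Last exit code of the container
# TYPE container_last_exit_code gauge
container_last_exit_code{container_short_id="0133fb5d739b",namespace="kube-system",pod="dns-autoscaler-565bf94d6c-dc6v4"} 0
container_last_exit_code{container_short_id="aaaaaaaaaaaa",namespace="demo",pod="demo-pod"} 137
'''

if __name__ == "__main__":
    print(failed_containers(EXAMPLE))
```

In day-to-day use the same question is better asked of Prometheus itself once the metrics are scraped; the sketch is only meant to show what the raw samples contain.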

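    The service exposes the following metrics:

    • container_restarts: the number of container restarts.

    • container_ooms: the number of OOM kills in the container. This covers the OOM kill of any process in the container's cgroup.

    • container_last_exit_code: the last exit code of the container.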
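When reading container_last_exit_code values, it can help to translate them into something human-readable. The sketch below assumes the common convention that an exit code above 128 means the process was terminated by signal number (code - 128), so 137 corresponds to SIGKILL, which is what an OOM kill typically looks like. The function name `describe_exit_code` is hypothetical and not part of missing-container-metrics.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code, assuming the 128 + signal-number convention."""
    if code == 0:
        return "exited successfully"
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name, e.g. 9 -> SIGKILL.
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with error code {code}"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, describe_exit_code(code))
```

The signal names come from the platform's own signal table, so the lookup reflects the machine running the script rather than the node where the container ran.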

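    Each metric carries the following labels:

    • docker_container_id: the full ID of the container.

    • container_short_id: the first 6 bytes of the Docker container ID (12 hexadecimal characters).

    • container_id: the container ID in the same format used by the Kubernetes pod metrics, prefixed with the container runtime (docker:// or containerd://, depending on the runtime). This makes it easy to join with the kube_pod_container_info metric in Prometheus.

    • name: the name of the container.

    • image_id: the image ID in the same format used by the Kubernetes pod metrics. This makes it easy to join with the kube_pod_container_info metric in Prometheus.

    • pod: if the io.kubernetes.pod.name label is set on the container, its value is used as the pod label of the metric.

    • namespace: if the io.kubernetes.pod.namespace label is set on the container, its value is used as the namespace label of the metric.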
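The relationship between the three ID labels can be read off the sample output earlier in the article. As a rough sketch (not the exporter's actual implementation; the function name `derive_id_labels` is made up), the following Python code rebuilds container_short_id and container_id from a full container ID:

```python
def derive_id_labels(full_id: str, runtime: str = "docker") -> dict:
    """Rebuild the ID-related labels from a full container ID.

    runtime is the prefix used in container_id: "docker" or "containerd".
    """
    if runtime not in ("docker", "containerd"):
        raise ValueError(f"unsupported runtime: {runtime}")
    return {
        "docker_container_id": full_id,
        # 6 bytes correspond to 12 hexadecimal characters.
        "container_short_id": full_id[:12],
        # Same shape as the container_id label on kube_pod_container_info.
        "container_id": f"{runtime}://{full_id}",
    }


if __name__ == "__main__":
    # Full ID taken from the first sample line of the curl output.
    labels = derive_id_labels(
        "0133fb5d739ba98b3985bdc7766fa200334bbbf29de9a61f98a463ec00de53de"
    )
    for key, value in labels.items():
        print(f"{key}={value}")
```

Because container_id has the same shape on both sides, it is the natural key when joining these metrics with kube_pod_container_info.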

    III. Adding a PodMonitor and PrometheusRule (for Prometheus Operator)

    Create the file podmonitor.yaml in the chart's templates directory:

    {{ if .Values.prometheusOperator.podMonitor.enabled }}
    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: {{ include "missing-container-metrics.fullname" . }}
      {{- with .Values.prometheusOperator.podMonitor.namespace }}
      namespace: {{ . }}
      {{- end }}
      labels:
        {{- include "missing-container-metrics.labels" . | nindent 4 }}
        {{- with .Values.prometheusOperator.podMonitor.selector }}
        {{- toYaml . | nindent 4 }}
        {{- end }}
    spec:
      selector:
        matchLabels:
          {{- include "missing-container-metrics.selectorLabels" . | nindent 6 }}
      podMetricsEndpoints:
      - port: http
      namespaceSelector:
        matchNames:
          - {{ .Release.Namespace }}
    {{ end }}

    Create the file prometheusrule.yaml in the chart's templates directory:

    {{ if .Values.prometheusOperator.prometheusRule.enabled }}
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: {{ include "missing-container-metrics.fullname" . }}
      {{- with .Values.prometheusOperator.prometheusRule.namespace }}
      namespace: {{ . }}
      {{- end }}
      labels:
        {{- include "missing-container-metrics.labels" . | nindent 4 }}
        {{- with .Values.prometheusOperator.prometheusRule.selector }}
        {{- toYaml . | nindent 4 }}
        {{- end }}
    spec:
      groups:
      - name: {{ include "missing-container-metrics.fullname" . }}
        rules:
          {{- toYaml .Values.prometheusOperator.prometheusRule.rules | nindent 6 }}
    {{ end }}

    Edit values.yaml and add the following:

    useDocker: true
    useContainerd: false
    ### Added below
    prometheusOperator:
      podMonitor:
        # Create a Prometheus Operator PodMonitor resource
        enabled: true
        # Namespace defaults to the Release namespace but can be overridden
        namespace: ""
        # Additional labels to add to the PodMonitor so it matches the Operator's podMonitorSelector
        selector:
          app.kubernetes.io/name: missing-container-metrics
      prometheusRule:
        # Create a Prometheus Operator PrometheusRule resource
        enabled: true
        # Namespace defaults to the Release namespace but can be overridden
        namespace: ""
        # Additional labels to add to the PrometheusRule so it matches the Operator's ruleSelector
        selector:
          prometheus: k8s
          role: alert-rules
        # The rules can be set here. An example is defined here but can be overridden.
        rules:
          - alert: ContainerOOMObserved
            annotations:
              message: A process in this Pod has been OOMKilled due to exceeding the Kubernetes memory limit more than twice in the last 15 minutes. Look at the metrics to determine if a memory limit increase is required.
            expr: sum(increase(container_ooms[15m])) by (exported_namespace, exported_pod) > 2
            labels:
              severity: warning
          - alert: ContainerOOMObserved
            annotations:
              message: A process in this Pod has been OOMKilled due to exceeding the Kubernetes memory limit more than ten times in the last 15 minutes. Look at the metrics to determine if a memory limit increase is required.
            expr: sum(increase(container_ooms[15m])) by (exported_namespace, exported_pod) > 10
            labels:
              severity: critical
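The two rules share one expression and differ only in their threshold, so a pod with many OOM kills matches both of them. The Python sketch below mimics that threshold logic on already-aggregated values, i.e. the result of sum(increase(container_ooms[15m])) per namespace and pod. The function name `firing_alerts` and the input numbers are invented for illustration; Prometheus performs the real evaluation.

```python
# Thresholds copied from the two example rules: warning when > 2, critical when > 10.
THRESHOLDS = {"warning": 2, "critical": 10}


def firing_alerts(oom_increase_by_pod):
    """Return (namespace, pod, severity) for every rule whose threshold is exceeded.

    oom_increase_by_pod maps (namespace, pod) to the OOM kill increase
    observed over the last 15 minutes.
    """
    alerts = []
    for (namespace, pod), increase in sorted(oom_increase_by_pod.items()):
        for severity, threshold in THRESHOLDS.items():
            # Strictly greater than, matching the ">" in the rule expressions.
            if increase > threshold:
                alerts.append((namespace, pod, severity))
    return alerts


if __name__ == "__main__":
    # Hypothetical aggregated values for three pods.
    observed = {("demo", "a"): 2, ("demo", "b"): 3, ("demo", "c"): 11}
    for alert in firing_alerts(observed):
        print(alert)
```

With these inputs, pod a stays silent because the comparison is strict, pod b raises only the warning, and pod c raises both the warning and the critical alert.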

    Upgrade the release with the following command:

    [root@master-01 addons]# helm upgrade missing-container-metrics -n missing-container-metrics missing-container-metrics/
    Release "missing-container-metrics" has been upgraded. Happy Helming!
    NAME: missing-container-metrics
    LAST DEPLOYED: Tue Jul 6 11:36:02 2021
    NAMESPACE: missing-container-metrics
    STATUS: deployed
    REVISION: 2
    TEST SUITE: None

    After the upgrade, the PodMonitor and PrometheusRule resources are created:

    [root@master-01 addons]# kubectl get prometheusrules.monitoring.coreos.com -n missing-container-metrics
    NAME AGE
    missing-container-metrics 15s
    [root@master-01 addons]# kubectl get podmonitors.monitoring.coreos.com -n missing-container-metrics
    NAME AGE
    missing-container-metrics 35s

    The corresponding targets and rules are then visible in the Prometheus UI.

  • Original article: https://blog.csdn.net/zfw_666666/article/details/126747143