1. Version requirements
k8s cluster version | kube-prometheus version | Deployment mode |
---|---|---|
v1.18 | <=v0.6.0 | Single-node centralized deployment |
2. Minimal installation notes
Service | Deployed | Replicas | Workload type |
---|---|---|---|
alertmanager-main | Yes | 1 | statefulset |
kube-state-metrics | Yes | 1 | deployment |
node-exporter | Yes | 1 | daemonset |
prometheus-adapter | Yes | 1 | deployment |
prometheus-operator | Yes | 1 | deployment |
grafana | Yes | 1 | deployment |
prometheus-k8s | Yes | 1 | statefulset |
blackbox-exporter | No | - | deployment |
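For the two operator-managed components, the replica count is set on the custom resource rather than on the generated statefulset. A minimal sketch, assuming the stock kube-prometheus manifests in which the Prometheus resource is named `k8s` and the Alertmanager resource is named `main`; the file names in the comments are assumptions, and all other fields of those manifests are omitted and left as shipped:

```yaml
# Fragment of the Prometheus custom resource
# (assumed file: prometheus-prometheus.yaml)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 1   # a single prometheus-k8s pod
---
# Fragment of the Alertmanager custom resource
# (assumed file: alertmanager-alertmanager.yaml)
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 1   # a single alertmanager-main pod
```

The operator reconciles the statefulsets from these resources, so editing the statefulset directly would be reverted.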
3. Alerting module configuration (alertmanager-secret.yaml)
```yaml
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    # A firing critical alert suppresses warning/info alerts
    # that share the same namespace and alertname.
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    # A firing warning alert suppresses the matching info alerts.
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "simplecloud"
      "webhook_configs":
      - "url": "http://xxx:8554/notifications"
        "http_config":
          "bearer_token": "xxx"
    - "name": "Watchdog"
    - "name": "Critical"
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      # Placeholder: must be the name of one of the receivers defined above.
      "receiver": "xxx"
      "repeat_interval": "12h"
      "routes":
      - "match":
          "alertname": "Watchdog"
        "receiver": "Watchdog"
      - "match":
          "severity": "critical"
        "repeat_interval": "1h"
        "receiver": "Critical"
      # Routes without a receiver inherit the root receiver.
      - "match":
          "severity": "warning"
        "repeat_interval": "1d"
      - "match":
          "severity": "info"
        "repeat_interval": "7d"
type: Opaque
```
4. Alerting rule configuration (prometheus-rules.yaml)
```yaml
- name: Pod状态异常  # "Pod status abnormal"
  rules:
  - alert: Pod状态异常
    annotations:
      description: The pod {{ $labels.pod }} in namespace {{ $labels.namespace }} was unavailable.
      summary: Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is unavailable.
    # Fires when the pod stayed in Pending, Unknown or Failed for the whole 5m window.
    expr: min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0
    for: 2m
    labels:
      severity: critical
- name: Deployment可用副本状态异常  # "Deployment available replicas abnormal"
  rules:
  - alert: 工作负载可用副本数异常  # "Workload available replica count abnormal"
    annotations:
      description: The pods of {{ $labels.deployment}} are unavailable.
      summary: The status of {{ $labels.deployment}} pods is abnormal
    # Desired replicas differ from the replicas that are actually available.
    expr: kube_deployment_spec_replicas{} != kube_deployment_status_replicas_available{}
    for: 2m
    labels:
      severity: critical
- name: Pod启动失败  # "Pod failed to start"
  rules:
  - alert: 5分钟内Pod重启累计3次以上  # "Pod restarted more than 3 times within 5 minutes"
    annotations:
      description: The Pod {{ $labels.namespace }}/{{ $labels.pod }} has failed to start.
      summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} failed to start
    # Sums the per-minute restart increase over the last 5 minutes.
    expr: sum_over_time(increase(kube_pod_container_status_restarts_total{}[1m])[5m:1m]) > 3
    for: 5m
    labels:
      severity: critical
```
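The rule groups above are a fragment: in kube-prometheus they are loaded through a `PrometheusRule` resource, under `spec.groups`. A minimal sketch of that wrapper, assuming the stock resource name `prometheus-k8s-rules` and the `prometheus: k8s` / `role: alert-rules` labels that the default Prometheus rule selector matches; only one of the groups is repeated here for brevity:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-k8s-rules      # assumed: name used by the stock manifest
  namespace: monitoring
  labels:
    prometheus: k8s               # assumed: matched by the Prometheus ruleSelector
    role: alert-rules
spec:
  groups:
  # Custom groups are appended to this list.
  - name: Deployment可用副本状态异常
    rules:
    - alert: 工作负载可用副本数异常
      annotations:
        description: The pods of {{ $labels.deployment}} are unavailable.
        summary: The status of {{ $labels.deployment}} pods is abnormal
      expr: kube_deployment_spec_replicas{} != kube_deployment_status_replicas_available{}
      for: 2m
      labels:
        severity: critical
```

If the labels do not match the rule selector of the Prometheus resource, the operator ignores the rules without reporting an error, so that is the first thing to check when an alert never appears.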
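For more customized alerting rules, refer to Alibaba Cloud's alerting configuration. Links to other vendors' sites get blocked on this platform, so if you need it, send me a private message below the article.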
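5. Custom label configuration for common k8s metrics
Add the following `metricRelabelings` snippet to every xxx-serviceMonitor.yaml in the original manifests (shown here in the Prometheus ServiceMonitor):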
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: prometheus
  name: prometheus
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: web
    # Each entry writes a static value into targetLabel on every scraped sample.
    metricRelabelings:
    - sourceLabels: []
      targetLabel: env
      replacement: '测试'  # "test"
    - sourceLabels: []
      targetLabel: cluster
      replacement: '华南1b测试'  # "South China 1b test"
    - replacement: k8s-test
      sourceLabels: []
      targetLabel: type
    - replacement: huanan1b-sc-test
      sourceLabels: []
      targetLabel: from
    - replacement: prometheus-k8s-0
      sourceLabels: []
      targetLabel: prometheus_replica
  selector:
    matchLabels:
      prometheus: k8s
```
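Only the `metricRelabelings` list has to be copied into the other ServiceMonitors; the rest of each manifest stays as shipped. A minimal sketch of that reusable fragment, assuming the default relabel behavior (action `replace`, regex `(.*)`), under which an entry without source labels simply writes the static `replacement` into `targetLabel`. The `sourceLabels: []` lines are left out on that assumption, and the label values are the ones from the example above:

```yaml
# Goes under each item of spec.endpoints in a ServiceMonitor.
metricRelabelings:
- targetLabel: env
  replacement: '测试'
- targetLabel: cluster
  replacement: '华南1b测试'
- targetLabel: type
  replacement: k8s-test
- targetLabel: from
  replacement: huanan1b-sc-test
```

Metric relabeling runs after the scrape and before ingestion, so these labels are stored with every series collected from that endpoint.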
6. Custom label configuration for cadvisor metrics
```yaml
remote_write:
  - url: "http://remote-write-service:9090/api/v1/write"
    write_relabel_configs:
      - source_labels: ["__name__"]
        regex: "my_metric|another_metric|yet_another_metric"
        action: keep
```
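The `keep` rule above controls which series are forwarded by remote write. To attach static labels to the cadvisor series themselves, one option is to reuse the relabeling approach from section 5 on the cadvisor endpoint. A hedged sketch, assuming the stock kube-prometheus layout in which cadvisor is scraped through the `kubelet` ServiceMonitor at `/metrics/cadvisor`; the endpoint fields shown are assumptions about that manifest, and the label values are taken from section 5:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: monitoring
spec:
  endpoints:
  - path: /metrics/cadvisor        # assumed: cadvisor endpoint of the kubelet
    port: https-metrics
    scheme: https
    interval: 30s
    honorLabels: true
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
    metricRelabelings:
    - targetLabel: env
      replacement: '测试'
    - targetLabel: cluster
      replacement: '华南1b测试'
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kubelet
```

The stock manifest carries additional endpoints and relabelings for the kubelet; in practice the two `metricRelabelings` entries would be appended to the existing list rather than replacing it.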