Manually deploying a Prometheus federation in Kubernetes.
When you run multiple Kubernetes clusters, you eventually need to aggregate their metrics in one place. As in the figure above, we assume a federated Prometheus deployed outside the clusters that scrapes the kube-system and default namespaces of the current k8s cluster.
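Federation works by having the top-level Prometheus scrape the `/federate` endpoint of each lower-level instance, passing one or more `match[]` selectors as repeated URL parameters. A minimal sketch of how such a scrape URL is assembled (the helper name is ours, not part of Prometheus):

```python
from urllib.parse import urlencode

def federate_url(host: str, selectors: list) -> str:
    """Build the /federate scrape URL with repeated match[] parameters."""
    query = urlencode([("match[]", s) for s in selectors])
    return "http://%s/federate?%s" % (host, query)

# Selectors mirroring a typical federation job: every kubernetes.* job
# plus the prometheus job itself.
url = federate_url("prometheus-0.prometheus:9090",
                   ['{job=~"kubernetes.*"}', '{job="prometheus"}'])
print(url)
```

Prometheus itself builds exactly this kind of request for every target of a federation job; the selectors decide which series are re-exposed to the upper level.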
Environment
My local environment was set up with sealos one-click deployment, mainly for ease of testing.
OS | Kubernetes | HostName | IP | Service
---|---|---|---|---
Deploying the Prometheus federation cluster
Create the prometheus-federate data directory
# Run on m1
mkdir -p /data/prometheus-federate/
chown -R 65534:65534 /data/prometheus-federate/
Create the Prometheus federation StorageClass config file
cd /data/manual-deploy/prometheus/
cat prometheus-federate-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus-federate-lpv
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
Create the Prometheus federation PV config file
cat prometheus-federate-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-federate-lpv-0
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: prometheus-federate-lpv
  local:
    path: /data/prometheus-federate
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - sealos-k8s-m1
Create the Prometheus federation ConfigMap config file
cat prometheus-federate-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-federate-config
  namespace: kube-system
data:
  alertmanager_rules.yaml: |
    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          team: ops
        annotations:
          summary: "cluster:{{ $labels.cluster }} {{ $labels.instance }}: High memory usage detected"
          description: "{{ $labels.instance }}: Memory usage is above 80% (current value: {{ $value }})"
  prometheus.yml: |
    global:
      scrape_interval: 30s
      evaluation_interval: 30s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager-0.alertmanager-operated:9093
          - alertmanager-1.alertmanager-operated:9093
    rule_files:
    - "/etc/prometheus/alertmanager_rules.yaml"
    scrape_configs:
    - job_name: 'federate'
      scrape_interval: 30s
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
        - '{job=~"kubernetes.*"}'
        - '{job="prometheus"}'
      static_configs:
      - targets:
        - 'prometheus-0.prometheus:9090'
        - 'prometheus-1.prometheus:9090'
        - 'prometheus-2.prometheus:9090'
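The NodeMemoryUsage expression above treats buffers and page cache as reclaimable, so "used" memory is MemTotal minus (MemFree + Buffers + Cached). The same arithmetic in plain Python, with made-up byte values:

```python
def memory_usage_percent(total, free, buffers, cached):
    """Mirror of the PromQL expression:
    (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100"""
    return (total - (free + buffers + cached)) / total * 100

# Example node: 16 GiB total, 2 GiB free, 1 GiB buffers, 3 GiB cached
gib = 1024 ** 3
pct = memory_usage_percent(16 * gib, 2 * gib, 1 * gib, 3 * gib)
print(round(pct, 1))  # 62.5 -- below the 80% alert threshold
```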
Create the Prometheus federation StatefulSet file
cat prometheus-federate-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-federate
  namespace: kube-system
  labels:
    k8s-app: prometheus-federate
    kubernetes.io/cluster-service: "true"
spec:
  serviceName: "prometheus-federate"
  podManagementPolicy: "Parallel"
  replicas: 1
  selector:
    matchLabels:
      k8s-app: prometheus-federate
  template:
    metadata:
      labels:
        k8s-app: prometheus-federate
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - prometheus-federate
            topologyKey: "kubernetes.io/hostname"
      priorityClassName: system-cluster-critical
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: prometheus-federate-configmap-reload
        image: "jimmidyson/configmap-reload:v0.4.0"
        imagePullPolicy: "IfNotPresent"
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9091/-/reload
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
          readOnly: true
        resources:
          limits:
            cpu: 10m
            memory: 10Mi
          requests:
            cpu: 10m
            memory: 10Mi
        securityContext:
          runAsUser: 0
          privileged: true
      - image: prom/prometheus:v2.20.0
        imagePullPolicy: IfNotPresent
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--web.listen-address=0.0.0.0:9091"
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        - "--web.console.libraries=/etc/prometheus/console_libraries"
        - "--web.console.templates=/etc/prometheus/consoles"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9091
          protocol: TCP
        volumeMounts:
        - mountPath: "/prometheus"
          name: prometheus-federate-data
        - mountPath: "/etc/prometheus"
          name: config-volume
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9091
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9091
          initialDelaySeconds: 30
          timeoutSeconds: 30
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 1000m
            memory: 2500Mi
        securityContext:
          runAsUser: 0
          privileged: true
      serviceAccountName: prometheus
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-federate-config
  volumeClaimTemplates:
  - metadata:
      name: prometheus-federate-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "prometheus-federate-lpv"
      resources:
        requests:
          storage: 5Gi
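The configmap-reload sidecar above simply watches the mounted config directory and POSTs to Prometheus's /-/reload webhook (enabled by --web.enable-lifecycle) when a file changes. A rough sketch of that loop, polling mtimes instead of using inotify and taking a callback in place of the HTTP POST:

```python
import os
import time

def snapshot(path):
    """Map file name -> mtime for each entry in the watched config dir."""
    return {f: os.path.getmtime(os.path.join(path, f))
            for f in os.listdir(path)}

def watch(path, on_change, interval=1.0, polls=None):
    """Poll the directory; call on_change whenever the snapshot differs.
    The real sidecar does the same, except on_change is an HTTP POST to
    the --webhook-url (Prometheus's /-/reload endpoint)."""
    seen = snapshot(path)
    n = 0
    while polls is None or n < polls:
        time.sleep(interval)
        now = snapshot(path)
        if now != seen:
            seen = now
            on_change()
        n += 1
```

Kubernetes updates mounted ConfigMap contents by swapping symlinks, which changes the observed mtimes, so a watcher like this is enough to trigger a live reload without restarting the pod.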
Create the Prometheus federation svc file
cat prometheus-federate-service-statefulset.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-federate
  namespace: kube-system
spec:
  ports:
  - name: prometheus-federate
    port: 9091
    targetPort: 9091
  selector:
    k8s-app: prometheus-federate
  clusterIP: None
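Because clusterIP is None, this is a headless service: each StatefulSet pod gets a stable per-pod DNS record of the form pod-name.service-name (within the namespace). The federate targets in the ConfigMap ('prometheus-0.prometheus:9090' and so on) follow exactly that pattern. A small helper to generate such target lists (the function is ours, for illustration):

```python
def statefulset_dns(name, service, replicas, port):
    """Stable per-pod DNS targets for a StatefulSet behind a headless service."""
    return ["%s-%d.%s:%d" % (name, i, service, port) for i in range(replicas)]

targets = statefulset_dns("prometheus", "prometheus", 3, 9090)
print(targets)
# ['prometheus-0.prometheus:9090', 'prometheus-1.prometheus:9090',
#  'prometheus-2.prometheus:9090']
```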
Deploy
cd /data/manual-deploy/prometheus/
ls
prometheus-federate-configmap.yaml
prometheus-federate-pv.yaml
prometheus-federate-service-statefulset.yaml
prometheus-federate-statefulset.yaml
prometheus-federate-storageclass.yaml
kubectl apply -f prometheus-federate-storageclass.yaml
kubectl apply -f prometheus-federate-pv.yaml
kubectl apply -f prometheus-federate-configmap.yaml
kubectl apply -f prometheus-federate-statefulset.yaml
kubectl apply -f prometheus-federate-service-statefulset.yaml
Verify
# pvc
kubectl -n kube-system get pvc |grep federate
prometheus-federate-data-prometheus-federate-0 Bound prometheus-federate-lpv-0 10Gi RWO prometheus-federate-lpv 4h
kubectl -n kube-system get pod |grep federate
prometheus-federate-0 2/2 Running 0 2d4h
With that, the federation setup is complete. Open 192.168.1.151:9091 in a browser to check the targets and the configured rules, then trigger an alert to confirm that the Alertmanager cluster's alerting pipeline is working.
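Besides the web UI, target health can also be read from the /api/v1/targets HTTP API. A sketch that counts healthy targets from such a response (the JSON below is a shortened, made-up sample of what GET http://192.168.1.151:9091/api/v1/targets returns):

```python
import json

def healthy_targets(api_response):
    """Return (up, total) counts from a Prometheus /api/v1/targets payload."""
    active = api_response["data"]["activeTargets"]
    up = sum(1 for t in active if t["health"] == "up")
    return up, len(active)

# Shortened, hypothetical sample payload
sample = json.loads("""
{"status":"success","data":{"activeTargets":[
  {"labels":{"job":"federate","instance":"prometheus-0.prometheus:9090"},"health":"up"},
  {"labels":{"job":"federate","instance":"prometheus-1.prometheus:9090"},"health":"up"},
  {"labels":{"job":"federate","instance":"prometheus-2.prometheus:9090"},"health":"down"}
]}}
""")
print(healthy_targets(sample))  # (2, 3)
```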