Building a Minimal Prometheus Federation Cluster with Docker

2023-10-21 10:10:58

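Before I knew it, it was the end of August. I have spent most of my recent time getting to know Prometheus and kept meaning to knock out a post about it, but I kept procrastinating. Now it's time; otherwise it will be September and I'll miss my blogging goal, haha.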

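Preface

This post introduces the Prometheus Federation clustering mechanism and the steps for setting up a minimal Prometheus Federation cluster with Docker as a playground environment. It is not a step-by-step tutorial.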

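Prometheus

Let's first review the components of the Prometheus ecosystem and what each one is responsible for. The overview below is based on the architecture and ecosystem diagram from the official Prometheus documentation: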

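  • Prometheus targets (Jobs/exporters): the data sources that expose monitoring metrics. Prometheus is a pull-based monitoring system;
  • Pushgateway: a gateway for pushed metrics. Prometheus scrapes metrics at intervals, so a short-lived job may exit before Prometheus ever gets to scrape it. The Pushgateway acts as an intermediary that such jobs can push their metrics to;
  • Prometheus Server: the core component
    • Retrieval: scrapes metrics from the configured targets. Prometheus is configured through a YAML file, which controls the scrape interval, the evaluation of alerting rules, the scrape targets, and so on;
    • TSDB: the local time series database built into Prometheus. It has evolved from a prototype through V1 to the current V2 and has been heavily optimized along the way; for details, see the article The Evolution of Prometheus Storage Layer. Exporters expose metrics in a text format. In version 2, Prometheus dropped the Protocol Buffers serialization it had previously supported and implemented a text decoder, which improved performance. For more of the reasoning behind this, see the document protobuf_vs_text;
    • HTTP Server: provides an HTTP API for interacting with the TSDB and Prometheus, which makes custom operations possible in more scenarios;
  • Service discovery: Prometheus has built-in service discovery mechanisms, including file-based, DNS, and Consul discovery, all set up in the configuration file. In large-scale monitoring, they make it easy to find the targets Prometheus should scrape;
  • Alertmanager: the alert notification component. It provides alert grouping, inhibition, silencing, email notification, WebHook, and other mechanisms. Note that Prometheus Server evaluates the alerting rules but does not send the notifications; Alertmanager does. Alert notification is heavy work and far from simple. The article 搞搞 Prometheus: Alertmanager analyzes Alertmanager in depth and is worth a look;
  • PromQL: a DSL provided by Prometheus for querying and aggregating time series data. Rule evaluation (alerting rules and recording rules) also uses PromQL;
  • Prometheus Web UI: the Web UI that ships with Prometheus. It offers a number of data visualization features, but its chart support is limited, which is why the community adopted Grafana. Grafana's dashboard management is powerful but still has drawbacks; the article 如何使 Grafana as code (How to Make Grafana as Code) discusses one of them.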
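
To make the configurable parts in the list above more concrete, here is a minimal sketch of a `prometheus.yml` covering the scrape interval, rule files, the hand-off to Alertmanager, and file-based service discovery. The file paths, the job name, and the `alertmanager:9093` address are assumptions chosen for illustration and are not taken from a real deployment.

```yaml
# Hypothetical prometheus.yml; paths and hostnames are placeholders.
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting/recording rules are evaluated

# Rule files are evaluated by the Prometheus server itself.
rule_files:
  - /etc/prometheus/alert.rules

# Firing alerts are handed off to Alertmanager, which sends the notifications.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

scrape_configs:
  - job_name: 'file-sd-example'
    # File-based service discovery: targets are read from the listed files
    # and picked up again when those files change.
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
        refresh_interval: 1m
```

The split between `rule_files` and `alerting` mirrors the division of labor described above: Prometheus decides when an alert fires, and Alertmanager decides who hears about it.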

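The Federation Mechanism

Prometheus Federation is a clustering-style scaling capability built into Prometheus itself. When there are many services to monitor, we deploy many Prometheus nodes, each pulling the metrics exposed by a subset of those services. Federation aggregates the metrics collected by these separately deployed nodes and stores them in a central Prometheus, as shown in the figure below:

To use Federation, adjust the following fields in the Prometheus configuration file: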

scrape_configs:

  - job_name: 'federate'
    scrape_interval: 10s

    honor_labels: true
    metrics_path: '/federate'

    # Use the match[] parameter to select which metrics to pull;
    # do not pull the full set of metrics
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{job="node"}'
        - '{job="blackbox"}'

    static_configs:
      # The other Prometheus nodes
      - targets:
        - 'prometheus-follower-1:9090'
        - 'prometheus-follower-2:9090'

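For more discussion of Federation clusters, see the article 别再乱用prometheus联邦了,分享一个multi_remote_read的方案来实现prometheus高可用 (Stop Misusing Prometheus Federation: A multi_remote_read Approach to Prometheus High Availability).

By partitioning along service function, Federation lets us split Prometheus scrape jobs at the job level, which supports scaling out.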
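
As a sketch of this job-level split, each follower could own exactly one job whose name matches one of the `match[]` selectors on the central node. The two fragments below are meant as two separate files, one per follower. The target addresses (`node_exporter:9100`, `blackbox_exporter:9115`), the `http_2xx` module, and the probed URL are assumptions; the actual files in the demo repository may differ.

```yaml
# Follower 1 (hypothetical prometheus.yml): host metrics only.
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node_exporter:9100'
---
# Follower 2 (hypothetical prometheus.yml): network probes only.
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]            # probe module defined on the Blackbox Exporter side
    static_configs:
      - targets:
          - 'https://example.com'   # placeholder URL to probe
    relabel_configs:
      # Pass the listed target as the ?target= query parameter ...
      - source_labels: [__address__]
        target_label: __param_target
      # ... keep it as the instance label ...
      - source_labels: [__param_target]
        target_label: instance
      # ... and send the actual scrape to the Blackbox Exporter.
      - target_label: __address__
        replacement: 'blackbox_exporter:9115'
```

Keeping the job names aligned with the selectors matters: a series whose `job` label matches none of the `match[]` entries is simply never pulled by the central node.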

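Building a Minimal Federation Cluster with Docker

The sections above briefly recapped Prometheus. Now back to building a minimal Federation setup. The architecture of the minimal Federation cluster we are going to build is shown in the figure below:

As the figure shows, two Prometheus follower containers pull the metrics exposed by Node Exporter (host status) and Blackbox Exporter (network conditions) respectively, and a central Prometheus leader aggregates them. Grafana is deployed as a visualization dashboard for the leader and for one of the followers, so the metrics can be inspected there. Alertmanager also runs as a container.

Alert notifications go out through a WebHook, in this case a DingTalk group robot. Alerts are configured for host memory usage and CPU load, alongside a target-down alert, with the following rules: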

groups:
- name: targets
  rules:
  - alert: monitor_service_down
    expr: up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Monitor service non-operational"
      description: "Service {{ $labels.instance }} is down."

- name: host
  rules:
  - alert: high_cpu_load
    expr: node_load1 > 1.5
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Server under high load"
      description: "Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

  - alert: high_memory_load
    expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 45
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Server memory is almost full"
      description: "Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

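We use Docker volume mounts to place the Prometheus configuration file and the alert rule files at the paths where Prometheus looks for them. Grafana's dashboards and login-related configuration are supplied in a similar way. For more Prometheus configuration options, see prometheus configuration; for more Grafana options, see Grafana Provisioning.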
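
For the Grafana side, a provisioned data source file might look roughly like the sketch below. The mount path comes from the Compose file in this post, but the contents are an assumption about what such a file typically holds, not a copy of the repository's file.

```yaml
# Sketch of configs/grafana-leader/provisioning/datasources/config.yml
# (contents assumed for illustration).
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                       # Grafana's backend proxies the queries
    url: http://prometheus-leader:9090  # container name on the Docker network
    isDefault: true
```

Because Grafana reaches Prometheus over the Docker network, the URL uses the container name and the in-container port 9090 rather than a port published on the host.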

For communication within the Federation cluster, we create a Docker network named monitoring_network. The containers are orchestrated with Docker Compose; the Compose file is as follows:

docker-compose.yml

version: '3.5'

networks:
    monitoring_network:

volumes:
    prometheus_leader_data: {}
    prometheus_follower_1_data: {}
    prometheus_follower_2_data: {}
    grafana_leader_data: {}
    grafana_follower_data: {}

services:
    prometheus-leader:
        container_name: prometheus-leader
        image: prom/prometheus
        networks:
            - monitoring_network
        volumes:
            - ./configs/prometheus-leader/prometheus.yml:/etc/prometheus/prometheus.yml
            - ./configs/prometheus-leader/alerts/alert.rules:/etc/prometheus/alert.rules
            - prometheus_leader_data:/prometheus
        ports:
            - "9090:9090"
        command:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
        restart: unless-stopped

    prometheus-follower-1:
        container_name: prometheus-follower-1
        image: prom/prometheus
        networks:
            - monitoring_network
        volumes:
            - ./configs/prometheus-follower-1/prometheus.yml:/etc/prometheus/prometheus.yml
            - ./configs/prometheus-follower-1/records/node_exporter_recording.rules:/etc/prometheus/node_exporter_recording.rules
            - ./configs/prometheus-follower-1/alerts/node_exporter_alert.rules:/etc/prometheus/node_exporter_alert.rules
            - prometheus_follower_1_data:/prometheus
        ports:
            - "9099:9090"
        command:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
        restart: unless-stopped

    prometheus-follower-2:
        container_name: prometheus-follower-2
        image: prom/prometheus
        networks:
            - monitoring_network
        volumes:
            - ./configs/prometheus-follower-2/prometheus.yml:/etc/prometheus/prometheus.yml
            # - ./configs/prometheus-follower-2/records/node_exporter_recording.rules:/etc/prometheus/node_exporter_recording.rules
            # - ./configs/prometheus-follower-2/alerts/node_exporter_alert.rules:/etc/prometheus/node_exporter_alert.rules
            - prometheus_follower_2_data:/prometheus
        ports:
            - "9098:9090"
        command:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
        restart: unless-stopped

    grafana_leader:
        container_name: grafana_leader
        image: grafana/grafana
        networks:
            - monitoring_network
        volumes:
            - ./configs/grafana-leader/provisioning/dashboards:/etc/grafana/provisioning/dashboards
            - ./configs/grafana-leader/provisioning/datasources/config.yml:/etc/grafana/provisioning/datasources/config.yml
            - grafana_leader_data:/etc/grafana
        environment:
            - TERM=linux
            - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
            - GF_SECURITY_ADMIN_USER=admin
            - GF_SECURITY_ADMIN_PASSWORD=admin123456
        ports:
            - "3000:3000"
        restart: unless-stopped

    grafana_follower:
        container_name: grafana_follower
        image: grafana/grafana
        networks:
            - monitoring_network
        volumes:
            - ./configs/grafana-follower/provisioning/dashboards:/etc/grafana/provisioning/dashboards
            - ./configs/grafana-follower/provisioning/datasources/config.yml:/etc/grafana/provisioning/datasources/config.yml
            - grafana_follower_data:/etc/grafana
        environment:
            - TERM=linux
            - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
            - GF_SECURITY_ADMIN_USER=admin
            - GF_SECURITY_ADMIN_PASSWORD=admin123456
        ports:
            - "3001:3000"
        restart: unless-stopped

    node_exporter:
        image: quay.io/prometheus/node-exporter:latest
        container_name: node_exporter_stats
        networks:
            - monitoring_network
        ports:
            - "9100:9100"
        expose:
            - "9100"
        restart: unless-stopped

    blackbox_exporter:
        image: prom/blackbox-exporter
        container_name: blackbox_exporter
        networks:
            - monitoring_network
        ports:
            - "9115:9115"
        restart: unless-stopped

    alertmanager:
        image: prom/alertmanager
        container_name: alertmanager
        networks:
            - monitoring_network
        volumes:
            - ./configs/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
        ports:
            - "9093:9093"
        restart: unless-stopped

    dingtalk-robot:
        image: timonwong/prometheus-webhook-dingtalk
        container_name: dingtalk-robot
        networks:
            - monitoring_network
        ports:
            - "8060:8060"
        volumes:
            - ./configs/dingtalk/config.yml:/etc/prometheus-webhook-dingtalk/config.yml
        restart: unless-stopped

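The DingTalk notification configuration follows this article: 将钉钉接入 Prometheus AlertManager WebHook (Connecting DingTalk to the Prometheus AlertManager WebHook).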
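
To show how Alertmanager hands alerts to the DingTalk adapter, here is a minimal sketch of an `alertmanager.yml`. It assumes the adapter serves each configured target under `/dingtalk/<target-name>/send`, as the prometheus-webhook-dingtalk README describes. The target name `webhook1` is the one used in the adapter configuration shown later in this post; the receiver name and the grouping intervals are illustrative values, not the repository's settings.

```yaml
# Hypothetical alertmanager.yml; intervals are illustrative.
route:
  receiver: dingtalk
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1h

receivers:
  - name: dingtalk
    webhook_configs:
      # "webhook1" must match a target name in the DingTalk adapter's config.yml.
      - url: http://dingtalk-robot:8060/dingtalk/webhook1/send
        send_resolved: true
```

Alertmanager never talks to DingTalk directly here: it posts to the adapter container, which translates the payload and forwards it to the robot's WebHook.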

The configuration files and the Compose file have been uploaded to GitHub. For more configuration details, see the repository yeshan333/prometheus-federation-minimal-demo. You can also clone the project and run it locally:

  • 1. Clone the repository:
git clone https://github.com/yeshan333/prometheus-federation-minimal-demo
cd prometheus-federation-minimal-demo
  • 2. Replace the DingTalk robot's WebHook URL, which is set in configs/dingtalk/config.yml:
vim configs/dingtalk/config.yml
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=<dingtalk-robot-access-token>
  • 3. Start the federation cluster with docker-compose. You may need to install Docker and docker-compose first; see Get Docker:
docker-compose up -d
  • 4. The expected container status is:
$ docker-compose ps
NAME                    COMMAND                  SERVICE                 STATUS              PORTS
alertmanager            "/bin/alertmanager -…"   alertmanager            running             0.0.0.0:9093->9093/tcp, :::9093->9093/tcp
blackbox_exporter       "/bin/blackbox_expor…"   blackbox_exporter       running             0.0.0.0:9115->9115/tcp, :::9115->9115/tcp
dingtalk-robot          "/bin/prometheus-web…"   dingtalk-robot          running             0.0.0.0:8060->8060/tcp, :::8060->8060/tcp
grafana_follower        "/run.sh"                grafana_follower        running             0.0.0.0:3001->3000/tcp, :::3001->3000/tcp
grafana_leader          "/run.sh"                grafana_leader          running             0.0.0.0:3000->3000/tcp, :::3000->3000/tcp
node_exporter_stats     "/bin/node_exporter"     node_exporter           running             0.0.0.0:9100->9100/tcp, :::9100->9100/tcp
prometheus-follower-1   "/bin/prometheus --c…"   prometheus-follower-1   running             0.0.0.0:9099->9090/tcp, :::9099->9090/tcp
prometheus-follower-2   "/bin/prometheus --c…"   prometheus-follower-2   running             0.0.0.0:9098->9090/tcp, :::9098->9090/tcp
prometheus-leader       "/bin/prometheus --c…"   prometheus-leader       running             0.0.0.0:9090->9090/tcp, :::9090->9090/tcp
  • 5. Open Grafana Leader at http://localhost:3000 and the Alertmanager UI at http://localhost:9093.

End.
