Building an Enterprise-Grade Monitoring Platform (Part 24): Prometheus with Grafana Dashboards and Alerting

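Earlier articles covered Prometheus labels, PromQL, AlertManager, DingTalk alerting through Alertmanager, Pushgateway, Kubernetes-based service discovery, and monitoring of common services. Today I will walk through configuring Prometheus with Grafana for dashboards and alerting in detail, and I hope you get a lot out of it. If it helps, please like and share it to show your support!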


Deploying Prometheus from the Binary Package

Environment Preparation

Prometheus server   192.168.109.138   Prometheus, node_exporter
Grafana server      192.168.109.138   Grafana
Monitored servers   192.168.109.0/24  node_exporter

Deployment

Upload prometheus-2.35.0.linux-amd64.tar.gz to the /opt directory and extract it.

# Extract the uploaded package
[root@localhost opt]# tar xf prometheus-2.35.0.linux-amd64.tar.gz
# Move it and rename the directory
[root@localhost opt]# mv prometheus-2.35.0.linux-amd64 /usr/local/prometheus
[root@localhost opt]# cd /usr/local/prometheus
[root@localhost prometheus]# ls
console_libraries  consoles  LICENSE  NOTICE  prometheus  prometheus.yml  promtool

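Configuration File

cat /usr/local/prometheus/prometheus.yml | grep -v "^#"
global:     # Global settings for Prometheus, such as the scrape interval and the scrape timeout
  scrape_interval: 15s   # How often to scrape monitoring data from the targets; default is 1m
  evaluation_interval: 15s   # How often to evaluate recording and alerting rules; default is 1m
  # scrape_timeout is set to the global default (10s).
  scrape_timeout: 10s   # Timeout for a single scrape; default is 10s

alerting:    # Settings for Alertmanager instances; supports static configuration and dynamic service discovery
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

rule_files:    # Paths of the alerting rule files to load; file name globs are supported
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:   # Settings for the sources of time series data to scrape
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"  # Each group of monitored instances is named by job_name; supports static configuration (static_configs) and dynamic service discovery (*_sd_configs)

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:    # Static target configuration: always pull data from the listed targets
      - targets: ["localhost:9090"]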
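
The interval and timeout values above are Prometheus duration strings such as 15s or 1m, and Prometheus refuses to load a scrape configuration whose timeout is longer than its interval. The following is a minimal Python sketch of that check, useful as a pre-flight test before editing the file. The function names parse_duration and timeout_fits are hypothetical helpers, not part of Prometheus, and the unit table is my reading of the documented duration units.

```python
import re

# Seconds per unit, following the duration units Prometheus documents
# (ms, s, m, h, d, w, y). "y" is treated as 365 days.
_UNIT_SECONDS = {
    "ms": 0.001, "s": 1, "m": 60, "h": 3600,
    "d": 86400, "w": 604800, "y": 31536000,
}
# "ms" is listed first so that it is not read as "m" followed by a stray "s".
_TOKEN = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration(text):
    """Convert a duration such as '15s' or '1h30m' into seconds."""
    pos = 0
    total = 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return total


def timeout_fits(scrape_interval, scrape_timeout):
    """Return True when the timeout is not longer than the interval."""
    return parse_duration(scrape_timeout) <= parse_duration(scrape_interval)


if __name__ == "__main__":
    # Values taken from the global section shown above.
    print(timeout_fits("15s", "10s"))
```

For the values in this article (15s interval, 10s timeout) the check passes. promtool check config, shipped in the same tarball, remains the authoritative validator.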

Configure the systemd unit file and start Prometheus

cat > /usr/lib/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--storage.tsdb.path=/usr/local/prometheus/data/ \
--storage.tsdb.retention=15d \
--web.enable-lifecycle

ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

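---------------------------------------------------------------
Notes on the unit file (these annotations are explanations only; systemd does not accept trailing comments, so do not copy them into the real file):

[Unit]                                                  unit section
Description=Prometheus Server                           description
Documentation=https://prometheus.io
After=network.target                                    ordering: start after network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus/prometheus
--config.file=/usr/local/prometheus/prometheus.yml      configuration file
--storage.tsdb.path=/usr/local/prometheus/data/         data directory
--storage.tsdb.retention=15d                            how long to keep data
--web.enable-lifecycle                                  enable hot reload through the HTTP API

ExecReload=/bin/kill -HUP $MAINPID                      reload by sending SIGHUP
Restart=on-failure

[Install]
WantedBy=multi-user.target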

For more articles in this series, see: Building an Enterprise-Grade Monitoring Platform. The series is updated continuously.

Start Prometheus

systemctl start prometheus
systemctl enable prometheus

netstat -natp | grep :9090

Open http://192.168.109.138:9090 in a browser to reach the Prometheus Web UI. Click Status -> Targets; if every target shows the state UP, Prometheus is scraping data normally.

At http://192.168.109.138:9090/metrics you can see the metrics that Prometheus exposes about itself.

Deploying Exporters

Deploy Node Exporter to monitor system-level metrics: upload node_exporter-1.3.1.linux-amd64.tar.gz to the /opt directory and extract it.

cd /opt/
tar xf node_exporter-1.3.1.linux-amd64.tar.gz
mv node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin

Configure the systemd unit file

cat > /usr/lib/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.ntp \
--collector.mountstats \
--collector.systemd \
--collector.tcpstat

ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start Node Exporter

systemctl start node_exporter
systemctl enable node_exporter

netstat -natp | grep :9100

Open http://192.168.109.138:9100/metrics in a browser to see the metrics collected by Node Exporter.
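
The /metrics page is plain text in the Prometheus exposition format: comment lines starting with #, then one sample per line as a metric name, optional labels in braces, and a value. As a rough illustration of how that text is structured, here is a minimal Python sketch that parses it. The function name parse_metrics and the sample text are made up for this example, and the parser is deliberately simplified (it ignores timestamps and does not unescape label values).

```python
import re

# metric_name{label="value",...} value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces; the value may contain escaped characters.
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative input only; real output from node_exporter is much longer.
EXAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_memory_MemTotal_bytes 8.201e+09
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(EXAMPLE):
        print(name, labels, value)
```

In practice Prometheus does this parsing itself; the sketch is only meant to make the page easier to read by eye.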


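Commonly used metrics:

node_cpu_seconds_total
node_memory_MemTotal_bytes
node_filesystem_size_bytes{mountpoint=PATH}
node_systemd_unit_state{name=}
node_vmstat_pswpin: cumulative number of pages swapped in from disk to memory (a counter; apply rate() to get a per-second value)
node_vmstat_pswpout: cumulative number of pages swapped out from memory to disk (a counter; apply rate() to get a per-second value)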
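
Metrics such as node_cpu_seconds_total are cumulative counters, so a raw value only becomes meaningful once it is turned into a per-second rate over a time window. The sketch below shows the basic arithmetic in Python under simplifying assumptions: it handles counter resets but does not extrapolate to the window boundaries the way the PromQL rate() function does. The name per_second_rate is hypothetical.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples is a list of (timestamp_seconds, counter_value) pairs sorted
    by time. Returns None when a rate cannot be computed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # The counter went backwards, which means it was reset
            # (for example after a reboot); count the new value from zero.
            increase += current
        else:
            increase += current - previous
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


if __name__ == "__main__":
    # Three scrapes, 15 seconds apart, of a hypothetical counter.
    print(per_second_rate([(0, 100.0), (15, 130.0), (30, 160.0)]))
```

With three scrapes 15 seconds apart and values 100, 130, 160, the counter grew by 60 over 30 seconds, so the result is 2.0 per second.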

Edit the Prometheus configuration file to add the nodes to Prometheus monitoring.

vim /usr/local/prometheus/prometheus.yml
# Append the following at the end of the file
  - job_name: nodes
    metrics_path: "/metrics"
    static_configs:
    - targets:
      - 192.168.109.138:9100
      - 192.168.109.137:9100
      - 192.168.109.136:9100
      labels:
        service: kubernetes

Reload the configuration

curl -X POST http://192.168.109.138:9090/-/reload     # hot reload (requires --web.enable-lifecycle)
# or
systemctl reload prometheus

Check Status -> Targets on the Prometheus page in a browser.
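
If the reload needs to be triggered from a script rather than by hand, the same POST to /-/reload can be sent with the Python standard library. This is a minimal sketch; the function names are hypothetical, and it assumes Prometheus was started with --web.enable-lifecycle (otherwise the endpoint rejects the request and urlopen raises an HTTPError).

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request for the Prometheus reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    # An empty body is enough; only the HTTP method matters here.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url, timeout=5.0):
    """Send the reload request and report whether Prometheus answered 200."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200


if __name__ == "__main__":
    # Address of the Prometheus server used throughout this article.
    print(reload_prometheus("http://192.168.109.138:9090"))
```

Splitting the request construction from the network call keeps the URL logic easy to check without a running server.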

Deploying Grafana for Visualization

Download and Installation

Download locations:

https://grafana.com/grafana/download
https://mirrors.bfsu.edu.cn/grafana/yum/rpm/

# Install with yum so that dependencies are resolved; here the package was uploaded directly to /opt
yum install -y grafana-7.4.0-1.x86_64.rpm


systemctl start grafana-server
systemctl enable grafana-server

netstat -natp | grep :3000

Open http://192.168.109.138:3000 in a browser. The default username and password are admin/admin.

Configuring the Data Source

Go to Configuration -> Data Sources -> Add data source and select Prometheus. Under HTTP -> URL enter http://192.168.109.138:9090, then click Save & Test.
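
The same data source can also be created without the UI, because Grafana exposes an HTTP API for data sources (POST /api/datasources). The Python sketch below only builds the JSON body for such a request; the function name and the data source name "Prometheus" are assumptions for illustration, and the field set is the minimal one I would expect to need, so check it against the API documentation for your Grafana version.

```python
import json


def prometheus_datasource_payload(url, name="Prometheus", is_default=True):
    """Build the JSON body for creating a Prometheus data source."""
    return {
        "name": name,            # display name in Grafana (assumed)
        "type": "prometheus",    # data source plugin type
        "url": url,              # address Grafana uses to reach Prometheus
        "access": "proxy",       # the Grafana server queries Prometheus itself
        "isDefault": is_default,
    }


if __name__ == "__main__":
    payload = prometheus_datasource_payload("http://192.168.109.138:9090")
    print(json.dumps(payload, indent=2))
```

The printed JSON would be sent to http://192.168.109.138:3000/api/datasources with admin credentials and a Content-Type of application/json.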

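Click the Dashboards tab at the top and import all of the default dashboards. Then go to Dashboards -> Manage and select Prometheus 2.0 Stats or Prometheus Stats to see graphs for the Prometheus job instances.

Importing Grafana Dashboards

Visit https://grafana.com/grafana/dashboards in a browser, search for node exporter, pick a suitable dashboard, and click Copy ID or Download JSON.

In Grafana, go to Create -> Import, enter the dashboard ID or upload the JSON file, and click Load to import the dashboard.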


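Deploying Prometheus Service Discovery

File-Based Service Discovery

File-based service discovery is only slightly better than static configuration, but it does not depend on any platform or third-party service, which makes it the simplest and most general approach.

Prometheus Server periodically reloads target information from files. The files can be in YAML or JSON format and contain a list of targets along with optional labels.

Create a file for service discovery and define the required targets in it.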

cd /usr/local/prometheus
mkdir targets

vim targets/node-exporter.yaml
- targets:
  - 192.168.109.131:9100
  - 192.168.109.132:9100
  - 192.168.109.133:9100
  labels:
    app: node-exporter
    job: node

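# Edit the Prometheus configuration file; the target discovery settings are defined inside a job
vim /usr/local/prometheus/prometheus.yml
......
scrape_configs:
  - job_name: nodes
    file_sd_configs:                  # Use file-based service discovery
    - files:                          # List of files to load
      - targets/node*.yaml            # File names may use glob patterns
      refresh_interval: 2m            # Re-read the targets from the files every 2 minutes; default is 5m

systemctl reload prometheus

Check Status -> Targets on the Prometheus page in a browser.

This assumes node_exporter is already installed on each of those nodes. That step is not repeated here; you can copy the binary over from the Prometheus host with scp.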
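
Because the target file is just data, it can be generated from an inventory instead of being edited by hand. The following Python sketch writes the JSON form of a file_sd target file. The helper names are hypothetical, and it assumes the files glob in prometheus.yml has been extended to match .json files (for example targets/*.json), since the pattern shown above only matches .yaml.

```python
import json
import os


def build_file_sd(hosts, port=9100, labels=None):
    """Return one target group in the structure file_sd expects."""
    group = {"targets": [f"{host}:{port}" for host in hosts]}
    if labels:
        group["labels"] = dict(labels)
    return [group]


def write_file_sd(path, groups):
    """Write the target groups to path, replacing the old file in one step."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, indent=2)
    # Renaming avoids Prometheus reading a half-written file.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    groups = build_file_sd(
        ["192.168.109.131", "192.168.109.132", "192.168.109.133"],
        labels={"app": "node-exporter", "job": "node"},
    )
    print(json.dumps(groups, indent=2))
```

Prometheus picks up changes to the file on its own, so no reload is needed after the file is rewritten.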

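Consul-Based Service Discovery

Consul is an open-source tool written in Go that provides service registration, service discovery, and configuration management for distributed, service-oriented systems. Its features include service registration and discovery, health checks, key/value storage, multi-datacenter support, and distributed consistency guarantees.

Download: https://www.consul.io/downloads/

Deploying the Consul Service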

cd /opt/
unzip consul_1.9.2_linux_amd64.zip
mv consul /usr/local/bin/

# Create the data and configuration directories for Consul
mkdir /var/lib/consul-data
mkdir /etc/consul/

# Start Consul in server mode
consul agent \
-server \
-bootstrap \
-ui \
-data-dir=/var/lib/consul-data \
-config-dir=/etc/consul/ \
-bind=192.168.109.138 \
-client=0.0.0.0 \
-node=consul-server01 &> /var/log/consul.log &

# List the members of the Consul cluster
consul members

Registering Services in Consul

# Add a file to the configuration directory
vim /etc/consul/nodes.json
{
  "services": [
    {
      "id": "node_exporter-node01",
      "name": "node01",
      "address": "192.168.109.138",
      "port": 9100,
      "tags": ["nodes"],
      "checks": [{
        "http": "http://192.168.109.138:9100/metrics",
        "interval": "5s"
      }]
    },
    {
      "id": "node_exporter-node02",
      "name": "node02",
      "address": "192.168.109.134",
      "port": 9100,
      "tags": ["nodes"],
      "checks": [{
        "http": "http://192.168.109.134:9100/metrics",
        "interval": "5s"
      }]
    }
  ]
}

# Have Consul reload its configuration
consul reload

Open http://192.168.109.138:8500 in a browser.
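
The service definitions in nodes.json follow a regular pattern, so for more than a handful of hosts it may be easier to generate the file. This Python sketch reproduces the structure used above; the function names and the node01, node02 naming scheme are simply copied from the example in this article and are not anything Consul requires.

```python
import json


def node_exporter_service(index, address, port=9100, tag="nodes", interval="5s"):
    """Build one Consul service definition for a node_exporter instance."""
    name = f"node{index:02d}"
    return {
        "id": f"node_exporter-{name}",
        "name": name,
        "address": address,
        "port": port,
        "tags": [tag],
        # HTTP health check against the exporter's own metrics endpoint.
        "checks": [{
            "http": f"http://{address}:{port}/metrics",
            "interval": interval,
        }],
    }


def build_services_file(addresses):
    """Build the full document stored in /etc/consul/nodes.json."""
    services = [
        node_exporter_service(index, address)
        for index, address in enumerate(addresses, start=1)
    ]
    return {"services": services}


if __name__ == "__main__":
    document = build_services_file(["192.168.109.138", "192.168.109.134"])
    print(json.dumps(document, indent=2))
```

Run with the two addresses from this article, it prints a document equivalent to the hand-written nodes.json.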


Editing the Prometheus Configuration File

vim /usr/local/prometheus/prometheus.yml
......
  - job_name: nodes
    consul_sd_configs:                  # Use Consul service discovery
    - server: 192.168.109.138:8500        # Address of the Consul server
      tags:                             # Only services carrying these tags are added to Prometheus monitoring
      - nodes
      refresh_interval: 2m


systemctl reload prometheus

Check Status -> Targets on the Prometheus page in a browser.

# Deregister a service from Consul
consul services deregister -id="node_exporter-node02"

# Register the services again
consul services register /etc/consul/nodes.json

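Grafana OneAlert Alerting

Alerting in Prometheus requires the Alertmanager component, and the alerting rules have to be written by hand, which is not very friendly for operations staff. So here I use Grafana with OneAlert for alerting. Note: before setting up alerting, check again that the clocks on all machines are synchronized.

Go to http://www.onealert.com/, register an account, and log in to the management console.

Configuring the Webhook URL in Grafana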

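1. In Grafana, create a Notification channel and select Webhook as the type.
2. It is recommended to enable Send on all alerts and Include image for a better Cloud Alert experience.
3. Enter the Webhook URL generated in the first step into the Url field under Webhook settings.
URL format: http://api.aiops.com/alert/api/event/grafana/v1/7a2eb59ab2d24483847b17e74bd9b255/
4. Set Http Method to POST.
5. Click Send Test, then Save.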
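
With a Webhook channel, Grafana sends each alert to the configured URL as a JSON document in an HTTP POST. To give an idea of what the receiving side works with, here is a small Python sketch that turns such a document into a one-line summary. The field names (state, ruleName, evalMatches with metric and value) follow my understanding of the legacy Grafana webhook format and should be treated as assumptions to verify against your Grafana version; summarize_alert is a hypothetical helper, and the example values are invented.

```python
def summarize_alert(payload):
    """Build a one-line summary from a Grafana webhook alert payload."""
    state = payload.get("state", "unknown")
    rule = payload.get("ruleName", "unnamed rule")
    # evalMatches may be missing or null, for example when the alert recovers.
    matches = payload.get("evalMatches") or []
    parts = [f"{m.get('metric', '?')}={m.get('value')}" for m in matches]
    detail = ", ".join(parts) if parts else "no matching series"
    return f"{state.upper()} {rule}: {detail}"


if __name__ == "__main__":
    # Invented example resembling a CPU load alert on the host agent1.
    example = {
        "state": "alerting",
        "ruleName": "CPU load",
        "evalMatches": [{"metric": "agent1", "value": 0.35, "tags": {}}],
    }
    print(summarize_alert(example))
```

A service such as OneAlert performs this kind of processing on its side; the sketch only illustrates the shape of the data.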

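Adding the Notification Channel in Grafana

Now you can set up an alert to test it (here we use the CPU load monitoring added earlier as the test).

After saving, you can test it. If the CPU load on agent1 has not reached 0.3 yet, you can try a threshold of 0.1, or run some programs to increase the load on agent1.

Final email alert result: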


Reference links: https://stevelu.blog.csdn.net/article/details/126080391 https://blog.csdn.net/weixin_67470255/article/details/126329953

