Building an Enterprise-Grade Monitoring Platform (Part 23): Configuring Prometheus to Monitor Common Services in Practice

2023-10-31 19:48:21

URL monitoring is handled by the blackbox-exporter component, which is deployed on 192.168.0.39.

Create the component configuration file

vim  /data/prometheus_dir/blackbox_exporter/blackbox.yml
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp

Start the component container

# A container takes one restart policy, so pass a single --restart flag.
docker run -d \
  -p 9300:9115 \
  --name blackbox_exporter \
  --restart=always \
  -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro \
  -v /data/prometheus_dir/blackbox_exporter/blackbox.yml:/config/blackbox.yml \
  prom/blackbox-exporter:master \
  --config.file=/config/blackbox.yml

Integrate the blackbox component with Prometheus

Add the following configuration to prometheus.yml:

# HTTP probe configuration
  - job_name: 'blackbox'
    scrape_interval: 10s
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://www.baidu.com
        - https://www.aliyun.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.0.39:9300

Restart Prometheus.
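
To make the effect of the relabeling rules concrete, here is a minimal Python sketch that rebuilds the URL Prometheus ends up scraping for each target. The helper name `probe_url` is hypothetical and exists only for illustration; the exporter address, metrics path, module, and targets are copied from the job above.

```python
from urllib.parse import urlencode

# Values copied from the scrape job above.
EXPORTER = "192.168.0.39:9300"   # final __address__ after relabeling
METRICS_PATH = "/probe"          # metrics_path
MODULE = "http_2xx"              # params.module


def probe_url(target: str) -> str:
    """Mimic the relabeling: the original target becomes the ?target= parameter
    (__param_target), while the request itself goes to the blackbox exporter."""
    query = urlencode({"module": MODULE, "target": target})
    return f"http://{EXPORTER}{METRICS_PATH}?{query}"


for t in ["https://www.baidu.com", "https://www.aliyun.com"]:
    print(probe_url(t))
```

Requesting one of the printed URLs with curl from the Prometheus host is a quick way to confirm that the exporter is reachable and the module name is valid.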

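For more articles in this enterprise monitoring platform series, see: Building an Enterprise-Grade Monitoring Platform. The series is updated continuously.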

Add alerting rules

http_export-alert-rules.yaml

groups:
    - name: URL probe monitoring alerts
      rules:
      - alert: HTTP status code check
        expr: probe_http_status_code{job="blackbox"} != 200
        for: 0m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "URL returned a non-200 status code"
          description: "Request to {{$.Labels.instance}} returned a non-200 status code"

      - alert: Certificate expiry check
        expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
        for: 5m
        labels:
          severity: warning
          status: warning
        annotations:
          summary: "Certificate expires in less than 30 days"
          description: "The certificate for {{$.Labels.instance}} expires in less than 30 days; please renew it in time"

      - alert: Page response time check
        expr: probe_duration_seconds{job="blackbox"} >= 1
        for: 1m
        labels:
          severity: warning
          status: warning
        annotations:
          summary: "Page response time exceeds 1 second"
          description: "Page response time for {{$.Labels.instance}} exceeds 1 second"

Restart Prometheus for the rules to take effect, then add the Grafana dashboard https://grafana.com/grafana/dashboards/7587.
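
The certificate rule compares "expiry timestamp minus current time" against 30 days expressed in seconds. The Python sketch below reproduces that arithmetic. The function names and the sample expiry dates are hypothetical; only the 86400 * 30 threshold comes from the rule.

```python
import time

THRESHOLD_SECONDS = 86400 * 30  # 30 days, the same constant as in the rule


def seconds_left(expiry_unix: float, now: float) -> float:
    """Equivalent of: probe_ssl_earliest_cert_expiry - time()"""
    return expiry_unix - now


def should_alert(expiry_unix: float, now: float) -> bool:
    """True when fewer than 30 days remain before the certificate expires."""
    return seconds_left(expiry_unix, now) < THRESHOLD_SECONDS


now = time.time()
for days in (10, 90):  # hypothetical certificates expiring in 10 and 90 days
    expiry = now + days * 86400
    print(days, "days left -> alert:", should_alert(expiry, now))
```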

Monitoring POST requests

POST requests differ in their headers and body, so monitoring one requires a custom module built around that request's headers and body. For example:

url:https://aaa.bbb.com/api
headers:
        userid:1111111
body:
        {"templateKey":"AD_MA","ext":{"skuId":"-1"}}

Use a tool such as Postman to send a test request and check that the API returns its normal content.
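
If Postman is not at hand, the same test request can be sketched with the Python standard library. This is only an illustration: the URL, the userid header, and the body are the placeholder values from the example above, the Content-Type header is an assumption (a JSON body), and `build_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

URL = "https://aaa.bbb.com/api"  # placeholder URL from the example above
HEADERS = {
    "Content-Type": "application/json",  # assumption: the API expects a JSON body
    "userid": "1111111",
}
BODY = {"templateKey": "AD_MA", "ext": {"skuId": "-1"}}


def build_request() -> urllib.request.Request:
    """Build the POST request without sending it."""
    data = json.dumps(BODY).encode("utf-8")
    return urllib.request.Request(URL, data=data, headers=HEADERS, method="POST")


def send(timeout: float = 30.0):
    """Send the request and return (status_code, response_text)."""
    with urllib.request.urlopen(build_request(), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


req = build_request()
print(req.get_method(), req.full_url)
```

Calling `send()` against the real endpoint should return the same content that Postman shows.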

Once the response looks correct, configure the module.

vim /data/prometheus_dir/blackbox_exporter/blackbox.yml

  POST_api:  # module name; add it under the existing modules: key
    prober: http
    timeout: 30s
    http:
      method: POST
      headers:
        Content-Type: application/json
        userid: "111111"
      # body must be a YAML string, so the JSON is quoted
      body: '{"templateKey":"AD_MA","ext":{"skuId":"-1"}}'

Edit prometheus.yml to add the probe to Prometheus:

  - job_name: 'blackbox_POST_api'
    scrape_interval: 20s
    metrics_path: /probe
    params:
      module: [POST_api]  # must match the module name
    static_configs:
      - targets:
        - https://aaa.bbb.com/api
        labels:
          url_name: "POST xxxxapi"  # custom label that can be shown in alerts for easier reading
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.0.39:9300

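After Prometheus restarts, the newly added job appears in the targets list. The names shown in my environment are the real ones, so they do not match this example; the mismatch can be ignored.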

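Alerting

Below are two alerting rules. Some URL endpoints are simply slow without affecting the business, so alerting on everything above 1 second causes frequent alerts at certain times. By combining the =, =~ and != label matchers, one specific job can be given its own rule that fires only at 1.5 seconds or more, while the general rule excludes that job.
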
- alert: Page response time check
  expr: probe_duration_seconds{job="blackbox_POST_choiceList"} >= 1.5
  for: 1m
  labels:
    severity: warning
    status: warning
  annotations:
    summary: "Page response time for {{$.Labels.instance}} exceeds 1.5 seconds"
    description: "Service: {{$.Labels.url_name}}---response time >= 1.5s (current: {{$value}})"

- alert: Page response time check
  expr: probe_duration_seconds{job=~"blackbox.*",job!="blackbox_POST_choiceList"} >= 1
  for: 1m
  labels:
    severity: warning
    status: warning
  annotations:
    summary: "Page response time for {{$.Labels.instance}} exceeds 1 second"
    description: "Service: {{$.Labels.url_name}}---response time >= 1s (current: {{$value}})"
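
How the two selectors split the jobs can be sketched in Python. Prometheus regex matchers are fully anchored, which `re.fullmatch` reproduces here. The function name `threshold_for` and the job names in the loop are illustrative assumptions; the matchers and thresholds are taken from the two rules.

```python
import re

SLOW_JOB = "blackbox_POST_choiceList"


def threshold_for(job: str):
    """Return the response-time threshold (seconds) of the rule that selects
    this job, or None when neither rule selects it."""
    # Rule 1: job="blackbox_POST_choiceList"
    if job == SLOW_JOB:
        return 1.5
    # Rule 2: job=~"blackbox.*", job!="blackbox_POST_choiceList"
    if re.fullmatch(r"blackbox.*", job) and job != SLOW_JOB:
        return 1.0
    return None


for job in ["blackbox", "blackbox_POST_api", SLOW_JOB, "nginx-qalb-10"]:
    print(job, "->", threshold_for(job))
```

Because the second selector excludes the slow job explicitly, each blackbox job is covered by exactly one of the two rules.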

[Screenshot: alert message]

Configuring Prometheus to monitor nginx

GitHub: https://github.com/nginxinc/nginx-prometheus-exporter

Environment
  • nginx-prometheus-exporter is deployed on 192.168.0.39
  • nginx server: 172.30.0.10
  • Monitoring is done through the nginx-prometheus-exporter component

nginx must be built with the with-http_stub_status_module module; newer nginx versions usually include it. Check with the following command:

# nginx -V 2>&1 | grep -o with-http_stub_status_module
with-http_stub_status_module

Add the nginx configuration

Create a server block that listens on port 38888, enables the status page, and allows access only from 192.168.0.39.

server {
    listen 38888;
    location /nginx_status {
        stub_status on;
        allow 192.168.0.39;  # only allow requests from the exporter host
        deny all;            # deny all other hosts
    }
}

Test it from 192.168.0.39; 172.30.0.10 is the nginx host configured above.

# curl http://172.30.0.10:38888/nginx_status
 
Active connections: 12189 
server accepts handled requests
 195544839 195544839 1147258694 
Reading: 0 Writing: 63 Waiting: 12018
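
The exporter turns this text into metrics. As a rough illustration of what it reads, the Python sketch below parses the sample output above into named fields. The function name `parse_stub_status` and the dictionary keys are hypothetical and are not the exporter's actual code or metric names.

```python
import re

# Sample output copied from the curl call above.
SAMPLE = """Active connections: 12189 
server accepts handled requests
 195544839 195544839 1147258694 
Reading: 0 Writing: 63 Waiting: 12018
"""


def parse_stub_status(text: str) -> dict:
    """Parse the four-line stub_status page into a dict of integers."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    active = int(lines[0].split(":")[1])
    accepts, handled, requests = (int(x) for x in lines[2].split())
    m = re.fullmatch(r"Reading: (\d+) Writing: (\d+) Waiting: (\d+)", lines[3])
    reading, writing, waiting = (int(x) for x in m.groups())
    return {
        "active": active,
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


status = parse_stub_status(SAMPLE)
print(status)
# Connections accepted but not handled, usually caused by resource limits
print("dropped:", status["accepts"] - status["handled"])
```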

Create and start the exporter

The monitored host must be specified. I run several exporter instances; the host port 9117 is an arbitrary choice.

docker run -d \
  --name nginx_exporter_qalb_10 \
  -m 1g \
  --restart=always \
  -p 9117:9113 \
  nginx/nginx-prometheus-exporter:0.10.0 \
  -nginx.scrape-uri http://172.30.0.10:38888/nginx_status

Integrate nginx-prometheus-exporter with Prometheus

Add the following configuration to prometheus.yml:

# nginx
  - job_name: nginx-qalb-10
    static_configs:
      - targets: ['192.168.0.39:9117']
        labels:
          instance: nginx-qalb-10

# For additional exporter instances, add another job; make sure the ports do not conflict
  - job_name: nginx-qalb-11
    static_configs:
      - targets: ['192.168.0.39:9118']
        labels:
          instance: nginx-qalb-11

Restart the Prometheus container.

Add alerting rules

nginx_export-alert-rules.yaml

groups:
    - name: nginx status monitoring alerts
      rules:
      - alert: nginx status
        expr: nginx_up == 0
        for: 1s
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "{{$labels.instance}}: nginx service stopped"
          description: "nginx service is down"

Restart the Prometheus container for the rules to take effect.

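Configuring Prometheus to monitor SSL requests

Why monitor SSL request time

The company's services run in a cloud environment. nginx sits behind the cloud provider's load balancer, but the HTTPS certificate is not hosted on the load balancer. Instead, the certificate is installed on every nginx server behind it, and nginx has no SSL-related tuning. As a result, once concurrency reaches a certain level, SSL requests to one of the nginx servers can become very slow.

Each time this happened, troubleshooting meant binding hosts entries and running curl against every node to find the faulty one. To be alerted quickly about which server has a slow SSL handshake, I use monitoring to probe and alert.

At first I used three servers, each with a hosts entry bound to one of three nginx servers, and wrote a Python script to send alerts. That achieved the goal but was very inconvenient, so I switched to binding hosts inside Docker containers and letting Prometheus scrape the results.

  • Domain: https://www.aaa.com
  • nginx servers: 192.168.100.1, 192.168.100.2
  • Environment: Docker, Python

Install the modules
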
pip install prometheus_client
pip install flask

Probe script:

# cat nginx-ssl-check.py

import os
import re
import prometheus_client
from prometheus_client import Gauge
from flask import Response, Flask

app = Flask(__name__)

# Gauge holding the most recent measurement, in seconds
SSL = Gauge('SSL_handshake', 'SSL_handshake')

@app.route("/metrics")
def ssl_handshake():
    # time_appconnect is the time from the start of the request until the TLS handshake completed
    num = os.popen('curl -w "TCP handshake: %{time_connect}, SSL handshake: %{time_appconnect}\\n" -so /dev/null https://www.aaa.com/').read()
    # Extract the number that follows "SSL handshake: "
    SSL_handshake = re.findall(r"SSL handshake: (.+)", num)
    f_SSL = float(SSL_handshake[0])
    SSL.set(f_SSL)
    return Response(prometheus_client.generate_latest(SSL), mimetype="text/plain")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

Once the script is running, request:

# curl http://localhost:8000/metrics

The result looks like this:

# HELP SSL_handshake SSL_handshake
# TYPE SSL_handshake gauge
SSL_handshake 0.124363

The SSL_handshake value is the SSL request time in seconds.
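
For a quick sanity check outside Prometheus, the output above can be read with a few lines of Python. This sketch only handles label-free `name value` lines like the one shown; it is not a full parser for the exposition format, and `parse_simple_metrics` is a hypothetical helper name.

```python
# Sample output copied from the curl call above.
SAMPLE = """\
# HELP SSL_handshake SSL_handshake
# TYPE SSL_handshake gauge
SSL_handshake 0.124363
"""


def parse_simple_metrics(text: str) -> dict:
    """Parse label-free 'name value' samples, skipping comments and blank lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.split()[:2]
        metrics[name] = float(value)
    return metrics


metrics = parse_simple_metrics(SAMPLE)
print(metrics["SSL_handshake"])
```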

Run it with Docker

First build an image with the modules installed:

# cat Dockerfile 
FROM python:3.9.13
 
RUN /usr/local/bin/python -m pip install --upgrade pip
RUN pip3 install prometheus_client
RUN pip3 install flask
 
CMD python3 /data/nginx-ssl-check.py

Build the image:

docker build -t promehtues_flask_py:v1 .

Place the Python script above in a directory on the server so that multiple containers can share the same script. Script path:

/data/prometheus_dir/nginx_ssl_check/nginx-ssl-check.py

Start the containers:

docker run -d \
  -p 8000:8000 \
  --name nginx-ssl-check-192.168.100.1 \
  --restart=always \
  --add-host www.aaa.com:192.168.100.1 \
  -v /data/prometheus_dir/nginx_ssl_check/nginx-ssl-check.py:/data/nginx-ssl-check.py \
  promehtues_flask_py:v1

docker run -d \
  -p 8001:8000 \
  --name nginx-ssl-check-192.168.100.2 \
  --restart=always \
  --add-host www.aaa.com:192.168.100.2 \
  -v /data/prometheus_dir/nginx_ssl_check/nginx-ssl-check.py:/data/nginx-ssl-check.py \
  promehtues_flask_py:v1

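Note that each container needs its own hosts binding so that it probes the SSL handshake of one specific host instead of going through the load balancer. Through the load balancer there is no way to tell which nginx server is responding slowly.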
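
With more than two nginx nodes, writing each docker run command by hand gets repetitive. A minimal Python sketch that generates one command per backend, assuming the same image name, script path, and flags as the commands above, with host ports counting up from 8000 (the helper name `docker_run_args` is hypothetical):

```python
BACKENDS = ["192.168.100.1", "192.168.100.2"]  # nginx nodes from the list above
DOMAIN = "www.aaa.com"
BASE_PORT = 8000                                # first host port; later nodes get +1, +2, ...
IMAGE = "promehtues_flask_py:v1"                # image name exactly as built above
SCRIPT = "/data/prometheus_dir/nginx_ssl_check/nginx-ssl-check.py"


def docker_run_args(ip: str, host_port: int) -> list:
    """Return the docker run argument list for one backend node."""
    return [
        "docker", "run", "-d",
        "-p", f"{host_port}:8000",
        "--name", f"nginx-ssl-check-{ip}",
        "--restart=always",
        "--add-host", f"{DOMAIN}:{ip}",  # pin the domain to this node only
        "-v", f"{SCRIPT}:/data/nginx-ssl-check.py",
        IMAGE,
    ]


for offset, ip in enumerate(BACKENDS):
    print(" ".join(docker_run_args(ip, BASE_PORT + offset)))
```

The printed commands can be reviewed and pasted into a shell, or the argument list can be passed to `subprocess.run` directly.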

Test it:

  • curl http://localhost:8000/metrics
  • curl http://localhost:8001/metrics

Integrate with Prometheus

# nginx SSL handshake time check
  - job_name: nginx_ssl_check-192.168.100.1
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.100.200:8000']
        labels:
          instance: nginx-ssl-check-192.168.100.1
 
  - job_name: nginx_ssl_check-192.168.100.2
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.100.200:8001']
        labels:
          instance: nginx-ssl-check-192.168.100.2

Restart Prometheus

Alerting

# cat rules/nginx_ssl_check-rules.yaml 
groups:
    - name: nginx SSL request monitoring alerts
      rules:
      - alert: SSL request alert
        expr: SSL_handshake > 3
        for: 0m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "SSL request: {{$.Labels.instance}} took more than 3 seconds"
          description: "SSL request: {{$.Labels.instance}}---took more than 3 seconds (current: {{$value}})"

Configuring Prometheus to monitor Kafka

  • Exporter: kafka-exporter
  • GitHub repository: imduffy15/kafka_exporter: Kafka exporter for Prometheus

Start:

docker run -d \
  --restart=always \
  --name kafka_exporter \
  -v /etc/localtime:/etc/localtime \
  -p 9308:9308 \
  danielqsj/kafka-exporter:v1.2.0 \
  --kafka.server=172.30.0.11:9092

Integrate kafka_exporter with Prometheus

vim prometheus.yml

# Kafka monitoring
- job_name: 'kafka-172.30.0.11'
  scrape_interval: 10s
  static_configs:
    - targets: ['192.168.0.39:9308']
      labels:
        kafka_ip: 'kafka-172.30.0.11'

Restart the Prometheus container for the change to take effect.

Grafana dashboard ID: 7589 https://grafana.com/grafana/dashboards/7589

Alerting rules

# cat rules/kafka-export-alert-rules.yaml 
 
 
groups:
    - name: Kafka consumer lag alerts
      rules:
      - alert: Kafka consumer lag
        expr: sum(kafka_consumergroup_lag{topic!="sop_free_study_fix-student_wechat_detail"}) by (consumergroup, topic) > 1000
        for: 3m
        labels:
          severity: warning
          status: severe
        annotations:
          summary: "Kafka consumer lag"
          description: "{{$.Labels.consumergroup}}##{{$.Labels.topic}}: consumer lag above 1000 for 3 minutes (current: {{$value}})"

      # Note: when the exporter itself is unreachable this series is absent,
      # so up{job="kafka-172.30.0.11"} == 0 is worth checking as well.
      - alert: kafka-exporter down
        expr: kafka_exporter_build_info < 1
        for: 3m
        labels:
          severity: warning
          status: severe
        annotations:
          summary: "kafka-exporter down"
          description: "kafka-exporter down {{$.Labels.instance}}"

      - alert: kafka server down
        expr: kafka_brokers < 1
        for: 3m
        labels:
          severity: warning
          status: severe
        annotations:
          summary: "kafka server down"
          description: "kafka server down {{$.Labels.job}}"
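
The lag rule first sums the per-partition lag for each (consumergroup, topic) pair and only then compares the total with 1000. A small Python sketch of that aggregation follows. The sample rows are invented for illustration; the excluded topic and the threshold are taken from the rule.

```python
from collections import defaultdict

EXCLUDED_TOPIC = "sop_free_study_fix-student_wechat_detail"  # topic!="..." in the rule
THRESHOLD = 1000

# Hypothetical per-partition samples: (consumergroup, topic, partition, lag)
samples = [
    ("orders", "order_events", 0, 400),
    ("orders", "order_events", 1, 700),
    ("billing", "invoices", 0, 50),
    ("sop", EXCLUDED_TOPIC, 0, 99999),
]


def lagging_groups(rows):
    """Sum lag by (consumergroup, topic), skip the excluded topic,
    and keep only the totals above the threshold."""
    totals = defaultdict(int)
    for group, topic, _partition, lag in rows:
        if topic == EXCLUDED_TOPIC:
            continue
        totals[(group, topic)] += lag
    return {key: total for key, total in totals.items() if total > THRESHOLD}


print(lagging_groups(samples))
```

In this made-up data neither partition of order_events exceeds 1000 on its own, but the summed lag of 1100 does, which is why the rule aggregates before comparing.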

Configuring Prometheus to monitor MySQL

  • Exporter: mysqld-exporter
  • GitHub repository: prometheus/mysqld_exporter: Exporter for MySQL server metrics
  • mysqld-exporter is deployed on 192.168.0.39
  • The monitored MySQL server is on 192.168.0.10

Create a user with the required privileges on the monitored database

CREATE USER 'mysql_exporter'@'192.168.0.39' IDENTIFIED BY '111111';
 
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'mysql_exporter'@'192.168.0.39';
 
flush privileges;

Start mysqld-exporter

docker run -d \
  --name mysql-192.168.0.10 \
  -p 9510:9104 \
  --restart=always \
  -e DATA_SOURCE_NAME="mysql_exporter:111111@(192.168.0.10:3306)/" \
  prom/mysqld-exporter

Integrate mysqld-exporter with Prometheus

# mysqld_exporter
  - job_name: mysql-192.168.0.10
    static_configs:
    - targets: ['192.168.0.39:9510']

Grafana dashboard ID: 7362

Alerting configuration

Only the replication (master/slave) rules are included here.

# cat rules/mysql_export-alert-rules.yaml 
groups:
    - name: MySQL replication monitoring alerts
      rules:
      - alert: MySQL replication Slave_IO alert
        expr: mysql_slave_status_slave_io_running == 0
        for: 1s
        labels:
          severity: warning
          status: critical
        annotations:
          description: "{{$labels.job}}: MySQL replication Slave_IO thread stopped"
          summary: "MySQL replication Slave_IO thread stopped"

      - alert: MySQL replication Slave_SQL alert
        expr: mysql_slave_status_slave_sql_running == 0
        for: 1s
        labels:
          severity: warning
          status: critical
        annotations:
          description: "{{$labels.job}}: MySQL replication Slave_SQL thread stopped"
          summary: "MySQL replication Slave_SQL thread stopped"

      - alert: MySQL replication lag alert
        expr: mysql_slave_status_seconds_behind_master > 60
        for: 3m
        labels:
          severity: warning
          status: critical
        annotations:
          description: "{{$labels.job}}: MySQL replication lag > 60s (current: {{$value}})"
          summary: "MySQL replication lag above 60s"

Configuring Prometheus to monitor Elasticsearch

  • Exporter: elasticsearch-exporter
  • GitHub repository: prometheus-community/elasticsearch_exporter: Elasticsearch stats exporter for Prometheus
  • Monitored targets: the cluster made up of 192.168.0.100, 192.168.0.101, and 192.168.0.102
  • The exporter is deployed on 192.168.0.39

Start with docker-compose

cat /data/docker-compose_dir/elastic/docker-compose.yml

version: '3'
services:
  elasticsearch_exporter-192.168.0.100:
    image: quay.io/prometheuscommunity/elasticsearch-exporter:latest
    command:
     - '--es.uri=http://192.168.0.100:9200'
    restart: always
    ports:
    - "0.0.0.0:9600:9114"
 
  elasticsearch_exporter-192.168.0.101:
    image: quay.io/prometheuscommunity/elasticsearch-exporter:latest
    command:
     - '--es.uri=http://192.168.0.101:9200'
    restart: always
    ports:
    - "0.0.0.0:9601:9114"
 
  elasticsearch_exporter-192.168.0.102:
    image: quay.io/prometheuscommunity/elasticsearch-exporter:latest
    command:
     - '--es.uri=http://192.168.0.102:9200'
    restart: always
    ports:
    - "0.0.0.0:9602:9114"

Start

docker-compose up -d

Integrate the exporter with Prometheus

# elasticsearch_exporter
  - job_name: elastic-192.168.0.100
    scrape_interval: 15s
    static_configs:
    - targets: ['192.168.0.39:9600']
 
  - job_name: elastic-192.168.0.101
    scrape_interval: 15s
    static_configs:
    - targets: ['192.168.0.39:9601']
 
  - job_name: elastic-192.168.0.102
    scrape_interval: 15s
    static_configs:
    - targets: ['192.168.0.39:9602']

Alerting configuration

# cat elastic-rules.yaml

groups:
    - name: Elasticsearch monitoring alerts
      rules:
      - alert: Cluster node count decreased
        # the cluster in this example has 3 nodes
        expr: elasticsearch_cluster_health_number_of_nodes < 3
        for: 5m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "ES cluster node count decreased: {{$.Labels.job}}"
          description: "ES cluster node count decreased: {{$.Labels.job}} (current: {{$value}})"

      - alert: JVM memory usage alert
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.9
        for: 5m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "JVM memory usage too high: {{$.Labels.job}}"
          description: "JVM memory usage too high: {{$.Labels.job}} > 0.9 (current: {{$value}})"

Restart Prometheus

docker restart prometheus

Grafana dashboard: 2322. It needs some minor adjustments.

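Configuring Prometheus to monitor Java services

Have the developers integrate Micrometer into the Spring Boot project. Once that is done, try requesting the endpoint:

# curl http://ip:port/actuator/prometheus

If a large amount of metric data comes back, it is working. Otherwise the steps below cannot proceed.

Integrate with Prometheus
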
# java
  - job_name: java
    scrape_interval: 10s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['172.30.0.8:8986',
                  '172.30.0.11:8986',
                  '172.30.0.21:8986']
        labels:
          service_name: 'aaa'
 
      - targets: ['172.30.0.18:18278',
                  '172.30.0.25:18278',
                  '172.30.0.36:18278']
        labels:
          service_name: 'bbb'

Grafana dashboard IDs: 6756 and 12856. 12856 is the main dashboard, with a few panels from 6756 added in.

Alerting rules

# cat /data/prometheus_dir/rules/java-rules.yaml 
groups:
    - name: Java service monitoring alerts
      rules:
      - alert: Java service down
        expr: up{job="java"} == 0
        for: 1m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "Service down: {{$.Labels.service_name}}--{{$.Labels.instance}}"
          description: "Service down: {{$.Labels.service_name}}--{{$.Labels.instance}} (current: {{$value}})"

      - alert: Java API latency
        expr: irate(http_server_requests_seconds_sum{job="java",exception="None",uri!~".*actuator.*"}[1m]) / irate(http_server_requests_seconds_count{job="java",exception="None",uri!~".*actuator.*"}[1m]) > 3
        for: 1m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "API latency: {{$.Labels.job}}"
          description: "API latency: {{$.Labels.service_name}}--{{$.Labels.instance}} > 3s (current: {{$value}})"

      - alert: Java API status code
        expr: http_server_requests_seconds_count{job="java",uri!="/**",status!='200'}
        for: 1m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "Abnormal API status code: {{$.Labels.service_name}}--{{$.Labels.instance}}"
          description: "Abnormal API status code: {{$.Labels.service_name}}--{{$.Labels.instance}}--{{$.Labels.method}}--{{$.Labels.uri}} (current: {{$.Labels.status}})"

      - alert: Java GC count
        expr: irate(jvm_gc_pause_seconds_count{job="java"}[1m]) > 5
        labels:
          severity: warning
          status: alert
        annotations:
          summary: "GC count alert: {{$.Labels.service_name}}--{{$.Labels.instance}}"
          description: "GC rate over the last minute: {{$.Labels.service_name}}--{{$.Labels.instance}}--{{$.Labels.cause}} > 5 (current: {{$value}})"

      - alert: Java error log
        expr: irate(logback_events_total{level="error"}[1m]) > 50
        labels:
          severity: warning
          status: alert
        annotations:
          summary: "Error log alert: {{$.Labels.service_name}}--{{$.Labels.instance}}"
          description: "Too many error log entries over the last minute: {{$.Labels.service_name}}--{{$.Labels.instance}} > 50 (current: {{$value}})"
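
The latency rule divides the rate of total request time by the rate of request count, which yields the average duration per request over the most recent interval. The Python sketch below works through that arithmetic with invented sample values. The `irate` helper is a simplification that ignores counter resets, and both function names are hypothetical.

```python
def irate(prev, last):
    """Per-second rate from the last two (timestamp, value) samples,
    similar to PromQL irate() but without counter-reset handling."""
    (t1, v1), (t2, v2) = prev, last
    return (v2 - v1) / (t2 - t1)


def avg_latency(sum_samples, count_samples):
    """irate(..._seconds_sum) / irate(..._seconds_count): average seconds per request."""
    request_rate = irate(*count_samples)
    if request_rate == 0:
        # PromQL would produce NaN here (0/0), which never satisfies "> 3"
        return None
    return irate(*sum_samples) / request_rate


# Hypothetical samples taken 10 seconds apart
sum_samples = ((0, 120.0), (10, 190.0))   # 70 s of cumulative request time
count_samples = ((0, 400), (10, 420))     # 20 requests

latency = avg_latency(sum_samples, count_samples)
print(latency, "-> alert:", latency is not None and latency > 3)
```

In this example 70 seconds of request time spread over 20 requests gives an average of 3.5 seconds, which is above the 3-second threshold.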

That wraps up today's walkthrough of configuring Prometheus to monitor services commonly found in real environments, such as URLs, SSL requests, Nginx, MySQL, Kafka, Elasticsearch, and Java.

Reference: https://blog.csdn.net/weixin_38367535/category_11425243.html
