URL monitoring is done with the blackbox-exporter component, which is deployed on 192.168.0.39.
Create the component configuration file
vim /data/prometheus_dir/blackbox_exporter/blackbox.yml
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
Start the component container
docker run -d \
  -p 9300:9115 \
  --name blackbox_exporter \
  --restart=on-failure:5 \
  -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro \
  -v /data/prometheus_dir/blackbox_exporter/blackbox.yml:/config/blackbox.yml \
  prom/blackbox-exporter:master \
  --config.file=/config/blackbox.yml
Integrate blackbox-exporter with Prometheus
Add the following to prometheus.yml:
  # HTTP probe configuration
  - job_name: 'blackbox'
    scrape_interval: 10s
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.baidu.com
          - https://www.aliyun.com
    relabel_configs:
      # pass the original target URL to the exporter as the target parameter
      - source_labels: [__address__]
        target_label: __param_target
      # keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # send the actual scrape to the blackbox exporter
      - target_label: __address__
        replacement: 192.168.0.39:9300
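The relabeling above can be hard to picture, so here is a minimal Python sketch of the URL that Prometheus ends up requesting for each target. The helper name `probe_url` is hypothetical; the address, path, and module are the values from the configuration above.

```python
from urllib.parse import urlencode

EXPORTER = "192.168.0.39:9300"  # value written into __address__ by the last relabel rule
METRICS_PATH = "/probe"         # metrics_path of the job
MODULE = "http_2xx"             # params.module of the job


def probe_url(target: str) -> str:
    """Return the URL scraped for one target (hypothetical helper)."""
    # __param_target becomes the "target" query parameter,
    # params.module becomes the "module" query parameter.
    query = urlencode({"module": MODULE, "target": target})
    return f"http://{EXPORTER}{METRICS_PATH}?{query}"


for t in ("https://www.baidu.com", "https://www.aliyun.com"):
    print(probe_url(t))
```

Requesting one of the printed URLs with curl from a host that can reach the exporter is a quick way to check a module before wiring it into Prometheus.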
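Restart Prometheus
For more articles in the enterprise monitoring platform series, see "Building an Enterprise-Grade Monitoring Platform"; the series is updated continuously.
Add alert rules
http_export-alert-rules.yaml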
groups:
  - name: URL status alerts
    rules:
      - alert: HTTP status code check
        expr: probe_http_status_code{job="blackbox"} != 200
        for: 0m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "URL returned a non-200 status code"
          description: "Request to {{$.Labels.instance}} returned a non-200 status code"
      - alert: Certificate expiry check
        expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
        for: 5m
        labels:
          severity: warning
          status: warning
        annotations:
          summary: "Certificate expires in less than 30 days"
          description: "The certificate for {{$.Labels.instance}} expires in less than 30 days; replace it soon"
      - alert: Page response time check
        expr: probe_duration_seconds{job="blackbox"} >= 1
        for: 1m
        labels:
          severity: warning
          status: warning
        annotations:
          summary: "Page response time is 1 second or more"
          description: "Page response time for {{$.Labels.instance}} is 1 second or more"
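The certificate rule compares a Unix timestamp with the current time. As a rough illustration of that arithmetic, here is a small Python sketch; the function names and the sample timestamps are made up for the example.

```python
THIRTY_DAYS = 86400 * 30  # same threshold as the rule, in seconds


def seconds_left(expiry_ts: float, now: float) -> float:
    """Seconds until the earliest certificate in the chain expires."""
    return expiry_ts - now


def should_alert(expiry_ts: float, now: float) -> bool:
    """Mirror of: probe_ssl_earliest_cert_expiry - time() < 86400 * 30."""
    return seconds_left(expiry_ts, now) < THIRTY_DAYS


now = 1_700_000_000  # arbitrary "current time" for the example
print(should_alert(now + 29 * 86400, now))  # certificate with 29 days left
print(should_alert(now + 31 * 86400, now))  # certificate with 31 days left
```

Dividing `seconds_left` by 86400 gives the remaining days, which can be handy when choosing a different threshold.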
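Restart Prometheus for the rules to take effect, then add the Grafana dashboard: https://grafana.com/grafana/dashboards/7587.
Monitoring POST requests
POST requests differ in their headers and body, so monitoring one requires a custom module built from the headers and body of that request. For example:
url: https://aaa.bbb.com/api
headers:
  userid: 1111111
body:
  {"templateKey":"AD_MA","ext":{"skuId":"-1"}}
First send the request with a tool such as Postman to confirm that the API returns its normal content.
Once the response looks correct, the module can be configured.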
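The same check can also be scripted instead of using Postman. The following is only a sketch using the Python standard library: the URL, `userid`, and body are the placeholders from the example above, and the `Content-Type: application/json` header is an assumption about what the API expects.

```python
import json
import urllib.request

URL = "https://aaa.bbb.com/api"  # placeholder URL from the example
HEADERS = {"Content-Type": "application/json", "userid": "1111111"}  # Content-Type is assumed
BODY = {"templateKey": "AD_MA", "ext": {"skuId": "-1"}}


def build_request() -> urllib.request.Request:
    """Build the POST request without sending it."""
    data = json.dumps(BODY).encode("utf-8")
    return urllib.request.Request(URL, data=data, headers=HEADERS, method="POST")


def send(req: urllib.request.Request, timeout: float = 30):
    """Send the request and return (status code, response text)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


req = build_request()
print(req.get_method(), req.full_url)
# To actually send it against a real endpoint: status, text = send(req)
```

A 200 status with the expected JSON in the response means the same headers and body should work in a probe module.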
vim /data/prometheus_dir/blackbox_exporter/blackbox.yml
  POST_api: # module name; add it under the existing modules: key
    prober: http
    timeout: 30s
    http:
      method: POST
      headers:
        Content-Type: application/json
        userid: "1111111"
      # the body must be a string, so the JSON is quoted
      body: '{"templateKey":"AD_MA","ext":{"skuId":"-1"}}'
Edit prometheus.yml to integrate the probe into Prometheus
  - job_name: 'blackbox_POST_api'
    scrape_interval: 20s
    metrics_path: /probe
    params:
      module: [POST_api] # must match the module name
    static_configs:
      - targets:
          - https://aaa.bbb.com/api
        labels:
          url_name: "POST xxxxapi" # custom label that can be shown directly in alerts
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.0.39:9300
After Prometheus restarts, the newly added job appears in its target list. The name shown in my environment is the real one, so it does not match the example name used here; the difference can be ignored.
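Alerts
Below are two alert rules. Some URL endpoints are simply slow without affecting the business, so a uniform 1-second threshold makes them alert frequently at certain times. With the =, =~ and != label matchers, one job can be given its own rule that alerts only at 1.5 seconds or more, while all other jobs keep the 1-second rule.
- alert: Page response time check
  expr: probe_duration_seconds{job="blackbox_POST_choiceList"} >= 1.5
  for: 1m
  labels:
    severity: warning
    status: warning
  annotations:
    summary: "{{$.Labels.instance}} page response time is 1.5 seconds or more"
    description: "Service: {{$.Labels.url_name}} --- response time >= 1.5s (current: {{$value}})"
- alert: Page response time check
  expr: probe_duration_seconds{job=~"blackbox.*",job!="blackbox_POST_choiceList"} >= 1
  for: 1m
  labels:
    severity: warning
    status: warning
  annotations:
    summary: "{{$.Labels.instance}} page response time is 1 second or more"
    description: "Service: {{$.Labels.url_name}} --- response time >= 1s (current: {{$value}})"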
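To see which rule a given job falls under, the two selectors can be mimicked in Python. This is only an illustration of the matcher logic; `threshold_for` is a hypothetical helper, and it relies on the fact that PromQL regex matchers are fully anchored, which `re.fullmatch` reproduces.

```python
import re
from typing import Optional

SLOW_JOB = "blackbox_POST_choiceList"  # the job with its own 1.5s rule


def matches_general_rule(job: str) -> bool:
    """Mirror of job=~"blackbox.*", job!="blackbox_POST_choiceList"."""
    return re.fullmatch(r"blackbox.*", job) is not None and job != SLOW_JOB


def threshold_for(job: str) -> Optional[float]:
    """Return the response-time threshold in seconds, or None if no rule applies."""
    if job == SLOW_JOB:
        return 1.5
    if matches_general_rule(job):
        return 1.0
    return None


for name in ("blackbox", "blackbox_POST_api", SLOW_JOB, "nginx-qalb-10"):
    print(name, threshold_for(name))
```

Because the second rule excludes the slow job explicitly, each job is covered by at most one of the two rules.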
Configuring Prometheus to monitor nginx
GitHub: https://github.com/nginxinc/nginx-prometheus-exporter
Environment
- nginx-prometheus-exporter is deployed on 192.168.0.39
- nginx server: 172.30.0.10
- Monitoring is done through the nginx-prometheus-exporter component
nginx must be built with the with-http_stub_status_module option; recent nginx builds usually include this module. Check with the following command:
# nginx -V 2>&1 | grep -o with-http_stub_status_module
with-http_stub_status_module
Add the nginx configuration
Create a server that listens on port 38888, enables the status page, and allows access only from 192.168.0.39.
server {
    listen 38888;
    location /nginx_status {
        stub_status on;
        allow 192.168.0.39;  # only allow requests from the exporter host
        deny all;            # deny all other hosts
    }
}
Test it (172.30.0.10 is the nginx host configured above):
# curl http://172.30.0.10:38888/nginx_status
Active connections: 12189
server accepts handled requests
 195544839 195544839 1147258694
Reading: 0 Writing: 63 Waiting: 12018
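The exporter reads this status page and turns its fields into metrics. As a rough idea of what that involves, here is a Python sketch that parses the sample output above. The function name and the dictionary keys are illustrative and are not the exporter's actual metric names.

```python
import re

SAMPLE = """Active connections: 12189
server accepts handled requests
 195544839 195544839 1147258694
Reading: 0 Writing: 63 Waiting: 12018
"""


def parse_stub_status(text: str) -> dict:
    """Parse the plain-text stub_status page into integers."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    totals = re.search(r"requests\s+(\d+)\s+(\d+)\s+(\d+)", text)
    states = re.search(r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text)
    if not (active and totals and states):
        raise ValueError("unexpected stub_status format")
    accepts, handled, requests = map(int, totals.groups())
    reading, writing, waiting = map(int, states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


status = parse_stub_status(SAMPLE)
print(status)
# accepts and handled are equal in the sample, so no connections were dropped
print("dropped:", status["accepts"] - status["handled"])
```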
Create and start the exporter
The exporter must be told which host to monitor. I run several exporters; the host port 9117 is my own choice.
docker run -d \
  --name nginx_exporter_qalb_10 \
  -m 1g \
  --restart=on-failure:5 \
  -p 9117:9113 \
  nginx/nginx-prometheus-exporter:0.10.0 \
  -nginx.scrape-uri http://172.30.0.10:38888/nginx_status
Integrate nginx-prometheus-exporter with Prometheus
Add the following to prometheus.yml:
  # nginx
  - job_name: nginx-qalb-10
    static_configs:
      - targets: ['192.168.0.39:9117']
        labels:
          instance: nginx-qalb-10
  # For additional exporters, add another job; make sure the ports do not conflict
  - job_name: nginx-qalb-11
    static_configs:
      - targets: ['192.168.0.39:9118']
        labels:
          instance: nginx-qalb-11
Restart the Prometheus container
Add alert rules
nginx_export-alert-rules.yaml
groups:
  - name: nginx status alerts
    rules:
      - alert: nginx status
        expr: nginx_up == 0
        for: 1s
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "{{$labels.instance}}: nginx service stopped"
          description: "nginx service is down"
Restart the Prometheus container for the rules to take effect.
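Configuring Prometheus to monitor SSL requests
Why monitor SSL request time
Our services run in a cloud environment. nginx sits behind the cloud provider's load balancer, but the HTTPS certificate is not hosted on the load balancer; it is installed on every backend nginx server instead, and nginx has no SSL-related tuning. As a result, once concurrency reaches a certain level, SSL requests on one particular nginx server can become very slow.
Each time this happened, we had to bind hosts entries and curl every node to find out which one had the problem. To be alerted quickly about which server has a slow SSL handshake, we probe and alert through monitoring.
At first I used three servers, each with a hosts entry bound to one of the three nginx servers, and wrote a Python script to send alerts. That achieved the goal but was very inconvenient, so I switched to binding hosts inside Docker containers and letting Prometheus collect the results.
- Domain: https://www.aaa.com
- nginx servers: 192.168.100.1 192.168.100.2
- Environment: docker, python
Install the modules
pip install prometheus_client
pip install flask
Probe script: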
# cat nginx-ssl-check.py
import os
import re
import prometheus_client
from prometheus_client import Gauge
from flask import Response, Flask

app = Flask(__name__)
# Gauge holding the most recent SSL handshake time, in seconds
SSL = Gauge('SSL_handshake', 'SSL_handshake')

@app.route("/metrics")
def ssl_handshake():
    # Run curl once and print its connect and SSL handshake timings
    num = os.popen('curl -w "TCP handshake: %{time_connect}, SSL handshake: %{time_appconnect}\n" -so /dev/null https://www.aaa.com/').read()
    # Extract the number that follows "SSL handshake: "
    SSL_handshake = re.findall(r"SSL handshake: (.+)", num)
    f_SSL = float(SSL_handshake[0])
    SSL.set(f_SSL)
    return Response(prometheus_client.generate_latest(SSL), mimetype="text/plain")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
After starting the script, request the endpoint:
# curl http://localhost:8000/metrics
The result looks like this:
# HELP SSL_handshake SSL_handshake
# TYPE SSL_handshake gauge
SSL_handshake 0.124363
The value on the last line is the SSL request time in seconds.
Running with Docker
First build an image with the modules installed
# cat Dockerfile
FROM python:3.9.13
RUN /usr/local/bin/python -m pip install --upgrade pip
RUN pip3 install prometheus_client
RUN pip3 install flask
CMD python3 /data/nginx-ssl-check.py
Build the image
docker build -t promehtues_flask_py:v1 .
Put the Python script above in a directory on the server so that several containers can share the same script. Script path:
/data/prometheus_dir/nginx_ssl_check/nginx-ssl-check.py
Start the containers:
docker run -d \
  -p 8000:8000 \
  --name nginx-ssl-check-192.168.100.1 \
  --restart=on-failure:5 \
  --add-host www.aaa.com:192.168.100.1 \
  -v /data/prometheus_dir/nginx_ssl_check/nginx-ssl-check.py:/data/nginx-ssl-check.py \
  promehtues_flask_py:v1
docker run -d \
  -p 8001:8000 \
  --name nginx-ssl-check-192.168.100.2 \
  --restart=on-failure:5 \
  --add-host www.aaa.com:192.168.100.2 \
  -v /data/prometheus_dir/nginx_ssl_check/nginx-ssl-check.py:/data/nginx-ssl-check.py \
  promehtues_flask_py:v1
Note that each container needs its own hosts binding so that the probe hits the SSL endpoint of one specific host instead of going through the load balancer; through the load balancer there is no way to tell which nginx server is responding slowly.
Test it:
- curl http://localhost:8000/metrics
- curl http://localhost:8001/metrics
Integrate with Prometheus
  # nginx SSL handshake time check
  - job_name: nginx_ssl_check-192.168.100.1
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.100.200:8000']
        labels:
          instance: nginx-ssl-check-192.168.100.1
  - job_name: nginx_ssl_check-192.168.100.2
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.100.200:8001']
        labels:
          instance: nginx-ssl-check-192.168.100.2
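With more nginx nodes, the per-node jobs follow a regular pattern, so they can be derived mechanically. The sketch below is one hypothetical way to do that; it assumes host ports are assigned sequentially from 8000 in node order, as in the two containers above, and that 192.168.100.200 is the host running the probe containers.

```python
NODES = ["192.168.100.1", "192.168.100.2"]  # nginx servers from the article
PROBE_HOST = "192.168.100.200"              # host running the probe containers
BASE_PORT = 8000                            # assumed: ports count up from here


def scrape_jobs(nodes):
    """Build one scrape job dictionary per nginx node."""
    jobs = []
    for offset, ip in enumerate(nodes):
        jobs.append({
            "job_name": f"nginx_ssl_check-{ip}",
            "scrape_interval": "5s",
            "static_configs": [{
                "targets": [f"{PROBE_HOST}:{BASE_PORT + offset}"],
                "labels": {"instance": f"nginx-ssl-check-{ip}"},
            }],
        })
    return jobs


jobs = scrape_jobs(NODES)
for job in jobs:
    print(job["job_name"], job["static_configs"][0]["targets"][0])
```

The resulting dictionaries have the same shape as the YAML entries, so they could be dumped with any YAML library and pasted under scrape_configs.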
Restart Prometheus
Alerts
# cat rules/nginx_ssl_check-rules.yaml
groups:
  - name: nginx SSL request alerts
    rules:
      - alert: SSL request alert
        expr: SSL_handshake > 3
        for: 0m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "SSL request to {{$.Labels.instance}} took more than 3 seconds"
          description: "SSL request: {{$.Labels.instance}} --- more than 3 seconds (current: {{$value}})"
Configuring Prometheus to monitor Kafka
- Exporter: kafka-exporter
- GitHub: imduffy15/kafka_exporter (Kafka exporter for Prometheus)
Start it:
docker run -d \
  --restart=on-failure:5 \
  --name kafka_exporter \
  -v /etc/localtime:/etc/localtime \
  -p 9308:9308 \
  danielqsj/kafka-exporter:v1.2.0 \
  --kafka.server=172.30.0.11:9092
Integrate kafka_exporter with Prometheus
vim prometheus.yml
  # Kafka monitoring
  - job_name: 'kafka-172.30.0.11'
    scrape_interval: 10s
    static_configs:
      - targets: ['192.168.0.39:9308']
        labels:
          kafka_ip: 'kafka-172.30.0.11'
Restart the Prometheus container for the change to take effect.
Grafana dashboard ID: 7589 https://grafana.com/grafana/dashboards/7589
Alert rules
# cat rules/kafka-export-alert-rules.yaml
groups:
  - name: Kafka consumer lag alerts
    rules:
      - alert: Kafka consumer lag
        expr: sum(kafka_consumergroup_lag{topic!="sop_free_study_fix-student_wechat_detail"}) by (consumergroup, topic) > 1000
        for: 3m
        labels:
          severity: warning
          status: severe
        annotations:
          summary: "Kafka consumer lag"
          description: "{{$.Labels.consumergroup}}##{{$.Labels.topic}}: consumer lag above 1000 for 3 minutes (current: {{$value}})"
      - alert: kafka-exporter down
        expr: kafka_exporter_build_info < 1
        for: 3m
        labels:
          severity: warning
          status: severe
        annotations:
          summary: "kafka-exporter down"
          description: "kafka-exporter down {{$.Labels.instance}}"
      - alert: kafka server down
        expr: kafka_brokers < 1
        for: 3m
        labels:
          severity: warning
          status: severe
        annotations:
          summary: "kafka server down"
          description: "kafka server down {{$.Labels.job}}"
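The lag rule sums the per-partition lag for every consumer group and topic pair, skipping one excluded topic. A small Python sketch of that aggregation follows; the sample group names, topic names, and lag values are invented for the illustration.

```python
from collections import defaultdict

EXCLUDED_TOPIC = "sop_free_study_fix-student_wechat_detail"  # topic excluded by the rule
THRESHOLD = 1000


def lagging_groups(samples):
    """Mirror of: sum(kafka_consumergroup_lag{topic!=...}) by (consumergroup, topic) > 1000."""
    totals = defaultdict(float)
    for labels, value in samples:
        if labels["topic"] == EXCLUDED_TOPIC:
            continue  # the topic!= matcher drops these series before summing
        totals[(labels["consumergroup"], labels["topic"])] += value
    return {key: total for key, total in totals.items() if total > THRESHOLD}


# Invented per-partition samples: (labels, lag)
samples = [
    ({"consumergroup": "g1", "topic": "orders", "partition": "0"}, 600),
    ({"consumergroup": "g1", "topic": "orders", "partition": "1"}, 500),
    ({"consumergroup": "g2", "topic": "orders", "partition": "0"}, 300),
    ({"consumergroup": "g1", "topic": EXCLUDED_TOPIC, "partition": "0"}, 5000),
]
print(lagging_groups(samples))
```

Only g1 on the orders topic crosses the threshold here: neither of its partitions exceeds 1000 alone, but their sum does.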
Configuring Prometheus to monitor MySQL
- Exporter: mysqld-exporter
- GitHub: prometheus/mysqld_exporter (Exporter for MySQL server metrics)
- mysqld-exporter is deployed on 192.168.0.39
- The monitored MySQL server is deployed on 192.168.0.10
Create a user with the required privileges on the monitored database
CREATE USER 'mysql_exporter'@'192.168.0.39' IDENTIFIED BY '111111';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'mysql_exporter'@'192.168.0.39';
flush privileges;
Start mysqld-exporter
docker run -d \
  --name mysql-192.168.0.10 \
  -p 9510:9104 \
  --restart=on-failure:5 \
  -e DATA_SOURCE_NAME="mysql_exporter:111111@(192.168.0.10:3306)/" \
  prom/mysqld-exporter
Integrate mysqld-exporter with Prometheus
  # mysqld_exporter
  - job_name: mysql-192.168.0.10
    static_configs:
      - targets: ['192.168.0.39:9510']
Grafana dashboard ID: 7362
Alert configuration
I only cover replication here.
# cat rules/mysql_export-alert-rules.yaml
groups:
  - name: MySQL replication alerts
    rules:
      - alert: MySQL replication Slave_IO alert
        expr: mysql_slave_status_slave_io_running == 0
        for: 1s
        labels:
          severity: warning
          status: critical
        annotations:
          description: "{{$labels.job}}: MySQL replication Slave_IO stopped"
          summary: "MySQL replication Slave_IO stopped"
      - alert: MySQL replication Slave_SQL alert
        expr: mysql_slave_status_slave_sql_running == 0
        for: 1s
        labels:
          severity: warning
          status: critical
        annotations:
          description: "{{$labels.job}}: MySQL replication Slave_SQL stopped"
          summary: "MySQL replication Slave_SQL stopped"
      - alert: MySQL replication lag alert
        expr: mysql_slave_status_seconds_behind_master > 60
        for: 3m
        labels:
          severity: warning
          status: critical
        annotations:
          description: "{{$labels.job}}: MySQL replication lag > 60s (current: {{$value}})"
          summary: "MySQL replication lag above 60 seconds"
Configuring Prometheus to monitor Elasticsearch
- Exporter: elasticsearch-exporter
- GitHub: prometheus-community/elasticsearch_exporter (Elasticsearch stats exporter for Prometheus)
- Monitored targets: the cluster of 192.168.0.100, 192.168.0.101 and 192.168.0.102
- The exporter is deployed on 192.168.0.39
Start with docker-compose
cat /data/docker-compose_dir/elastic/docker-compose.yml
version: '3'
services:
  elasticsearch_exporter-192.168.0.100:
    image: quay.io/prometheuscommunity/elasticsearch-exporter:latest
    command:
      - '--es.uri=http://192.168.0.100:9200'
    restart: always
    ports:
      - "0.0.0.0:9600:9114"
  elasticsearch_exporter-192.168.0.101:
    image: quay.io/prometheuscommunity/elasticsearch-exporter:latest
    command:
      - '--es.uri=http://192.168.0.101:9200'
    restart: always
    ports:
      - "0.0.0.0:9601:9114"
  elasticsearch_exporter-192.168.0.102:
    image: quay.io/prometheuscommunity/elasticsearch-exporter:latest
    command:
      - '--es.uri=http://192.168.0.102:9200'
    restart: always
    ports:
      - "0.0.0.0:9602:9114"
Start it
docker-compose up -d
Integrate the exporter with Prometheus
  # elasticsearch_exporter
  - job_name: elastic-192.168.0.100
    scrape_interval: 15s
    static_configs:
      - targets: ['192.168.0.39:9600']
  - job_name: elastic-192.168.0.101
    scrape_interval: 15s
    static_configs:
      - targets: ['192.168.0.39:9601']
  - job_name: elastic-192.168.0.102
    scrape_interval: 15s
    static_configs:
      - targets: ['192.168.0.39:9602']
Alert configuration
# cat elastic-rules.yaml
groups:
  - name: ElasticSearch alerts
    rules:
      - alert: Cluster node count decreased
        expr: elasticsearch_cluster_health_number_of_nodes < 4
        for: 5m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "ES cluster node count decreased: {{$.Labels.job}}"
          description: "ES cluster node count decreased: {{$.Labels.job}} (current: {{$value}})"
      - alert: JVM memory usage alert
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.9
        for: 5m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "JVM memory usage too high: {{$.Labels.job}}"
          description: "JVM memory usage too high: {{$.Labels.job}} > 0.9 (current: {{$value}})"
Restart Prometheus
docker restart prometheus
Grafana dashboard: 2322; it needs some minor adjustments.
Configuring Prometheus to monitor Java services
Ask the developers to integrate Micrometer into the Spring Boot project. Once that is done, try the endpoint:
# curl http://ip:port/actuator/prometheus
If a large amount of metric data is returned, the integration works. Otherwise the steps below cannot be carried out.
Integrate with Prometheus
  # java
  - job_name: java
    scrape_interval: 10s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['172.30.0.8:8986',
                  '172.30.0.11:8986',
                  '172.30.0.21:8986']
        labels:
          service_name: 'aaa'
      - targets: ['172.30.0.18:18278',
                  '172.30.0.25:18278',
                  '172.30.0.36:18278']
        labels:
          service_name: 'bbb'
Grafana dashboard IDs: 6756 and 12856. I mainly use 12856, with a few panels from 6756 mixed in.
Alert rules
# cat /data/prometheus_dir/rules/java-rules.yaml
groups:
  - name: Java service alerts
    rules:
      - alert: Java service down
        expr: up{job="java"} == 0
        for: 1m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "Service down: {{$.Labels.service_name}}--{{$.Labels.instance}}"
          description: "Service down: {{$.Labels.service_name}}--{{$.Labels.instance}} (current: {{$value}})"
      - alert: Java API latency
        expr: irate(http_server_requests_seconds_sum{job="java",exception="None",uri!~".*actuator.*"}[1m]) / irate(http_server_requests_seconds_count{job="java",exception="None",uri!~".*actuator.*"}[1m]) > 3
        for: 1m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "API latency: {{$.Labels.job}}"
          description: "API latency: {{$.Labels.service_name}}--{{$.Labels.instance}} > 3s (current: {{$value}})"
      - alert: Java API status code
        expr: http_server_requests_seconds_count{job="java",uri!="/**",status!='200'}
        for: 1m
        labels:
          severity: warning
          status: critical
        annotations:
          summary: "Abnormal API status code: {{$.Labels.service_name}}--{{$.Labels.instance}}"
          description: "Abnormal API status code: {{$.Labels.service_name}}--{{$.Labels.instance}}--{{$.Labels.method}}--{{$.Labels.uri}} (current: {{$.Labels.status}})"
      - alert: Java GC count
        expr: irate(jvm_gc_pause_seconds_count{job="java"}[1m]) > 5
        labels:
          severity: warning
          status: alert
        annotations:
          summary: "GC count alert: {{$.Labels.service_name}}--{{$.Labels.instance}}"
          description: "Average GC count over 1 minute: {{$.Labels.service_name}}--{{$.Labels.instance}}--{{$.Labels.cause}} > 5 (current: {{$value}})"
      - alert: Java error log
        expr: irate(logback_events_total{level="error"}[1m]) > 50
        labels:
          severity: warning
          status: alert
        annotations:
          summary: "Error log alert: {{$.Labels.service_name}}--{{$.Labels.instance}}"
          description: "Too many error log entries on average over 1 minute: {{$.Labels.service_name}}--{{$.Labels.instance}} > 50 (current: {{$value}})"
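The latency rule divides the per-second increase of the request-time sum by the per-second increase of the request count, which yields the average time per request between the last two scrapes. A rough Python sketch of that arithmetic, with invented sample values and a hypothetical helper name, looks like this:

```python
from typing import Optional, Tuple

Sample = Tuple[float, float]  # (http_server_requests_seconds_sum, ..._count)


def average_latency(prev: Sample, curr: Sample) -> Optional[float]:
    """Average seconds per request between two consecutive scrapes."""
    delta_sum = curr[0] - prev[0]
    delta_count = curr[1] - prev[1]
    if delta_count <= 0:
        # No new requests: in PromQL the division gives NaN, which never exceeds 3
        return None
    # The scrape interval appears in both rates and cancels out
    return delta_sum / delta_count


prev = (120.0, 100)  # invented counters at the previous scrape
curr = (155.0, 110)  # invented counters at the latest scrape
latency = average_latency(prev, curr)
print(latency, latency is not None and latency > 3)
```

In this example 10 new requests took 35 seconds in total, so the average of 3.5 seconds would satisfy the rule's condition once it has held for the configured 1 minute.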
That concludes today's walkthrough of configuring Prometheus to monitor services commonly found in a real environment, such as URLs, SSL requests, Nginx, MySQL, Kafka, Elasticsearch, and Java services.
Reference: https://blog.csdn.net/weixin_38367535/category_11425243.html