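3.4 Alerting
3.4.1 Introduction to Alerting
At regular intervals, SkyWalking evaluates the collected tracing data against the configured alarm rules (for example, service response time or service response time percentiles) and sends an alarm message when a threshold is reached. Alarm messages are delivered by calling a webhook endpoint that you define yourself, so developers can implement any notification channel in that endpoint, such as email or SMS. Alarm messages can also be viewed in RocketBot.
The default alarm rules are shown below. They are in the alarm-settings.yml file in the config folder under the SkyWalking installation directory: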
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
  #  endpoint_avg_rule:
  #    metrics-name: endpoint_avg
  #    op: ">"
  #    threshold: 1000
  #    period: 10
  #    count: 2
  #    silence-period: 5
  #    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/
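The file above defines four default rules:
- The average service response time exceeded 1 second in 3 of the last 10 minutes
- The service success rate was below 80% in 2 of the last 10 minutes
- The 90th-percentile service response time exceeded 1 second in 3 of the last 10 minutes
- The average service instance response time exceeded 1 second in 2 of the last 10 minutes
The attributes used in a rule are as follows.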
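Attribute reference
Attribute | Meaning |
---|---|
metrics-name | Metric name defined in the OAL script |
threshold | Threshold, interpreted together with metrics-name and the comparison operator below |
op | Comparison operator; can be set to >, <, or = |
period | Length of the time window over which the metric is evaluated, in minutes |
count | Number of times the condition must be met within the window before an alarm message is sent |
silence-period | How long identical alarm messages are suppressed after an alarm fires; defaults to the same value as period |
message | Content of the alarm message |
include-names | List of services to which this rule applies |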
The webhooks section configures the addresses that are called when an alarm is raised.
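To make the interplay of period, count, and silence-period more concrete, the following is a minimal sketch of how one rule with op ">" could be evaluated once per minute. It is a simplified model written for illustration, not SkyWalking's actual implementation; the class name AlarmRuleSketch and its methods are hypothetical, and the exact silence semantics in SkyWalking may differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Simplified model of a single alarm rule with op ">". Hypothetical, not SkyWalking code. */
class AlarmRuleSketch {
    private final long threshold;
    private final int period;        // window length, in minutes
    private final int count;         // matches inside the window needed to fire
    private final int silencePeriod; // number of checks to stay quiet after firing
    private final Deque<Long> window = new ArrayDeque<>();
    private int silenceLeft = 0;

    AlarmRuleSketch(long threshold, int period, int count, int silencePeriod) {
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    /** Feeds one per-minute metric value; returns true if an alarm would fire at this check. */
    boolean check(long value) {
        window.addLast(value);
        if (window.size() > period) {
            window.removeFirst(); // keep only the last `period` minutes
        }
        long matches = window.stream().filter(v -> v > threshold).count();
        if (silenceLeft > 0) {
            silenceLeft--;        // still silent: suppress the alarm
            return false;
        }
        if (matches >= count) {
            silenceLeft = silencePeriod;
            return true;
        }
        return false;
    }

    /** Returns the 1-based minutes at which the rule fires for the given series. */
    static List<Integer> firingMinutes(AlarmRuleSketch rule, long[] perMinuteValues) {
        List<Integer> fired = new ArrayList<>();
        for (int i = 0; i < perMinuteValues.length; i++) {
            if (rule.check(perMinuteValues[i])) {
                fired.add(i + 1);
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        // Same numbers as service_resp_time_rule: threshold 1000, period 10, count 3, silence-period 5
        AlarmRuleSketch rule = new AlarmRuleSketch(1000, 10, 3, 5);
        long[] values = {1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500};
        System.out.println(firingMinutes(rule, values));
    }
}
```

In this model a service that is constantly slow fires on the third minute, stays quiet for the next five checks, and then fires again, which is why a larger silence-period reduces repeated notifications for the same problem.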
3.4.2 Test Code for Alerting
To test alerting, create a project named skywalking_alarm and write the following endpoints.
AlarmController
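import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
@RestController
public class AlarmController {
    // Sleeps for 1.5 seconds on every call to simulate a slow response that triggers the alarm
    @GetMapping("/timeout")
    public String timeout() {
        try {
            Thread.sleep(1500);
        } catch (InterruptedException e) {
            // Restore the interrupt flag instead of swallowing the interruption
            Thread.currentThread().interrupt();
        }
        return "timeout";
    }
}
This endpoint simulates a slow response; calling it repeatedly generates alarm messages.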
WebHooks
import com.sf.saas.skywalking_alarm.pojo.AlarmMessage;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import java.util.ArrayList;
import java.util.List;
@RestController
public class WebHooks {
    // Most recent batch of alarms; volatile because it is written and read by different request threads
    private volatile List<AlarmMessage> lastList = new ArrayList<>();
    // Called by SkyWalking when alarms are raised
    @PostMapping("/webhook")
    public void webhook(@RequestBody List<AlarmMessage> alarmMessageList) {
        lastList = alarmMessageList;
    }
    // Returns the most recently received alarms for inspection
    @GetMapping("/show")
    public List<AlarmMessage> show() {
        return lastList;
    }
}
When an alarm is raised, SkyWalking calls the webhook endpoint. The endpoint must accept POST requests and receive its parameter via @RequestBody. The payload format is:
[{
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "serviceA",
  "id0": 12,
  "id1": 0,
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage xxxx",
  "startTime": 1560524171000
}, {
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "serviceB",
  "id0": 23,
  "id1": 0,
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage yyy",
  "startTime": 1560524171000
}]
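To exercise the webhook endpoint without waiting for a real alarm, one option is to POST a payload of this shape by hand. The sketch below does that with the JDK's built-in HTTP client (Java 11 or later). The class name WebhookProbeSketch is hypothetical, and the default URL assumes the application runs locally on port 8089 with the /webhook path.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper that sends a sample alarm payload to a webhook endpoint. */
class WebhookProbeSketch {
    // One-element payload in the format shown above
    static final String SAMPLE =
        "[{\"scopeId\":1,\"scope\":\"SERVICE\",\"name\":\"serviceA\",\"id0\":12,\"id1\":0,"
        + "\"ruleName\":\"service_resp_time_rule\",\"alarmMessage\":\"alarmMessage xxxx\","
        + "\"startTime\":1560524171000}]";

    /** POSTs the JSON body to the URL and returns the HTTP status code. */
    static int post(String url, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();
        HttpResponse<Void> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        // Assumed default address; pass a different URL as the first argument if needed
        String url = args.length > 0 ? args[0] : "http://127.0.0.1:8089/webhook";
        System.out.println("HTTP " + post(url, SAMPLE));
    }
}
```

A 200 status suggests the endpoint accepted the request and deserialized the body.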
AlarmMessage
public class AlarmMessage {
    private int scopeId;
    private String name;
    private int id0;
    private int id1;
    // The alarm message text
    private String alarmMessage;
    // The time at which the alarm was raised
    private long startTime;
    public int getScopeId() {
        return scopeId;
    }
    public void setScopeId(int scopeId) {
        this.scopeId = scopeId;
    }
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public int getId0() {
        return id0;
    }
    public void setId0(int id0) {
        this.id0 = id0;
    }
    public int getId1() {
        return id1;
    }
    public void setId1(int id1) {
        this.id1 = id1;
    }
    public String getAlarmMessage() {
        return alarmMessage;
    }
    public void setAlarmMessage(String alarmMessage) {
        this.alarmMessage = alarmMessage;
    }
    public long getStartTime() {
        return startTime;
    }
    public void setStartTime(long startTime) {
        this.startTime = startTime;
    }
    @Override
    public String toString() {
        return "AlarmMessage{" +
                "scopeId=" + scopeId +
                ", name='" + name + '\'' +
                ", id0=" + id0 +
                ", id1=" + id1 +
                ", alarmMessage='" + alarmMessage + '\'' +
                ", startTime=" + startTime +
                '}';
    }
}
This entity class receives the alarm information sent to the webhook endpoint.
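The startTime value in the sample payload looks like a Unix timestamp in milliseconds (1560524171000 falls in June 2019); that reading is an assumption based on the sample. Under that assumption, the sketch below turns the fields of one alarm into a readable one-line summary. The class and method names are hypothetical.

```java
import java.time.Instant;

/** Hypothetical helper that renders one alarm as a single readable line. */
class AlarmSummarySketch {
    /** Assumes startTimeMillis is milliseconds since the Unix epoch. */
    static String summarize(String name, String alarmMessage, long startTimeMillis) {
        Instant started = Instant.ofEpochMilli(startTimeMillis);
        return "[" + started + "] " + name + ": " + alarmMessage;
    }

    public static void main(String[] args) {
        // Values taken from the first element of the sample payload
        System.out.println(summarize("serviceA", "alarmMessage xxxx", 1560524171000L));
        // prints "[2019-06-14T14:56:11Z] serviceA: alarmMessage xxxx"
    }
}
```

A line like this could serve as the body of an email or SMS sent from the webhook handler.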
3.4.3 Deployment and Testing
First, edit the alarm rule configuration file and change the webhook address to:
webhooks:
  - http://127.0.0.1:8089/webhook
Then restart SkyWalking.
1. Upload skywalking_alarm.jar to the /usr/local/skywalking directory.
2. Start the skywalking_alarm application and wait for it to start successfully.
java -javaagent:/usr/local/skywalking/apache-skywalking-apm-bin/agent/skywalking-agent.jar -Dskywalking.agent.service_name=skywalking_alarm -jar skywalking_alarm.jar
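3. Call the endpoint repeatedly at http://<VM IP>:8089/timeout
4. Keep calling it until an alarm appears:
5. View the alarm information at http://<VM IP>:8089/show
As the figure above shows, the alarm information has been received. In production, the webhook endpoint can be integrated with SMS, email, or similar platforms so that the responsible staff are notified as soon as an alarm is raised, which shortens the time needed to handle a fault.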
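Step 3 above calls the /timeout endpoint repeatedly by hand. As a rough sketch, the same can be done from a small Java program (Java 11 or later). The class name TimeoutLoadSketch and the default of 200 calls are assumptions chosen for illustration: at about 1.5 seconds per call, 200 sequential calls keep the service slow for roughly five minutes, which should be long enough for the 3-of-10-minutes rule to match.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical load generator that calls one URL sequentially. */
class TimeoutLoadSketch {
    /** Sends `times` GET requests one after another; returns how many answered with HTTP 200. */
    static int callRepeatedly(String url, int times) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        int ok = 0;
        for (int i = 0; i < times; i++) {
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                ok++;
            }
        }
        return ok;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            // The VM address is environment-specific, so it must be supplied by the caller
            System.err.println("usage: java TimeoutLoadSketch.java <url> [times]");
            return;
        }
        int times = args.length > 1 ? Integer.parseInt(args[1]) : 200;
        System.out.println(callRepeatedly(args[0], times) + " successful calls");
    }
}
```

Once the program finishes, or while it is still running, the /show endpoint can be checked for received alarms.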