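Cloud Monitor Best Practices: Custom Monitoring of Network-Layer Metrics

How do you monitor TCP/UDP connection-state metrics at the network layer of a cloud server?

We recommend Cloud Monitor custom monitoring.

It is free during the current beta, requires no review, and is ready to use as soon as the service is activated. You are welcome to join the beta through the application page.

This article describes how to use shell commands and the Python SDK to report key network-layer metrics to custom monitoring, and then view the metrics and configure alarms there.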

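Background

Monitor key network-layer metrics on a cloud server on a regular schedule, and send an SMS alarm when a metric meets the alarm condition you have set.

Prerequisites

You have purchased a Tencent Cloud CVM instance.
Python 2.7 or later and the pip tool are installed on the CVM instance.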

Data Reporting

Step 1: Prepare the reporting environment

1. Run the following command to install the Python SDK.

pip install tencentcloud-sdk-python

2. Create the configuration file ~/.ServerMonitor.json on the CVM instance with the following content:

{
 "SecretId": "xxxxx",
 "SecretKey": "xxxx",
 "Region": "ap-guangzhou"
}

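Region: the region. See the list of regions where custom monitoring is available.

3. Run the following shell command so that only the file owner (the current user) can read and write the configuration file.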

chmod 0600 ~/.ServerMonitor.json
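
The same result can be reached from Python when the file is generated by a provisioning script. The following is a minimal sketch, not part of the original procedure: the helper name write_private_config is hypothetical, and the demo writes to a temporary directory rather than the real home directory. It creates the file with mode 0600 from the start, so the keys are never readable by other users.

```python
import json
import os
import stat
import tempfile


def write_private_config(path, secret_id, secret_key, region):
    """Create the config file readable and writable by its owner only."""
    conf = {"SecretId": secret_id, "SecretKey": secret_key, "Region": region}
    # Passing 0o600 to os.open sets the permissions at creation time,
    # so the secrets are never briefly readable by other users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(conf, f, indent=1)
    # A file that already existed keeps its old mode, so enforce 0600 explicitly.
    os.chmod(path, 0o600)


if __name__ == "__main__":
    # Demo against a temporary directory instead of ~/.ServerMonitor.json.
    demo_path = os.path.join(tempfile.mkdtemp(), ".ServerMonitor.json")
    write_private_config(demo_path, "xxxxx", "xxxx", "ap-guangzhou")
    print(oct(stat.S_IMODE(os.stat(demo_path).st_mode)))
```

Pointing the helper at os.path.expanduser("~/.ServerMonitor.json") would produce the file described in this step.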

Step 2: Collect and report data

1. Create a file named ServerMonitor.py to collect and report the data. The code is as follows:

#!/usr/bin/env python
#
# A simple server monitor demo use Tencent cloud PutMonitorData api
import json
import os
import re
import socket
import sys
import time

from tencentcloud.common import credential
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException
from tencentcloud.monitor.v20180724 import monitor_client, models

GLOBAL_CONF = None


def load_conf():
    conf_path = os.path.expanduser("~/.ServerMonitor.json")
    if not os.path.exists(conf_path):
        print("config file %s not found!" % conf_path)
        sys.exit(1)
    config_error_msg = """load config error, sample format:
{
    "SecretId": "xxxxxxx",
    "SecretKey": "xxxxxxx",
    "Region": "ap-guangzhou"
}
    """
    try:
        conf = json.loads(open(conf_path).read())
        if not isinstance(conf, dict):
            raise ValueError("config file format error")
    except:
        print(config_error_msg)
        sys.exit(1)
    if not conf.get("SecretId") or not conf.get("SecretKey") or not conf.get("Region"):
        print(config_error_msg)
        sys.exit(1)
    return conf


def get_lan_ip():
    """
    get lan ip use fake udp connection
    this does not really 'connect' to any server
    """
    # can be any routable address,
    fake_dest = ("10.10.10.10", 53)
    lan_ip = ""
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(fake_dest)
        lan_ip = s.getsockname()[0]
        s.close()
    except Exception:
        # best effort: fall back to an empty string when no route is available
        pass
    return lan_ip


class MonitorBase(object):

    def __init__(self, sleep_time):
        self.sleep_time = sleep_time
        self.result1 = None

    def get_metrics(self):
        """
        collect metrics from system
        return metrics as dict: { "key1":v1, "key2": v2 }
        """
        return {}

    def process(self):
        """
        call get_metrics twice between sleep_time and calc final result to report
        return metrics as dict: { "key1":v1, "key2": v2 }
        """
        result = self.get_metrics()
        if self.sleep_time == 0:
            return result

        self.result1 = result
        time.sleep(self.sleep_time)
        result2 = self.get_metrics()
        metrics = {}
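        for key in result2.keys():
            # int() is arbitrary-precision on Python 3 and promotes to long on Python 2
            metrics[key] = int(result2[key]) - int(result.get(key, 0))
            # workaround value wrap: add 2**32 when a counter rolled over
            if metrics[key] < 0:
                metrics[key] += 4294967296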
        return metrics

    def report(self):
        """
        report metrics to cloud api
        :return:
        """
        metrics = self.process()
        try:
            cred = credential.Credential(GLOBAL_CONF["SecretId"], GLOBAL_CONF["SecretKey"])
            http_profile = HttpProfile()
            http_profile.endpoint = "monitor.tencentcloudapi.com"

            client_profile = ClientProfile()
            client_profile.httpProfile = http_profile
            client = monitor_client.MonitorClient(cred, GLOBAL_CONF["Region"], client_profile)

            req = models.PutMonitorDataRequest()
            from pprint import pprint
            # limit metrics to report
            metrics_allowed = ["TcpActiveOpens", "TcpPassiveOpens", "TcpAttemptFails", "TcpEstabResets",
                               "TcpRetransSegs", "TcpExtListenOverflows", "UdpInDatagrams", "UdpOutDatagrams",
                               "UdpInErrors", "UdpNoPorts", "UdpSndbufErrors"]
            report_data = {"Metrics": [], "AnnounceInstance": get_lan_ip()}
            for k, v in metrics.items():
                if k in metrics_allowed:
                    report_data["Metrics"].append({"MetricName": k, "Value": v})
            req.from_json_string(json.dumps(report_data))
            pprint(report_data)
            resp = client.PutMonitorData(req)
            print(resp.to_json_string())
        except TencentCloudSDKException as err:
            print(err)


class NetMonitor(MonitorBase):
    """
        parse /proc/net/snmp & /proc/net/netstat
    """

    def get_metrics(self):
        snmp_dict = {}
        snmp_lines = open("/proc/net/snmp").readlines()
        netstat_lines = open("/proc/net/netstat").readlines()
        snmp_lines.extend(netstat_lines)

        # split on runs of colons and whitespace: "Tcp: A B" -> ["Tcp", "A", "B"]
        sep = re.compile(r'[:\s]+')
        n = 0
        for line in snmp_lines:
            n += 1
            fields = sep.split(line.strip())
            proto = fields.pop(0)
            if n % 2 == 1:
                # header line
                keys = fields
            else:
                # value line
                try:
                    values = [int(f) for f in fields]
                except Exception as e:
                    # skip a value line that cannot be parsed
                    print(e)
                    continue
                kv = dict(zip(keys, values))
                proto_dict = snmp_dict.setdefault(proto, {})
                proto_dict.update(kv)
        return snmp_dict


class NetSnmpIpTcpUdp(NetMonitor):
    """
        Get ip/tcp/udp information from /proc/net/snmp
    """

    def get_metrics(self):
        snmp_dict = super(NetSnmpIpTcpUdp, self).get_metrics()
        metrics = {}
        for proto in ("Tcp", "Ip", "Udp", "Icmp", "TcpExt"):
            if proto not in snmp_dict:
                continue
            for k, v in snmp_dict[proto].items():
                k = proto + k
                metrics[k] = v
        return metrics

    def process(self):
        report_dict = super(NetSnmpIpTcpUdp, self).process()
        # CurrEstab is a tmp value, not inc value
        report_dict['TcpCurrEstab'] = self.result1['TcpCurrEstab']
        return report_dict


if __name__ == "__main__":
    GLOBAL_CONF = load_conf()
    process_dict = {
        NetSnmpIpTcpUdp: 60,
    }
    children = []
    for key in process_dict.keys():
        try:
            pid = os.fork()
        except OSError:
            sys.exit("Unable to create child process!")
        if pid == 0:
            monitor = key(process_dict[key])
            monitor.report()
            sys.exit(0)
        else:
            children.append(pid)

    for i in children:
        os.wait()

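The SecretId, SecretKey, and Region values that the code reads must be filled in according to your own account. Region: the region; see the list of regions where custom monitoring is available. SecretId and SecretKey: obtain them from the API key management page.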
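
The counters in /proc/net/snmp are cumulative, which is why the script samples them twice, sleep_time seconds apart, and reports the difference. The following is a minimal standalone sketch of that calculation, assuming 32-bit counters as the script does. The function name counter_deltas and the sample values are hypothetical.

```python
COUNTER_WRAP = 4294967296  # 2**32, the wrap correction applied by the script


def counter_deltas(first, second, wrap=COUNTER_WRAP):
    """Return per-key increases between two samples of cumulative counters."""
    deltas = {}
    for key, value in second.items():
        delta = int(value) - int(first.get(key, 0))
        if delta < 0:
            # The counter overflowed and restarted from zero between samples.
            delta += wrap
        deltas[key] = delta
    return deltas


if __name__ == "__main__":
    # Made-up samples: the UDP counter wraps around between the two reads.
    before = {"TcpActiveOpens": 100, "UdpInDatagrams": 4294967290}
    after = {"TcpActiveOpens": 130, "UdpInDatagrams": 4}
    print(counter_deltas(before, after))
```

TcpCurrEstab is the exception: it is an instantaneous value rather than a cumulative counter, so the script takes it directly from the first sample instead of computing a difference.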

2. Once the file is ready, place ServerMonitor.py in the /usr/local/bin directory.

3. Add ServerMonitor.py to crontab as a scheduled task so that network-layer metrics are reported automatically.

chmod a+x /usr/local/bin/ServerMonitor.py
crontab -l > /tmp/cron.bak
echo "* * * * * /usr/local/bin/ServerMonitor.py > /tmp/ServerMonitor.log 2>&1" >> /tmp/cron.bak
crontab /tmp/cron.bak
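
When checking what each scheduled run sends, it can help to look at the request body on its own. The sketch below rebuilds the structure that the script passes to PutMonitorDataRequest (a Metrics list plus AnnounceInstance) without sending anything. The function name build_report_data, the sample values, and the IP address 10.0.0.8 are hypothetical.

```python
import json

# Metric names the script reports; any other collected counter is dropped.
METRICS_ALLOWED = [
    "TcpActiveOpens", "TcpPassiveOpens", "TcpAttemptFails", "TcpEstabResets",
    "TcpRetransSegs", "TcpExtListenOverflows", "UdpInDatagrams",
    "UdpOutDatagrams", "UdpInErrors", "UdpNoPorts", "UdpSndbufErrors",
]


def build_report_data(metrics, instance_ip):
    """Build the request body from a {name: value} dict of collected metrics."""
    return {
        "AnnounceInstance": instance_ip,
        "Metrics": [
            {"MetricName": name, "Value": value}
            for name, value in sorted(metrics.items())
            if name in METRICS_ALLOWED
        ],
    }


if __name__ == "__main__":
    # Made-up one-minute deltas; TcpCurrEstab is filtered out by the allow-list.
    sample = {"TcpAttemptFails": 1, "TcpCurrEstab": 12, "UdpInDatagrams": 250}
    print(json.dumps(build_report_data(sample, "10.0.0.8"), indent=2))
```

The script prints the same structure with pprint before each request, so /tmp/ServerMonitor.log should contain a comparable body once the cron job has run.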

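Data Query

After the data has been reported, it is visible in the metric view.

The "Configure Alarms" and "Receive Alarms" sections show only one monitoring scenario as an example. To configure alarms for other reported network-layer metrics, follow steps 2 and 3 under "Configure Alarms".

Configure Alarms

Scenario: monitor the number of failed TCP connections at the network layer on a regular schedule, and send an SMS alarm when the number is greater than 0.

1. Confirm that the user's message channel has been verified. You can check the verification status on the CAM authentication page.

2. Go to the custom monitoring metric view page, and in the upper-right corner of the metric view select [···] > [Configure Alarm].

3. Configure the alarm rule according to the scenario. For detailed steps, see Configuring Alarm Policies.

The example in the figure sends an SMS alarm when the number of failed TCP connections stays greater than 0 for one statistical period (1 minute), and repeats the alarm every 5 minutes.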

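Receive Alarms

If the number of failed TCP connections is greater than 0, you will receive an SMS alarm after 5 minutes with the following content: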

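[Tencent Cloud] Cloud Monitor custom monitoring metric alarm triggered

Account ID: 34xxxxxxxx, nickname: Custom Monitoring

Alarm details

Alarm content: Metric view | Number of failed TCP connections greater than 0

Alarm object: TcpAttemptFails

Current value: 1

APPID: 125xxxxxxx

Alarm policy: View alarm

Trigger time: 2019-12-09 22:36:00 (UTC+08:00)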

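Metric Description

Metric description	Metric name	Unit
TCP active opens	TcpActiveOpens	times
TCP passive opens	TcpPassiveOpens	times
TCP failed connection attempts	TcpAttemptFails	times
TCP connections reset abnormally	TcpEstabResets	times
TCP retransmitted segments (total)	TcpRetransSegs	count
TCP listen queue overflows	TcpExtListenOverflows	times
UDP datagrams received	UdpInDatagrams	count
UDP datagrams sent	UdpOutDatagrams	count
UDP receive errors	UdpInErrors	count
UDP port unreachable	UdpNoPorts	count
UDP send buffer full	UdpSndbufErrors	times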
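
Each metric name in the table is a protocol prefix joined to a counter name from /proc/net/snmp (TcpExtListenOverflows comes from the TcpExt section of /proc/net/netstat). The following minimal sketch shows how such names are derived. It parses an abbreviated sample with made-up values rather than the real files, and the function name parse_counters is hypothetical.

```python
import re

# Abbreviated sample in the /proc/net/snmp layout: each protocol has a
# header line with counter names, followed by a line with the values.
SAMPLE = """Tcp: ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab
Tcp: 120 340 3 7 12
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 250 4 0 260
"""


def parse_counters(text):
    """Return {"<Proto><CounterName>": value} from header/value line pairs."""
    sep = re.compile(r"[:\s]+")
    metrics = {}
    lines = text.strip().splitlines()
    # Lines come in (header, values) pairs that start with the same protocol.
    for header, values in zip(lines[0::2], lines[1::2]):
        names = sep.split(header.strip())
        numbers = sep.split(values.strip())
        proto = names.pop(0)
        numbers.pop(0)  # drop the repeated protocol label
        for name, number in zip(names, numbers):
            metrics[proto + name] = int(number)
    return metrics


if __name__ == "__main__":
    for name, value in sorted(parse_counters(SAMPLE).items()):
        print(name, value)
```

On a Linux host, passing open("/proc/net/snmp").read() to the same function should yield the full set of counters, from which the names in the table are selected.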
