A business team reported that a dashboard built on the container_cpu_load_average_10s metric keeps fluctuating between 0 and 1 for pods that receive no business traffic, and asked why. The dashboard uses the following expression: max by (pod) (container_cpu_load_average_10s{container!="",container!~"sandbox|logrotate|sidecar",pod=~"$pod", container=~"$container"}) / 1000 / max by (pod) (kube_pod_container_resource_limits_cpu_cores{container!="",container!~"sandbox|logrotate|sidecar",pod=~"$pod", container=~"$container"})
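The result of this expression depends mainly on the container_cpu_load_average_10s value collected at each scrape, so let's look at how that value is computed.
Starting from cadvisor's updateLoad function, loadAvg is computed with the following formula:
cd.loadAvg = cd.loadAvg*cd.loadDecay + float64(newLoad)*(1.0-cd.loadDecay)
In other words, the cd.loadAvg from the previous collection is multiplied by the decay factor cd.loadDecay, and the newly sampled newLoad multiplied by (1.0-cd.loadDecay) is added to it, giving the current cd.loadAvg. Multiplying cd.loadAvg by 1000 then yields container_cpu_load_average_10s.
The key questions are therefore how cd.loadDecay and the newLoad argument of updateLoad are obtained.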
// Calculate new smoothed load average using the new sample of runnable threads.
// The decay used ensures that the load will stabilize on a new constant value within
// 10 seconds.
func (cd *containerData) updateLoad(newLoad uint64) {
	if cd.loadAvg < 0 {
		cd.loadAvg = float64(newLoad) // initialize to the first seen sample for faster stabilization.
	} else {
		cd.loadAvg = cd.loadAvg*cd.loadDecay + float64(newLoad)*(1.0-cd.loadDecay)
	}
}
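To get a feel for how this smoothing behaves over several collections, here is a minimal standalone sketch. The smoother type and its method names are hypothetical stand-ins for the relevant fields of containerData, and the 10-second interval is the value assumed throughout this article; it is not cadvisor code.

```go
package main

import (
	"fmt"
	"math"
)

// smoother is a hypothetical stand-in for the loadAvg/loadDecay fields
// that cadvisor keeps in containerData.
type smoother struct {
	loadAvg   float64 // negative means "no sample seen yet"
	loadDecay float64
}

// newSmoother derives the decay factor from the interval in seconds,
// using the same expression as the article: exp(-interval/10).
func newSmoother(intervalSeconds float64) *smoother {
	return &smoother{loadAvg: -1, loadDecay: math.Exp(-intervalSeconds / 10)}
}

// update folds one sample of running tasks into the average and returns
// the milli-load value (loadAvg * 1000, truncated to int32).
func (s *smoother) update(nrRunning uint64) int32 {
	if s.loadAvg < 0 {
		s.loadAvg = float64(nrRunning)
	} else {
		s.loadAvg = s.loadAvg*s.loadDecay + float64(nrRunning)*(1.0-s.loadDecay)
	}
	return int32(s.loadAvg * 1000)
}

func main() {
	s := newSmoother(10)
	// Made-up sequence of sampled running-task counts.
	for _, n := range []uint64{0, 1, 0, 0, 1} {
		fmt.Printf("nr_running=%d milli-load=%d\n", n, s.update(n))
	}
}
```

With this made-up sequence, a single sample that sees one running task lifts the value to 632, and it then decays to 232 and 85 over the next two collections instead of dropping straight back to 0.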
With the default settings cd.loadDecay is a constant, 0.36787944117144233 (so 1.0-cd.loadDecay is 0.6321205588285577). It is set as follows:
func newContainerData(containerName string, memoryCache *memory.InMemoryCache, handler container.ContainerHandler, logUsage bool, collectorManager collector.CollectorManager, maxHousekeepingInterval time.Duration, allowDynamicHousekeeping bool, clock clock.Clock) (*containerData, error) {
	......
	cont.loadDecay = math.Exp(float64(-cont.housekeepingInterval.Seconds() / 10))
	......
}
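As the code above shows, cd.loadDecay is computed as:
cont.loadDecay = math.Exp(float64(-cont.housekeepingInterval.Seconds() / 10))
housekeepingInterval.Seconds() is 10 by default here, so a small program that plugs in the numbers gives the value of cont.loadDecay directly.
The result is cont.loadDecay = 0.36787944117144233 and 1.0-cont.loadDecay = 0.6321205588285577. The program below prints these two values in that order.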
[root@VM-10-5-centos goproject]# cat test.go
package main

import (
	"fmt"
	"math"
)

func main() {
	// x is loadDecay for a 10-second interval; 1.0-x is the weight of the new sample.
	x := math.Exp(float64(-10 / 10))
	fmt.Print(x, "'s exponential value is ", (1.0 - x))
}
[root@VM-10-5-centos goproject]# go run test.go
0.36787944117144233's exponential value is 0.6321205588285577
Where does the newLoad argument of updateLoad(newLoad uint64) come from?
When updateStats calls updateLoad to refresh the load, it passes loadStats.NrRunning as newLoad.
func (cd *containerData) updateStats() error {
	stats, statsErr := cd.handler.GetStats()
	if statsErr != nil {
		// Ignore errors if the container is dead.
		if !cd.handler.Exists() {
			return nil
		}
		// Stats may be partially populated, push those before we return an error.
		statsErr = fmt.Errorf("%v, continuing to push stats", statsErr)
	}
	if stats == nil {
		return statsErr
	}
	if cd.loadReader != nil {
		// TODO(vmarmol): Cache this path.
		path, err := cd.handler.GetCgroupPath("cpu")
		if err == nil {
			loadStats, err := cd.loadReader.GetCpuLoad(cd.info.Name, path)
			if err != nil {
				return fmt.Errorf("failed to get load stat for %q - path %q, error %s", cd.info.Name, path, err)
			}
			stats.TaskStats = loadStats
			cd.updateLoad(loadStats.NrRunning)
			// convert to 'milliLoad' to avoid floats and preserve precision.
			stats.Cpu.LoadAverage = int32(cd.loadAvg * 1000)
		}
	}
	......
}
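cadvisor sends a request message to the kernel through the call chain below. The request uses the command CGROUPSTATS_CMD_GET and carries the file descriptor of the cgroup directory in the CGROUPSTATS_CMD_ATTR_FD attribute.
updateStats->GetCpuLoad->getLoadStats
->prepareCmdMessage
->conn.WriteMessage
After sending the message, cadvisor waits in conn.ReadMessage() for the kernel's reply. It then parses the reply to obtain the number of tasks in each state under the container's cgroup and stores the result in LoadStats.
loadStats.NrRunning is therefore the number of tasks that were in the running state (TASK_RUNNING) at the moment of collection.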
// This mirrors kernel internal structure.
type LoadStats struct {
	// Number of sleeping tasks.
	NrSleeping uint64 `json:"nr_sleeping"`
	// Number of running tasks.
	NrRunning uint64 `json:"nr_running"`
	// Number of tasks in stopped state
	NrStopped uint64 `json:"nr_stopped"`
	// Number of tasks in uninterruptible state
	NrUninterruptible uint64 `json:"nr_uninterruptible"`
	// Number of tasks waiting on IO
	NrIoWait uint64 `json:"nr_io_wait"`
}

// Returns instantaneous number of running tasks in a group.
// Caller can use historical data to calculate cpu load.
// path is an absolute filesystem path for a container under the CPU cgroup hierarchy.
// NOTE: non-hierarchical load is returned. It does not include load for subcontainers.
func (r *NetlinkReader) GetCpuLoad(name string, path string) (info.LoadStats, error) {
	if len(path) == 0 {
		return info.LoadStats{}, fmt.Errorf("cgroup path can not be empty")
	}
	cfd, err := os.Open(path)
	if err != nil {
		return info.LoadStats{}, fmt.Errorf("failed to open cgroup path %s: %q", path, err)
	}
	defer cfd.Close()
	stats, err := getLoadStats(r.familyID, cfd, r.conn)
	if err != nil {
		return info.LoadStats{}, err
	}
	klog.V(4).Infof("Task stats for %q: %+v", path, stats)
	return stats, nil
}

// Get load stats for a task group.
// id: family id for taskstats.
// cfd: open file to path to the cgroup directory under cpu hierarchy.
// conn: open netlink connection used to communicate with kernel.
func getLoadStats(id uint16, cfd *os.File, conn *Connection) (info.LoadStats, error) {
	msg := prepareCmdMessage(id, cfd.Fd())
	err := conn.WriteMessage(msg.toRawMsg())
	if err != nil {
		return info.LoadStats{}, err
	}
	resp, err := conn.ReadMessage()
	if err != nil {
		return info.LoadStats{}, err
	}
	parsedmsg, err := parseLoadStatsResp(resp)
	if err != nil {
		return info.LoadStats{}, err
	}
	return parsedmsg.Stats, nil
}

// Prepares message to query task stats for a task group.
func prepareCmdMessage(id uint16, cfd uintptr) (msg netlinkMessage) {
	buf := bytes.NewBuffer([]byte{})
	addAttribute(buf, unix.CGROUPSTATS_CMD_ATTR_FD, uint32(cfd), 4)
	return prepareMessage(id, unix.CGROUPSTATS_CMD_GET, buf.Bytes())
}

// Prepares the message and generic headers and appends attributes as data.
func prepareMessage(headerType uint16, cmd uint8, attributes []byte) (msg netlinkMessage) {
	msg.Header.Type = headerType
	msg.Header.Flags = syscall.NLM_F_REQUEST
	msg.GenHeader.Command = cmd
	msg.GenHeader.Version = 0x1
	msg.Data = attributes
	return msg
}
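When the kernel receives the request from cadvisor, it dispatches on the command CGROUPSTATS_CMD_GET, reads the cgroup file descriptor from the CGROUPSTATS_CMD_ATTR_FD attribute, counts the tasks in each state under the container's cgroup, fills the counts into cgroupstats, and returns it to cadvisor.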
static const struct genl_ops taskstats_ops[] = {
	{
		.cmd = TASKSTATS_CMD_GET,
		.doit = taskstats_user_cmd,
		.policy = taskstats_cmd_get_policy,
		.flags = GENL_ADMIN_PERM,
	},
	{
		.cmd = CGROUPSTATS_CMD_GET,
		.doit = cgroupstats_user_cmd,
		.policy = cgroupstats_cmd_get_policy,
	},
};

struct cgroupstats {
	__u64 nr_sleeping;		/* Number of tasks sleeping */
	__u64 nr_running;		/* Number of tasks running */
	__u64 nr_stopped;		/* Number of tasks in stopped state */
	__u64 nr_uninterruptible;	/* Number of tasks in uninterruptible state */
	__u64 nr_io_wait;		/* Number of tasks waiting on IO */
};

static int cgroupstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
{
	int rc = 0;
	struct sk_buff *rep_skb;
	struct cgroupstats *stats;
	struct nlattr *na;
	size_t size;
	u32 fd;
	struct fd f;

	na = info->attrs[CGROUPSTATS_CMD_ATTR_FD];
	if (!na)
		return -EINVAL;

	fd = nla_get_u32(info->attrs[CGROUPSTATS_CMD_ATTR_FD]);
	f = fdget(fd);
	if (!f.file)
		return 0;

	size = nla_total_size(sizeof(struct cgroupstats));

	rc = prepare_reply(info, CGROUPSTATS_CMD_NEW, &rep_skb,
				size);
	if (rc < 0)
		goto err;

	na = nla_reserve(rep_skb, CGROUPSTATS_TYPE_CGROUP_STATS,
				sizeof(struct cgroupstats));
	if (na == NULL) {
		nlmsg_free(rep_skb);
		rc = -EMSGSIZE;
		goto err;
	}

	stats = nla_data(na);
	memset(stats, 0, sizeof(*stats));

	rc = cgroupstats_build(stats, f.file->f_path.dentry);
	if (rc < 0) {
		nlmsg_free(rep_skb);
		goto err;
	}

	rc = send_reply(rep_skb, info);

err:
	fdput(f);
	return rc;
}

/**
 * cgroupstats_build - build and fill cgroupstats
 * @stats: cgroupstats to fill information into
 * @dentry: A dentry entry belonging to the cgroup for which stats have
 * been requested.
 *
 * Build and fill cgroupstats so that taskstats can export it to user
 * space.
 */
int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
{
	struct kernfs_node *kn = kernfs_node_from_dentry(dentry);
	struct cgroup *cgrp;
	struct css_task_iter it;
	struct task_struct *tsk;

	/* it should be kernfs_node belonging to cgroupfs and is a directory */
	if (dentry->d_sb->s_type != &cgroup_fs_type || !kn ||
	    kernfs_type(kn) != KERNFS_DIR)
		return -EINVAL;

	mutex_lock(&cgroup_mutex);

	/*
	 * We aren't being called from kernfs and there's no guarantee on
	 * @kn->priv's validity. For this and css_tryget_online_from_dir(),
	 * @kn->priv is RCU safe. Let's do the RCU dancing.
	 */
	rcu_read_lock();
	cgrp = rcu_dereference(*(void __rcu __force **)&kn->priv);
	if (!cgrp || cgroup_is_dead(cgrp)) {
		rcu_read_unlock();
		mutex_unlock(&cgroup_mutex);
		return -ENOENT;
	}
	rcu_read_unlock();

	/* Walk every task in the cgroup and count it by its current state. */
	css_task_iter_start(&cgrp->self, 0, &it);
	while ((tsk = css_task_iter_next(&it))) {
		switch (tsk->state) {
		case TASK_RUNNING:
			stats->nr_running++;
			break;
		case TASK_INTERRUPTIBLE:
			stats->nr_sleeping++;
			break;
		case TASK_UNINTERRUPTIBLE:
			stats->nr_uninterruptible++;
			break;
		case TASK_STOPPED:
			stats->nr_stopped++;
			break;
		default:
			if (delayacct_is_task_waiting_on_io(tsk))
				stats->nr_io_wait++;
			break;
		}
	}
	css_task_iter_end(&it);

	mutex_unlock(&cgroup_mutex);
	return 0;
}
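The per-state counting done by cgroupstats_build can be approximated from user space by looking at the state letter in /proc/<pid>/stat for each task. The sketch below is only an approximation under stated assumptions: cadvisor itself uses the netlink path shown above, not /proc; the taskCounts type and both functions are hypothetical; and the input lines are made-up, truncated samples.

```go
package main

import (
	"fmt"
	"strings"
)

// taskCounts loosely mirrors some counters of the kernel's cgroupstats.
// Hypothetical type for this sketch.
type taskCounts struct {
	Running, Sleeping, Uninterruptible, Stopped uint64
}

// stateOf extracts the state letter from one /proc/<pid>/stat line.
// The state is the first field after the closing parenthesis of the
// command name, which may itself contain spaces or parentheses.
func stateOf(statLine string) (byte, bool) {
	i := strings.LastIndexByte(statLine, ')')
	if i < 0 || i+2 >= len(statLine) {
		return 0, false
	}
	return statLine[i+2], true
}

// count classifies each line by its state letter:
// R running, S sleeping, D uninterruptible, T stopped.
func count(statLines []string) taskCounts {
	var c taskCounts
	for _, line := range statLines {
		st, ok := stateOf(line)
		if !ok {
			continue
		}
		switch st {
		case 'R':
			c.Running++
		case 'S':
			c.Sleeping++
		case 'D':
			c.Uninterruptible++
		case 'T':
			c.Stopped++
		}
	}
	return c
}

func main() {
	// Made-up sample lines, truncated after a few fields.
	lines := []string{
		"101 (nginx) S 1 101",
		"102 (nginx) R 101 101",
		"103 (sh) S 1 103",
	}
	fmt.Printf("%+v\n", count(lines))
}
```

Like the kernel loop, this is an instantaneous snapshot: a task that is in the R state only briefly is counted if and only if it happens to be in that state at the moment it is inspected.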
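Now that we know how container_cpu_load_average_10s is obtained, let's verify the result in a real environment.
Deploy a script that periodically collects the raw container_cpu_load_average_10s data: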
# cat get-container_cpu_load_average_10s.sh
#!/bin/bash
while true
do
    date
    kubectl get --raw /api/v1/nodes/eklet-subnet-g2wkclr1/proxy/metrics/cadvisor | grep load | grep fb88e098-31b2-4d5c-bbcf-5257361abc1f | grep -E 'web|shell'
    sleep 0.5
done
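If the collected lines need to be processed further rather than just printed, the sample value can be pulled out of each line with a few string operations. This is a minimal sketch, assuming the usual text exposition layout of metric name, label set in braces, value, and optional timestamp; parseMilliLoad is a hypothetical helper and the input line is made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const metricName = "container_cpu_load_average_10s"

// parseMilliLoad extracts the sample value from one exposition line such as
//   container_cpu_load_average_10s{container="web",pod="demo"} 632 1650000000000
// It looks for the last '}' because only numeric fields follow the label set.
func parseMilliLoad(line string) (float64, bool) {
	if !strings.HasPrefix(line, metricName+"{") {
		return 0, false
	}
	end := strings.LastIndexByte(line, '}')
	if end < 0 {
		return 0, false
	}
	fields := strings.Fields(line[end+1:])
	if len(fields) == 0 {
		return 0, false
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	// Made-up line; real output carries more labels.
	line := `container_cpu_load_average_10s{container="web",pod="demo"} 632 1650000000000`
	if v, ok := parseMilliLoad(line); ok {
		fmt.Printf("milli-load=%v load=%.3f\n", v, v/1000)
	}
}
```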
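Taking the sample shown in the figure as an example, the collected container_cpu_load_average_10s value is 632.
Following the algorithm in the code: if the number of running threads sampled at the previous collection was 0 (so the previous cd.loadAvg is 0) and the number sampled 10 seconds later is 1, the value works out as:
cd.loadAvg = cd.loadAvg*cd.loadDecay + float64(newLoad)*(1.0-cd.loadDecay) = 0*0.36787944117144233 + 1*0.6321205588285577 = 0.632
container_cpu_load_average_10s = 0.632*1000 = 632
With a CPU limit of 2C on the pod, the dashboard expression below gives a final load value of 0.316:
max by (pod) (container_cpu_load_average_10s{container!="",container!~"sandbox|logrotate|sidecar",pod=~"$pod", container=~"$container"}) / 1000 / max by (pod) (kube_pod_container_resource_limits_cpu_cores{container!="",container!~"sandbox|logrotate|sidecar",pod=~"$pod", container=~"$container"})
632/1000/kube_pod_container_resource_limits_cpu_cores = 632/1000/2 = 0.316
So a single task that happens to be in the running state at one sampling instant is enough to push the displayed value to 0.316 for this pod, even without business traffic.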
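The last step of the arithmetic can be written as a one-line function, which makes it easy to see how the same raw sample maps to different dashboard values for different CPU limits. normalizedLoad is a hypothetical name, and the 1C and 4C limits are made-up inputs added for comparison with the 2C case above.

```go
package main

import "fmt"

// normalizedLoad applies the dashboard arithmetic:
// milli-load / 1000 / CPU limit in cores.
func normalizedLoad(milliLoad int32, cpuLimitCores float64) float64 {
	return float64(milliLoad) / 1000 / cpuLimitCores
}

func main() {
	// 632 is the raw sample from the example; the limits are illustrative.
	for _, limit := range []float64{1, 2, 4} {
		fmt.Printf("limit=%vC value=%.3f\n", limit, normalizedLoad(632, limit))
	}
}
```

For a raw sample of 632 this gives 0.632 at 1C, 0.316 at 2C, and 0.158 at 4C, all of which fall inside the 0 to 1 range reported for the idle pods.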
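This matches the value collected by monitoring.
Note: the kernel source tree ships a tool that can read the number of running tasks in a cgroup, tools/accounting/getdelays.c (for details see https://utcc.utoronto.ca/~cks/space/blog/linux/LoadAverageWhereFrom).
Reference: https://github.com/google/cadvisor/issues/2286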