How container_cpu_load_average_10s Is Calculated

A business team reported that, for pods with no business traffic, the monitored value based on the container_cpu_load_average_10s metric keeps fluctuating between 0 and 1, and wanted to understand why. The monitoring formula is:

max by (pod) (container_cpu_load_average_10s{container!="",container!~"sandbox|logrotate|sidecar",pod=~"$pod", container=~"$container"}) / 1000 / max by (pod) (kube_pod_container_resource_limits_cpu_cores{container!="",container!~"sandbox|logrotate|sidecar",pod=~"$pod", container=~"$container"})

From the formula, the result depends mainly on the container_cpu_load_average_10s value obtained at each collection, so let's look at how container_cpu_load_average_10s is computed.

Start with cAdvisor's updateLoad function, which computes loadAvg with the following formula:

cd.loadAvg = cd.loadAvg*cd.loadDecay + float64(newLoad)*(1.0-cd.loadDecay)

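The formula takes the value from the previous collection, cd.loadAvg, multiplies it by the decay factor cd.loadDecay, and adds the newly sampled newLoad multiplied by (1.0-cd.loadDecay) to obtain the current cd.loadAvg. Multiplying cd.loadAvg by 1000 and truncating to an integer gives the value of container_cpu_load_average_10s.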

The key questions are therefore how cd.loadDecay and the newLoad argument of updateLoad are obtained.


// Calculate new smoothed load average using the new sample of runnable threads.
// The decay used ensures that the load will stabilize on a new constant value within
// 10 seconds.
func (cd *containerData) updateLoad(newLoad uint64) {
        if cd.loadAvg < 0 {
                cd.loadAvg = float64(newLoad) // initialize to the first seen sample for faster stabilization.
        } else {
                cd.loadAvg = cd.loadAvg*cd.loadDecay + float64(newLoad)*(1.0-cd.loadDecay)
        }
}

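By default cd.loadDecay is a fixed 0.36787944117144233, so the weight of the new sample, 1.0-cd.loadDecay, is 0.6321205588285577. It is set as follows: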


func newContainerData(containerName string, memoryCache *memory.InMemoryCache, handler container.ContainerHandler, logUsage bool, collectorManager collector.CollectorManager, maxHousekeepingInterval time.Duration, allowDynamicHousekeeping bool, clock clock.Clock) (*containerData, error) {
    ......
    cont.loadDecay = math.Exp(float64(-cont.housekeepingInterval.Seconds() / 10))
    ......
}

The code above shows that cd.loadDecay is computed as:

cont.loadDecay = math.Exp(float64(-cont.housekeepingInterval.Seconds() / 10))

housekeepingInterval is 10 seconds by default here, so a short program can plug in the numbers and compute cont.loadDecay directly.

The result is cont.loadDecay = 0.36787944117144233 and 1.0-cont.loadDecay = 0.6321205588285577.


[root@VM-10-5-centos goproject]# cat test.go
package main

import (
        "fmt"
        "math"
)

func main() {
        // -10/10 stands for -housekeepingInterval.Seconds()/10 with a 10s interval.
        x := math.Exp(float64(-10 / 10))
        fmt.Printf("loadDecay=%v, 1.0-loadDecay=%v\n", x, 1.0-x)
}
[root@VM-10-5-centos goproject]# go run test.go
loadDecay=0.36787944117144233, 1.0-loadDecay=0.6321205588285577

Where does the newLoad argument of updateLoad(newLoad uint64) come from?

When updateStats calls updateLoad to refresh the load, it passes loadStats.NrRunning as the newLoad argument.


func (cd *containerData) updateStats() error {
        stats, statsErr := cd.handler.GetStats()
        if statsErr != nil {
                // Ignore errors if the container is dead.
                if !cd.handler.Exists() {
                        return nil
                }

                // Stats may be partially populated, push those before we return an error.
                statsErr = fmt.Errorf("%v, continuing to push stats", statsErr)
        }
        if stats == nil {
                return statsErr
        }
        if cd.loadReader != nil {
                // TODO(vmarmol): Cache this path.
                path, err := cd.handler.GetCgroupPath("cpu")
                if err == nil {
                        loadStats, err := cd.loadReader.GetCpuLoad(cd.info.Name, path)
                        if err != nil {
                                return fmt.Errorf("failed to get load stat for %q - path %q, error %s", cd.info.Name, path, err)
                        }
                        stats.TaskStats = loadStats
                        cd.updateLoad(loadStats.NrRunning)
                        // convert to 'milliLoad' to avoid floats and preserve precision.
                        stats.Cpu.LoadAverage = int32(cd.loadAvg * 1000)
                }
        }
       ......
    }

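cAdvisor sends a request message to the kernel through the following call chain. The command of the message is CGROUPSTATS_CMD_GET, and the file descriptor of the cgroup directory is carried in the CGROUPSTATS_CMD_ATTR_FD attribute:

updateStats -> GetCpuLoad -> getLoadStats -> prepareCmdMessage -> conn.WriteMessage

After sending the message, cAdvisor waits in conn.ReadMessage() for the kernel's reply. It then parses the reply to obtain the number of tasks in each state under the container's cgroup and stores the counts in LoadStats.

loadStats.NrRunning is the number of tasks in the TASK_RUNNING state (running or runnable) at the moment of collection.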



// This mirrors kernel internal structure.
type LoadStats struct {
        // Number of sleeping tasks.
        NrSleeping uint64 `json:"nr_sleeping"`

        // Number of running tasks.
        NrRunning uint64 `json:"nr_running"`

        // Number of tasks in stopped state
        NrStopped uint64 `json:"nr_stopped"`

        // Number of tasks in uninterruptible state
        NrUninterruptible uint64 `json:"nr_uninterruptible"`

        // Number of tasks waiting on IO
        NrIoWait uint64 `json:"nr_io_wait"`
}

// Returns instantaneous number of running tasks in a group.
// Caller can use historical data to calculate cpu load.
// path is an absolute filesystem path for a container under the CPU cgroup hierarchy.
// NOTE: non-hierarchical load is returned. It does not include load for subcontainers.
func (r *NetlinkReader) GetCpuLoad(name string, path string) (info.LoadStats, error) {
        if len(path) == 0 {
                return info.LoadStats{}, fmt.Errorf("cgroup path can not be empty")
        }

        cfd, err := os.Open(path)
        if err != nil {
                return info.LoadStats{}, fmt.Errorf("failed to open cgroup path %s: %q", path, err)
        }
        defer cfd.Close()

        stats, err := getLoadStats(r.familyID, cfd, r.conn)
        if err != nil {
                return info.LoadStats{}, err
        }
        klog.V(4).Infof("Task stats for %q: %+v", path, stats)
        return stats, nil
}


// Get load stats for a task group.
// id: family id for taskstats.
// cfd: open file to path to the cgroup directory under cpu hierarchy.
// conn: open netlink connection used to communicate with kernel.
func getLoadStats(id uint16, cfd *os.File, conn *Connection) (info.LoadStats, error) {
        msg := prepareCmdMessage(id, cfd.Fd())
        err := conn.WriteMessage(msg.toRawMsg())
        if err != nil {
                return info.LoadStats{}, err
        }

        resp, err := conn.ReadMessage()
        if err != nil {
                return info.LoadStats{}, err
        }

        parsedmsg, err := parseLoadStatsResp(resp)
        if err != nil {
                return info.LoadStats{}, err
        }
        return parsedmsg.Stats, nil
}

// Prepares message to query task stats for a task group.
func prepareCmdMessage(id uint16, cfd uintptr) (msg netlinkMessage) {
        buf := bytes.NewBuffer([]byte{})
        addAttribute(buf, unix.CGROUPSTATS_CMD_ATTR_FD, uint32(cfd), 4)
        return prepareMessage(id, unix.CGROUPSTATS_CMD_GET, buf.Bytes())
}

// Prepares the message and generic headers and appends attributes as data.
func prepareMessage(headerType uint16, cmd uint8, attributes []byte) (msg netlinkMessage) {
        msg.Header.Type = headerType
        msg.Header.Flags = syscall.NLM_F_REQUEST
        msg.GenHeader.Command = cmd
        msg.GenHeader.Version = 0x1
        msg.Data = attributes
        return msg
}

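When the kernel receives the CGROUPSTATS_CMD_GET request from cAdvisor, it uses the file descriptor in the CGROUPSTATS_CMD_ATTR_FD attribute to locate the container's cgroup, counts the tasks in each state under that cgroup, fills the counts into a cgroupstats structure, and returns it to cAdvisor.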


static const struct genl_ops taskstats_ops[] = {
        {
                .cmd            = TASKSTATS_CMD_GET,
                .doit           = taskstats_user_cmd,
                .policy         = taskstats_cmd_get_policy,
                .flags          = GENL_ADMIN_PERM,
        },
        {
                .cmd            = CGROUPSTATS_CMD_GET,
                .doit           = cgroupstats_user_cmd,
                .policy         = cgroupstats_cmd_get_policy,
        },
};

struct cgroupstats {
        __u64   nr_sleeping;            /* Number of tasks sleeping */
        __u64   nr_running;             /* Number of tasks running */
        __u64   nr_stopped;             /* Number of tasks in stopped state */
        __u64   nr_uninterruptible;     /* Number of tasks in uninterruptible */
                                        /* state */
        __u64   nr_io_wait;             /* Number of tasks waiting on IO */
};

static int cgroupstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
{
        int rc = 0;
        struct sk_buff *rep_skb;
        struct cgroupstats *stats;
        struct nlattr *na;
        size_t size;
        u32 fd;
        struct fd f;

        na = info->attrs[CGROUPSTATS_CMD_ATTR_FD];
        if (!na)
                return -EINVAL;

        fd = nla_get_u32(info->attrs[CGROUPSTATS_CMD_ATTR_FD]);
        f = fdget(fd);
        if (!f.file)
                return 0;

        size = nla_total_size(sizeof(struct cgroupstats));

        rc = prepare_reply(info, CGROUPSTATS_CMD_NEW, &rep_skb,
                                size);
        if (rc < 0)
                goto err;

        na = nla_reserve(rep_skb, CGROUPSTATS_TYPE_CGROUP_STATS,
                                sizeof(struct cgroupstats));
        if (na == NULL) {
                nlmsg_free(rep_skb);
                rc = -EMSGSIZE;
                goto err;
        }

        stats = nla_data(na);
        memset(stats, 0, sizeof(*stats));

        rc = cgroupstats_build(stats, f.file->f_path.dentry);
        if (rc < 0) {
                nlmsg_free(rep_skb);
                goto err;
        }

        rc = send_reply(rep_skb, info);

err:
        fdput(f);
        return rc;                         
 }
 
 
/**
 * cgroupstats_build - build and fill cgroupstats
 * @stats: cgroupstats to fill information into
 * @dentry: A dentry entry belonging to the cgroup for which stats have
 * been requested.
 *
 * Build and fill cgroupstats so that taskstats can export it to user
 * space.
 */
int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
{
        struct kernfs_node *kn = kernfs_node_from_dentry(dentry);
        struct cgroup *cgrp;
        struct css_task_iter it;
        struct task_struct *tsk;

        /* it should be kernfs_node belonging to cgroupfs and is a directory */
        if (dentry->d_sb->s_type != &cgroup_fs_type || !kn ||
            kernfs_type(kn) != KERNFS_DIR)
                return -EINVAL;

        mutex_lock(&cgroup_mutex);

        /*
         * We aren't being called from kernfs and there's no guarantee on
         * @kn->priv's validity.  For this and css_tryget_online_from_dir(),
         * @kn->priv is RCU safe.  Let's do the RCU dancing.
         */
        rcu_read_lock();
        cgrp = rcu_dereference(*(void __rcu __force **)&kn->priv);
        if (!cgrp || cgroup_is_dead(cgrp)) {
                rcu_read_unlock();
                mutex_unlock(&cgroup_mutex);
                return -ENOENT;
        }
        rcu_read_unlock();

        css_task_iter_start(&cgrp->self, 0, &it);
        while ((tsk = css_task_iter_next(&it))) {
                switch (tsk->state) {
                case TASK_RUNNING:
                        stats->nr_running++;
                        break;
                case TASK_INTERRUPTIBLE:
                        stats->nr_sleeping++;
                        break;
                case TASK_UNINTERRUPTIBLE:
                        stats->nr_uninterruptible++;
                        break;
                case TASK_STOPPED:
                        stats->nr_stopped++;
                        break;
                default:
                        if (delayacct_is_task_waiting_on_io(tsk))
                                stats->nr_io_wait++;
                        break;
                }
        }
        css_task_iter_end(&it);

        mutex_unlock(&cgroup_mutex);
        return 0;
}
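The kernel loop above tallies tasks by scheduler state. As a rough userspace cross-check, the same kind of tally can be approximated from the state letter in /proc stat files (R running, S sleeping, D uninterruptible, T stopped). This is only a sketch under assumptions: it is not what cAdvisor or the kernel does, the helper names `taskState` and `countStates` are hypothetical, and for simplicity it inspects the threads of the current process rather than the tasks of a container's cgroup.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// taskState extracts the state letter from one /proc/<pid>/task/<tid>/stat line.
// The comm field is wrapped in parentheses and may itself contain spaces or ')',
// so the state is taken as the first field after the last ')'.
func taskState(statLine string) (byte, error) {
	i := strings.LastIndexByte(statLine, ')')
	if i < 0 {
		return 0, fmt.Errorf("malformed stat line: %q", statLine)
	}
	fields := strings.Fields(statLine[i+1:])
	if len(fields) == 0 || len(fields[0]) != 1 {
		return 0, fmt.Errorf("missing state in stat line: %q", statLine)
	}
	return fields[0][0], nil
}

// countStates tallies state letters over a set of stat lines, skipping malformed ones.
func countStates(statLines []string) map[byte]int {
	counts := make(map[byte]int)
	for _, line := range statLines {
		if st, err := taskState(line); err == nil {
			counts[st]++
		}
	}
	return counts
}

func main() {
	// Illustration only: threads of this process. For a container one would
	// instead walk the thread IDs that belong to its cgroup.
	paths, _ := filepath.Glob("/proc/self/task/*/stat")
	var lines []string
	for _, p := range paths {
		if b, err := os.ReadFile(p); err == nil {
			lines = append(lines, string(b))
		}
	}
	for st, n := range countStates(lines) {
		fmt.Printf("state %c: %d\n", st, n)
	}
}
```

Like the netlink query, this is an instantaneous snapshot, so repeated runs on an otherwise idle process can report a different number of tasks in state R.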

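Having analyzed how container_cpu_load_average_10s is obtained, let's verify the result in a real environment.

Deploy a script that periodically collects the raw container_cpu_load_average_10s data: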


#cat get-container_cpu_load_average_10s.sh

#!/bin/bash
while true
do
  date
  kubectl get --raw /api/v1/nodes/eklet-subnet-g2wkclr1/proxy/metrics/cadvisor | grep load | grep fb88e098-31b2-4d5c-bbcf-5257361abc1f | grep -E 'web|shell'
  sleep 0.5
done

Taking the captured output as an example, the collected container_cpu_load_average_10s value is 632.

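According to the algorithm in the code, if the number of running threads sampled at the previous collection was 0 and the number sampled 10 seconds later is 1, the value works out as: cd.loadAvg = cd.loadAvg*cd.loadDecay + float64(newLoad)*(1.0-cd.loadDecay) = 0*0.36787944117144233 + 1*0.6321205588285577 = 0.632

container_cpu_load_average_10s = 0.632*1000 = 632

So even on a pod with no business traffic, a single thread that happens to be runnable at the sampling instant is enough to lift the metric above 0.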

With a pod CPU limit of 2 cores, the monitoring formula below gives a final load value of 0.316:

max by (pod) (container_cpu_load_average_10s{container!="",container!~"sandbox|logrotate|sidecar",pod=~"$pod", container=~"$container"}) / 1000 / max by (pod) (kube_pod_container_resource_limits_cpu_cores{container!="",container!~"sandbox|logrotate|sidecar",pod=~"$pod", container=~"$container"})

632/1000/kube_pod_container_resource_limits_cpu_cores=632/1000/2=0.316

The value collected by the monitoring system corresponds to this result.

Note: the kernel source tree ships a tool, tools/accounting/getdelays.c, that can be used to obtain the number of running tasks in a cgroup (for details see: https://utcc.utoronto.ca/~cks/space/blog/linux/LoadAverageWhereFrom)

Reference: https://github.com/google/cadvisor/issues/2286
