I. Problem description:
A cgroup OOM caused the node to reboot abnormally.
II. Problem analysis
1. The dmesg log shows that the node was indeed hitting frequent cgroup OOMs before the reboot:
2. The vmcore log shows that the machine was rebooted automatically by the watchdog:
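Both log checks above can be reproduced directly from the crash prompt against the vmcore; the grep patterns below are only illustrative search strings, not exact kernel message text:
crash> log | grep -i "out of memory"
crash> log | grep -iE "watchdog|lockup|panic"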
3. The watchdog fired because CPU 21 had been holding a spin lock for more than 20 seconds, so the watchdog/21 thread could not be scheduled, which triggered the panic and reboot:
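One way to see what CPU 21 was executing when the watchdog fired is crash's bt command with a CPU filter (the CPU number is taken from the panic message above):
crash> bt -c 21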
4. The spinlock information shows that one process holds the lock while another 7 processes are waiting for it:
5. Finding the process that holds the spinlock
Searching the stack traces turns up the processes involved with the spinlock:
Examining the stacks of those 8 processes confirms that PID 27033 currently holds the spinlock, while the other processes are waiting for it.
From the stack trace above, 27033 has not released the spinlock because it is waiting for a mutex; the mutex acquisition failed, so it went into the UN (uninterruptible) state:
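For example, the kernel stack of each suspect can be dumped with crash's bt command using the PIDs found above (shown here for 27033; the other seven are inspected the same way):
crash> bt 27033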
6. Finding the process that holds the mutex
Combining the stack of 27033 with the kernel source shows that 27033 is waiting for the cpu_hotplug lock:
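For reference, a simplified sketch of get_online_cpus() based on a 3.10-era kernel/cpu.c (the exact kernel in this dump may differ slightly); it takes the cpu_hotplug.lock mutex, which is the mutex 27033 is blocked on:

void get_online_cpus(void)
{
        might_sleep();
        if (cpu_hotplug.active_writer == current)
                return;
        mutex_lock(&cpu_hotplug.lock);   /* 27033 blocks here */
        cpu_hotplug.refcount++;
        mutex_unlock(&cpu_hotplug.lock);
}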
Normally the owner field of a mutex records the task that currently holds it. In this case, however, owner is 0, so the holder cannot be identified directly:
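For context, a simplified struct mutex layout from a 3.10-era include/linux/mutex.h (the exact fields depend on the kernel config); owner is only a best-effort hint, and wait_list is what the next step walks:

struct mutex {
        atomic_t                count;      /* 1: unlocked, 0: locked, <0: locked, waiters queued */
        spinlock_t              wait_lock;
        struct list_head        wait_list;  /* list of struct mutex_waiter, walked below */
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
        struct task_struct      *owner;     /* reads as 0 in this dump */
#endif
};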
List all processes waiting for this mutex:
Save every task waiting on the mutex into the file task_list.txt:
list mutex_waiter.list -s mutex_waiter.task 0xffff88078f2cbb20 | grep task | grep -v 0x0 > task_list.txt
Open task_list.txt in vi:
vi task_list.txt
In vi, run the following command to replace "task =" with "struct task_struct", so that each line becomes a crash "struct task_struct <address>" command:
:%s/task =/struct task_struct/g
Back in the crash console, run <./task_list.txt | grep -wE 'state|pid' | grep , to source the file and list the state and PID of every process waiting for the mutex:
crash> <./task_list.txt | grep -wE 'pid|state' | grep ,
state = 0,
pid = 13339,
state = 2,
pid = 26954,
state = 2,
pid = 26844,
state = 2,
pid = 27024,
state = 2,
pid = 26827,
state = 2,
pid = 27033,
state = 2,
pid = 27025,
state = 2,
pid = 27046,
state = 2,
pid = 26825,
state = 2,
pid = 26614,
state = 2,
pid = 26974,
state = 2,
pid = 26757,
state = 2,
pid = 27020,
state = 2,
pid = 26952,
state = 2,
pid = 21451,
state = 2,
pid = 21545,
state = 2,
pid = 27044,
state = 2,
pid = 21492,
state = 2,
pid = 27031,
state = 2,
pid = 26826,
state = 2,
pid = 27009,
state = 2,
pid = 21547,
state = 2,
pid = 27023,
state = 2,
pid = 26822,
state = 2,
pid = 27063,
state = 2,
pid = 26978,
state = 2,
pid = 27037,
state = 2,
pid = 21501,
state = 2,
pid = 27065,
state = 2,
pid = 26618,
state = 2,
pid = 27018,
state = 2,
pid = 27068,
state = 2,
pid = 27066,
state = 2,
pid = 27057,
state = 2,
pid = 27212,
state = 2,
pid = 27211,
state = 2,
pid = 27386,
state = 2,
pid = 27485,
state = 2,
pid = 27524,
state = 2,
pid = 27538,
state = 2,
pid = 27573,
crash>
The output above shows that process 13339 has state 0 while every other process has state 2. The kernel source shows that state 0 means TASK_RUNNING:
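The relevant state values, from a 3.10-era include/linux/sched.h:

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2   /* shown as "UN" by crash's ps command */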
If process 13339 were still waiting for the mutex, its state would be TASK_UNINTERRUPTIBLE rather than the TASK_RUNNING state seen here, so we can infer that 13339 is the one holding the mutex. The mutex_lock() implementation acquires the lock first and only afterwards sets owner to the current task, so if 13339 was scheduled out right after taking the mutex but before setting owner, owner would be left at 0 exactly as observed:
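A simplified excerpt of mutex_lock() based on a 3.10-era kernel/mutex.c illustrates that window: the lock is taken in the fastpath first, and owner is recorded only afterwards, so a task scheduled out between the two steps leaves owner at 0:

void __sched mutex_lock(struct mutex *lock)
{
        might_sleep();
        /* fastpath: atomic 1 -> 0 transition of lock->count; on success the mutex is held */
        __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
        /* owner is set only here, after the lock is already held */
        mutex_set_owner(lock);
}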
7. Why does process 13339 never release the mutex?
The runq information in the vmcore shows that at the time of the kdump, process 13339 was queued on CPU 1, and because 13339 is pinned to that CPU (its affinity can be checked as shown right after the runq output below), it cannot be migrated to another CPU to run:
CPU 1 RUNQUEUE: ffff88081fc31dc0
CURRENT: PID: 27050 TASK: ffff8806b5f7cb30 COMMAND: "ps"
RT PRIO_ARRAY: ffff88081fc31f40
[ 0] PID: 12 TASK: ffff8808145a3220 COMMAND: "watchdog/1"
CFS RB_ROOT: ffff88081fc31e58
[120] PID: 20698 TASK: ffff8807aa36b220 COMMAND: "TSW/worker/1"
[120] PID: 27478 TASK: ffff8806d9189910 COMMAND: "node"
[120] PID: 24371 TASK: ffff880f22598000 COMMAND: "sleep"
[120] PID: 13339 TASK: ffff880714669910 COMMAND: "TSW/worker/1"
[120] PID: 10377 TASK: ffff8807a89a1910 COMMAND: "kworker/1:0"
[120] PID: 14 TASK: ffff8808145a6440 COMMAND: "ksoftirqd/1"
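If needed, the pinning of 13339 can be confirmed from the crash prompt by dumping its affinity mask (the member is cpus_allowed on 3.10-era kernels; the name may differ on other versions):
crash> task -R cpus_allowed 13339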
The task currently running on CPU 1 is PID 27050, which keeps occupying the CPU because it cannot obtain a spin_lock; as a result, 13339 never gets CPU time and therefore never releases the mutex:
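The reason a spinning task monopolizes the CPU is that spin_lock() disables preemption before busy-waiting, so the scheduler never gets a chance to run 13339 on CPU 1. A simplified sketch of the locking path (details vary by kernel version and config):

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
        preempt_disable();           /* no task switch on this CPU until the lock is released */
        do_raw_spin_lock(lock);      /* busy-wait until the ticket lock is granted */
}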
8. Why can't 27050 obtain the spinlock?
Searching the stacks for the lock address ffffea003fe216b0 shows that processes 27033, 27041, and 27050 all reference the same spin_lock; the lock is currently held by 27033, while 27041 and 27050 are waiting for it.
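The stack search itself can be done with crash's search command restricted to task stacks, using the lock address above:
crash> search -t ffffea003fe216b0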
9. Conclusion:
Process 27033 acquired the spin_lock, then failed to acquire the mutex and went into the UN state. The mutex holder, process 13339, could never be scheduled because CPU 1 was continuously occupied by process 27050, so the mutex was never released; and 27050 kept occupying CPU 1 precisely because it was waiting for the spinlock held by 27033.
Put simply:
27033 waits for the mutex held by 13339; 13339 waits for CPU 1; CPU 1 is occupied by 27050; and 27050 waits for the spinlock held by 27033. This cycle is a kernel deadlock.
This article is limited to how the deadlock was analyzed; the actual fix involves code changes and is not covered further here.