起因
有同事联系我说公司内部云上一台虚拟机crash,有环境可以看vmcore。
代码语言:javascript复制[root@p32486v ~]# crash /usr/lib/debug/lib/modules/3.10.0-957.21.3.el7.x86_64/vmlinux vmcore
crash> bt
PID: 0 TASK: ffff8a4d3a338000 CPU: 4 COMMAND: "swapper/4"
#0 [ffff8a4fffd03670] machine_kexec at ffffffffa6263934
#1 [ffff8a4fffd036d0] __crash_kexec at ffffffffa631d162
#2 [ffff8a4fffd037a0] crash_kexec at ffffffffa631d250
#3 [ffff8a4fffd037b8] oops_end at ffffffffa696d778
#4 [ffff8a4fffd037e0] no_context at ffffffffa695bdbe
#5 [ffff8a4fffd03830] __bad_area_nosemaphore at ffffffffa695be55
#6 [ffff8a4fffd03880] bad_area_nosemaphore at ffffffffa695bfc6
#7 [ffff8a4fffd03890] __do_page_fault at ffffffffa69706d0
#8 [ffff8a4fffd03900] trace_do_page_fault at ffffffffa69709d6
#9 [ffff8a4fffd03940] do_async_page_fault at ffffffffa696ff62
#10 [ffff8a4fffd03960] async_page_fault at ffffffffa696c798
[exception RIP: __kmalloc 148]
RIP: ffffffffa641d8e4 RSP: ffff8a4fffd03a18 RFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff8a4fcb2487c0 RCX: 000000000000a64f
RDX: 000000000000a64e RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff8a4fffd03a48 R8: 000000000001f040 R9: ffffffffc03e049b
R10: ffff8a4d3fc03e00 R11: 0000000000000000 R12: 0000000000000020
R13: ffff8a007365735f R14: 0000000000000008 R15: ffff8a4d3fc03e00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#11 [ffff8a4fffd03a50] tcp_v4_syn_recv_sock_toa at ffffffffc03e049b [toa]
#12 [ffff8a4fffd03ac0] tcp_check_req at ffffffffa68abfcf
#13 [ffff8a4fffd03b30] tcp_v4_do_rcv at ffffffffa68a999f
#14 [ffff8a4fffd03b78] tcp_v4_rcv at ffffffffa68aaf6c
#15 [ffff8a4fffd03c10] ip_local_deliver_finish at ffffffffa68838dd
#16 [ffff8a4fffd03c38] ip_local_deliver at ffffffffa6883bc9
#17 [ffff8a4fffd03c90] ip_rcv_finish at ffffffffa6883540
#18 [ffff8a4fffd03cb8] ip_rcv at ffffffffa6883ef9
#19 [ffff8a4fffd03d20] __netif_receive_skb_core at ffffffffa683a579
#20 [ffff8a4fffd03d90] __netif_receive_skb at ffffffffa683a878
#21 [ffff8a4fffd03db0] netif_receive_skb_internal at ffffffffa683a900
#22 [ffff8a4fffd03de0] napi_gro_receive at ffffffffa683b588
#23 [ffff8a4fffd03e08] virtnet_poll at ffffffffc03ed1f5 [virtio_net]
#24 [ffff8a4fffd03e78] net_rx_action at ffffffffa683af1f
#25 [ffff8a4fffd03ef8] __do_softirq at ffffffffa62a1075
#26 [ffff8a4fffd03f68] call_softirq at ffffffffa697932c
#27 [ffff8a4fffd03f80] do_softirq at ffffffffa622e675
#28 [ffff8a4fffd03fa0] irq_exit at ffffffffa62a13f5
#29 [ffff8a4fffd03fb8] do_IRQ at ffffffffa697a606
--- <IRQ stack> ---
#30 [ffff8a4d3a383df8] ret_from_intr at ffffffffa696c362
[exception RIP: native_safe_halt 11]
RIP: ffffffffa696af5b RSP: ffff8a4d3a383ea8 RFLAGS: 00000246
RAX: ffffffffa696ad30 RBX: ffff8a4fffd15ec0 RCX: 0100000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
RBP: ffff8a4d3a383ea8 R8: 0000000000000000 R9: ffff8a4fe728bda8
R10: 0000000000000000 R11: 7fffffffffffffff R12: 0000000000000004
R13: 000005fb53bfaa00 R14: cc9a9800d079cbbe R15: 0000000000000086
ORIG_RAX: ffffffffffffff3c CS: 0010 SS: 0018
#31 [ffff8a4d3a383eb0] default_idle at ffffffffa696ad4e
#32 [ffff8a4d3a383ed0] arch_cpu_idle at ffffffffa62366f0
#33 [ffff8a4d3a383ee0] cpu_startup_entry at ffffffffa62fc6da
#34 [ffff8a4d3a383f28] start_secondary at ffffffffa6258047
#35 [ffff8a4d3a383f50] start_cpu at ffffffffa62000d5
crash>
草率 迷糊
为什么cpu在内核收包时突然发生了异常,异常号是14,执行async_page_fault处理异常,处理异常时发现发现cr2中地址处于内核空间中,内核是不会发生page fault的,async_page_fault MMU会触发,kvm环境下kvm也会触发,那一定是kvm的问题了。
guest初始化时设置异常处理函数,并且write msr把地址的一块物理内相给了kvm,说你要给我触发异常就写这个物理地址。
代码语言:javascript复制kvm_guest_init
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
x86_init.irqs.trap_init = kvm_apf_trap_init;
static void __init kvm_apf_trap_init(void)
{
set_intr_gate(14, async_page_fault);
}
static void kvm_guest_cpu_init(void)
{
if (!kvm_para_available())
return;
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
u64 pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
#ifdef CONFIG_PREEMPT
pa |= KVM_ASYNC_PF_SEND_ALWAYS;
#endif
wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
__this_cpu_write(apf_reason.enabled, 1);
printk(KERN_INFO"KVM setup async PF for cpu %dn",
smp_processor_id());
}
}
异常来了,kvm开始处理,发现reason不是kvm,然后进入普通处理流程,发现pagefault地址在内核空间,内核不能发生pagefault,crash了。
代码语言:javascript复制async_page_fault->do_async_page_fault
u32 kvm_read_and_reset_pf_reason(void)
{
u32 reason = 0;
if (__this_cpu_read(apf_reason.enabled)) {
reason = __this_cpu_read(apf_reason.reason);
__this_cpu_write(apf_reason.reason, 0);
}
return reason;
}
do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
{
enum ctx_state prev_state;
switch (kvm_read_and_reset_pf_reason()) {
default:
trace_do_page_fault(regs, error_code);
break;
case KVM_PV_REASON_PAGE_NOT_PRESENT:
/* page is swapped out by the host. */
prev_state = exception_enter();
exit_idle();
kvm_async_pf_task_wait((u32)read_cr2());
exception_exit(prev_state);
break;
case KVM_PV_REASON_PAGE_READY:
rcu_irq_enter();
exit_idle();
kvm_async_pf_task_wake((u32)read_cr2());
rcu_irq_exit();
break;
}
}
dotraplinkage void __kprobes
trace_do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
enum ctx_state prev_state;
/*
* The exception_enter and tracepoint processing could
* trigger another page faults (user space callchain
* reading) and destroy the original cr2 value, so read
* the faulting address now.
*/
unsigned long address = read_cr2();
prev_state = exception_enter();
trace_page_fault_entries(regs, error_code);
__do_page_fault(regs, error_code, address);
exception_exit(prev_state);
}
最终决定把kvm apf关了尝试一下,guest kernel加参数no-kvmapf,guest不再写MSR_KVM_ASYNC_PF_EN,那么kvm也就不再给guest inject这样的irq,guest就不会执行到async_page_fault。
第二天早上一看,又crash重启了,决定回头再分析一下。
如果host把guest的page swap out了, guest vcpu上运行的进程触发了EPT violation要等kvm把page swap in才能恢复运行,要等很长时间,所以asyn page fault就想着把发生EPT violation的vcpu换个进程运行,就注入一个异常通知vcpu切换进程,等page swap in了再注入异常通知vcpu可以运行刚才被切换走的进程了,host上内存足够用,没有把任何内存swap out出去。
想不明白为什么要把这种机制和内核pagefault handler混在一起,看了代码感觉cpu 发生pagefault和kvm有关系,其实cpu发生pagefault和kvm EPT violation没什么关系,各干各的活,不细想就迷糊了,第一天就是没想明白,有点迷糊。
再次尝试
toa是什么东西,网上搜索发现是配合lvs/dpvs用的,正好组里有做lvs和dpvs的同事,要了代码看了看,其实代码是开源的,最早开源的toa只有ipv4的,ipv4地址只有32bit,系统调用一个寄存器就传到用户态了,可ipv6地址有128 bit,一个寄存器放不下,只能内核kmalloc暂存,copy到用户态,然后内核再找个时机kfree,公司同事就是这样修改了。
怀疑toa模块中kmalloc后memcpy时长度不对,把重要数据覆盖了。地址不对,长度不对,非法报文都排查了一遍,看代码没问题。
代码语言:javascript复制while (length > 0) {
int opcode = *ptr ;
int opsize;
switch (opcode) {
case TCPOPT_EOL:
return NULL;
case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
length--;
continue;
default:
opsize = *ptr ;
if (opsize < 2) /* "silly options" */
return NULL;
if (opsize > length)
return NULL; /* don't parse partial options */
if ((TCPOPT_TOA == opcode && TCPOLEN_TOA == opsize) ||
(TCPOPT_TOA_V6 == opcode && TCPOLEN_TOA_V6 == opsize) ) {
memcpy(&tdata, ptr - 2, opsize);
//TOA_DBG("find toa data: ip = %u.%u.%u.%u, port = %un", NIPQUAD(tdata.ip),
//ntohs(tdata.port));
ret_ptr = kmalloc(opsize, GFP_ATOMIC);
if (! ret_ptr)
return NULL;
memcpy(ret_ptr, &tdata, opsize);
return ret_ptr;
}
ptr = opsize - 2;
length -= opsize;
}
}
怀疑cpu barrie或者aligned不对,然后看汇编似乎也没问题。
代码语言:javascript复制crash> dis tcp_v4_syn_recv_sock_toa
0xffffffffc03e043b <tcp_v4_syn_recv_sock_toa 299>: cmp $0x1,%esi
0xffffffffc03e043e <tcp_v4_syn_recv_sock_toa 302>: jle 0xffffffffc03e03f8 <tcp_v4_syn_recv_sock_toa 232>
0xffffffffc03e0440 <tcp_v4_syn_recv_sock_toa 304>: cmp %r13d,%esi
0xffffffffc03e0443 <tcp_v4_syn_recv_sock_toa 307>: jg 0xffffffffc03e03f8 <tcp_v4_syn_recv_sock_toa 232>
0xffffffffc03e0445 <tcp_v4_syn_recv_sock_toa 309>: cmp $0x8,%esi
0xffffffffc03e0448 <tcp_v4_syn_recv_sock_toa 312>: je 0xffffffffc03e04dc <tcp_v4_syn_recv_sock_toa 460>
0xffffffffc03e044e <tcp_v4_syn_recv_sock_toa 318>: cmp $0x14,%esi
0xffffffffc03e0451 <tcp_v4_syn_recv_sock_toa 321>: je 0xffffffffc03e0470 <tcp_v4_syn_recv_sock_toa 352>
0xffffffffc03e0453 <tcp_v4_syn_recv_sock_toa 323>: lea -0x2(%rdi,%rdx,1),%rax
0xffffffffc03e0458 <tcp_v4_syn_recv_sock_toa 328>: sub %esi,%r13d
0xffffffffc03e045b <tcp_v4_syn_recv_sock_toa 331>: jmpq 0xffffffffc03e03d7 <tcp_v4_syn_recv_sock_toa 199>
0xffffffffc03e0460 <tcp_v4_syn_recv_sock_toa 336>: add $0x14,%rax
0xffffffffc03e0464 <tcp_v4_syn_recv_sock_toa 340>: jne 0xffffffffc03e03d7 <tcp_v4_syn_recv_sock_toa 199>
0xffffffffc03e046a <tcp_v4_syn_recv_sock_toa 346>: jmp 0xffffffffc03e03f8 <tcp_v4_syn_recv_sock_toa 232>
0xffffffffc03e046c <tcp_v4_syn_recv_sock_toa 348>: nopl 0x0(%rax)
0xffffffffc03e0470 <tcp_v4_syn_recv_sock_toa 352>: cmp $0xfd,�x
0xffffffffc03e0476 <tcp_v4_syn_recv_sock_toa 358>: jne 0xffffffffc03e0453 <tcp_v4_syn_recv_sock_toa 323>
0xffffffffc03e0478 <tcp_v4_syn_recv_sock_toa 360>: lea -0x5c(%rbp),%r13
0xffffffffc03e047c <tcp_v4_syn_recv_sock_toa 364>: movzbl %dl,%r12d
0xffffffffc03e0480 <tcp_v4_syn_recv_sock_toa 368>: mov %rax,%rsi
0xffffffffc03e0483 <tcp_v4_syn_recv_sock_toa 371>: mov %r12,%rdx
0xffffffffc03e0486 <tcp_v4_syn_recv_sock_toa 374>: mov %r13,%rdi
0xffffffffc03e0489 <tcp_v4_syn_recv_sock_toa 377>: callq 0xffffffffa6586b60 <__memcpy>
0xffffffffc03e048e <tcp_v4_syn_recv_sock_toa 382>: mov $0x20,%esi
0xffffffffc03e0493 <tcp_v4_syn_recv_sock_toa 387>: mov %r12,%rdi
0xffffffffc03e0496 <tcp_v4_syn_recv_sock_toa 390>: callq 0xffffffffa641d850 <__kmalloc>
0xffffffffc03e049b <tcp_v4_syn_recv_sock_toa 395>: test %rax,%rax
0xffffffffc03e049e <tcp_v4_syn_recv_sock_toa 398>: je 0xffffffffc03e03f8 <tcp_v4_syn_recv_sock_toa 232>
0xffffffffc03e04a4 <tcp_v4_syn_recv_sock_toa 404>: mov %r12,%rdx
0xffffffffc03e04a7 <tcp_v4_syn_recv_sock_toa 407>: mov %r13,%rsi
0xffffffffc03e04aa <tcp_v4_syn_recv_sock_toa 410>: mov %rax,%rdi
0xffffffffc03e04ad <tcp_v4_syn_recv_sock_toa 413>: callq 0xffffffffa6586b60 <__memcpy>
看汇编没看出问题,突然想到既然kmalloc用的地方没问题,哪什么地方kfree的,会不会kfree有问题。
代码语言:javascript复制crash> dis __kmalloc
0xffffffffa641d8b1 <__kmalloc 97>: mov (%r10),%r8 //r10是s,取s->cpu_slab到r8
0xffffffffa641d8b4 <__kmalloc 100>: add %gs:0x59bf38e4(%rip),%r8 # 0x111a0
0xffffffffa641d8bc <__kmalloc 108>: mov 0x8(%r8),%rdx //r8是c, rdx是tid
0xffffffffa641d8c0 <__kmalloc 112>: mov (%r8),%r13 //r13是object
0xffffffffa641d8c3 <__kmalloc 115>: mov 0x10(%r8),%rax //取page
0xffffffffa641d8c7 <__kmalloc 119>: test %r13,%r13 //if !object
0xffffffffa641d8ca <__kmalloc 122>: je 0xffffffffa641d9f9 <__kmalloc 425>
0xffffffffa641d8d0 <__kmalloc 128>: test %rax,%rax //if !page
0xffffffffa641d8d3 <__kmalloc 131>: je 0xffffffffa641d9f9 <__kmalloc 425>
0xffffffffa641d8d9 <__kmalloc 137>: movslq 0x20(%r10),%rax //r10是s,取offset
0xffffffffa641d8dd <__kmalloc 141>: lea 0x1(%rdx),%rcx //next_tid
0xffffffffa641d8e1 <__kmalloc 145>: mov (%r10),%r8 //取s->cpu_slab到r8
0xffffffffa641d8e4 <__kmalloc 148>: mov 0x0(%r13,%rax,1),%rbx
0xffffffffa641d8e9 <__kmalloc 153>: mov %r13,%rax
0xffffffffa641d8ec <__kmalloc 156>: cmpxchg16b %gs:(%r8)
0xffffffffa641d8f1 <__kmalloc 161>: sete %al
0xffffffffa641d8f4 <__kmalloc 164>: test %al,%al
0xffffffffa641d8f6 <__kmalloc 166>: je 0xffffffffa641d8b1 <__kmalloc 97> //goto redo;
0xffffffffa641d8f8 <__kmalloc 168>: movslq 0x20(%r10),%rax //s->offset
0xffffffffa641d8fc <__kmalloc 172>: test $0x8000,%r12d //gfpflags & __GFP_ZERO
0xffffffffa641d903 <__kmalloc 179>: prefetcht0 (%rbx,%rax,1)
出问题时RIP在这一行
代码语言:javascript复制mov 0x0(%r13,%rax,1),%rbx
对应着如下代码,难就难在全都是inline函数,汇编和代码不好对应。
代码语言:javascript复制__kmalloc->slab_alloc->slab_alloc_node->get_freepointer_safe->get_freepointer
static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
return *(void **)(object s->offset);
}
简化一下就是*(object s->offset),把object指针偏移一下,然后取指针指向的内容,指针有问题,指向的地址没有物理page对应,发生了pagefault。object来自于freelist,大概率就是freelist有问题,为什么freelist有问题,基本可以确定是重复kfree了。
代码语言:javascript复制crash> kmem -s ffff8a4d3fc03e00
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
kmem: kmalloc-8: slab: ffffd3ca90ffaf00 duplicate freelist object: ffff8a4fffebca10
kmem: kmalloc-8: slab: ffffd3ca90b335c0 duplicate freelist object: ffff8a4feccd79a8
kmem: kmalloc-8: slab: ffffd3ca90b0c600 duplicate freelist object: ffff8a4fec318fe8
kmem: kmalloc-8: slab: ffffd3ca8cf65e00 duplicate freelist object: ffff8a4efd978e38
kmem: kmalloc-8: slab: ffffd3ca8cf4e100 duplicate freelist object: ffff8a4efd384d80
kmem: kmalloc-8: slab: ffffd3ca8cf679c0 duplicate freelist object: ffff8a4efd9e7b70
ffff8a4d3fc03e00 kmalloc-8
看代码kfree(sk->sk_user_data)后没有置空sk->sk_user_data = NULL。
把sk->sk_user_data = NULL加到代码,全新编译内核模块,结果后来又crash了,看了toa的统计我发现free次数还是多于malloc次数。
代码语言:javascript复制cat /proc/1/task/1/net/toa_stats > toa_stats.txt
j="0"; for i in $(awk 'NR==2{print $3 " " $4 " " $5 " " $6 " " $7 " " $8 " " $9}' toa_stats.txt); do j=$[$i $j]; done;
k="0"; for i in $(awk 'NR==4{print $3 " " $4 " " $5 " " $6 " " $7 " " $8 " " $9}' toa_stats.txt); do k=$[$i $k]; done;
echo "kmalloc $j kfree $k"
进一步怀疑是没有加锁导致的,一个线程accept后,另两个线程同时调用getpeername就会出现重复kfree,得加锁保护。
还有一个问题就是没有调用getpeername就会存在内存泄露。
看dpvs中toa已经修改sk_free时kfree(sk->sk_user_data),并且做了保护,kfree和getpeername互斥,但加的锁是per cpu spin lock,用的是spin_lock_bh和spin_unlock_bh,访问是在process context中,kmalloc是在soft irq context中,kfree有可能在soft irq也有可能在process context中,但不能保证一定是同一个cpu,比如一个cpu线程getpeername,另一个cpu线程close那么也存在问题,假如不允许这样,会不会一个cpu线程getpeername,此时收到了reset包,另一个cpu在soft irq中kfree,感觉这样的情况是会出现的。
看汇编是很有意思的一件事,三分基础七分观察。
遗留问题
alloc_percpu的数据在crash中怎么查看?想看toa的kmalloc和kfree的统计数据,统计数据是动态alloc的percpu数据,怎么找到它的内存,看汇编和gs寄存器有关系,推理了好久没搞明白,有懂的请留言指教。
参考文献
iqiyi/dpvs
汇编语言基础 - 文集 - 简书
铁甲万能狗:第9篇-C/C x86_64的函数栈