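When a program written in C/C++ crashes, it sometimes fails to produce a core file (even with ulimit -c unlimited set), so you often have no idea what happened. Production environments simply do not let developers debug in place, and the logs do not always reveal the problem. (If a core file was generated, or the logs are enough to locate the problem, you can skip this article.)
This article deals specifically with the case where no core file was generated and the problem cannot be analyzed from the logs.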
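Before giving up on a core file, it can be worth confirming which limit the crashing process actually inherited, since that can differ from what ulimit reports in an interactive shell. The following is only a minimal sketch, assuming a POSIX system that provides getrlimit; the function name core_limit_description is made up for illustration.

```cpp
#include <sys/resource.h>
#include <cassert>
#include <string>

// Describe the soft core-file size limit of the calling process.
// Returns "disabled", "unlimited", "<n> bytes", or "unknown" if the
// getrlimit call itself fails. (Hypothetical helper, not from the article.)
std::string core_limit_description() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return "unknown";
    }
    if (lim.rlim_cur == 0) {
        return "disabled";
    }
    if (lim.rlim_cur == RLIM_INFINITY) {
        return "unlimited";
    }
    return std::to_string(static_cast<unsigned long long>(lim.rlim_cur)) + " bytes";
}
```

Logging the result of core_limit_description() once at startup shows the limit the process really runs under. Even when it reports "unlimited", a core file can still be missing for other reasons, which is the situation the rest of this article addresses.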
Step 1: write a piece of test code, main.cpp:
#include <iostream>
#include <cstdio>
#include <memory.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <ucontext.h>
#include <dlfcn.h>
#include <execinfo.h>
#include <thread>
#include <chrono>
#include <vector>
#include <functional>
#include <iomanip>
#include <mutex>
#include <random>
using namespace std;
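// Signal handler: print the signal number and a symbolized backtrace.
// Note: std::cout and backtrace_symbols() are not async-signal-safe, so this is for debugging only.
void sigsegv_handler(int signum)
{
    std::cout<<"catch signal:"<<signum<<endl;
    // A modest frame buffer; a multi-megabyte array here would overflow the thread stack.
    const int max_frames=128;
    void *buffer[max_frames];
    char **strings;
    int j,nptrs;
    nptrs=backtrace(buffer,max_frames);
    cout<<"backtrace returned address:"<<nptrs<<endl;
    strings=backtrace_symbols(buffer,nptrs);
    if (strings!=NULL)
    {
        for(j=0;j<nptrs;++j)
        {
            cout<<strings[j]<<endl;
        }
    }
    free(strings);
}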
// Install sigsegv_handler for the signals of interest.
static void catch_sigsegv()
{
    struct sigaction action;
    memset(&action, 0, sizeof(action));
    action.sa_handler=sigsegv_handler;
    // SA_RESETHAND restores the default action once the handler has been entered.
    action.sa_flags=SA_NODEFER|SA_RESETHAND;
    if (sigaction(SIGSEGV, &action, NULL) != 0) { cout<<"sig_action error"<<endl; }
    if (sigaction(SIGFPE, &action, NULL) != 0) { cout<<"sig_action error"<<endl; }
    if (sigaction(SIGINT, &action, NULL) != 0) { cout<<"sig_action error"<<endl; }
    if (sigaction(SIGILL, &action, NULL) != 0) { cout<<"sig_action error"<<endl; }
    if (sigaction(SIGTERM, &action, NULL) != 0) { cout<<"sig_action error"<<endl; }
    if (sigaction(SIGABRT, &action, NULL) != 0) { cout<<"sig_action error"<<endl; }
}
std::mutex m_mutex;
void *thread_entry(int thread_index)
{
    unsigned long index=0;
    std::default_random_engine e;
    e.seed(thread_index);
    while(true){
        auto random_value=e();
        // Deliberate bug: whenever the random value is a multiple of 123, write through a null pointer.
        if(random_value%123==0){
            int *p=nullptr;
            *p=10;
        }
        {
            auto t=std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
            std::lock_guard<std::mutex> lock(m_mutex);
            std::cout<<"thread_index["<<thread_index<<"]:id["<<std::this_thread::get_id()<<"]:["<<std::put_time(std::localtime(&t), "%Y-%m-%d %X")<<"]:"<<++index<<":random_value="<<random_value<<std::endl;
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return nullptr;
}
int main(int argc,char *argv[])
{
    //catch_sigsegv();
    std::vector<std::thread> thread_vector;
    // Start ten worker threads, then wait for all of them.
    for(int i=0;i<10;++i){
        thread_vector.push_back(std::move(std::thread(std::bind(thread_entry,i))));
    }
    for(int i=0;i<10;++i){
        thread_vector[i].join();
    }
    return 0;
}
The code is rather messy and contains a bug. The bug is not obvious, but it surfaces once the program has run for a while.
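The sigsegv_handler in the listing is left disabled in main, and it relies on std::cout and backtrace_symbols, neither of which is async-signal-safe. If you do want a handler, the following is a minimal sketch of a more conservative variant, assuming glibc's execinfo.h; the names dump_backtrace, crash_handler, and install_crash_handler are invented for illustration. It uses backtrace_symbols_fd, which writes straight to a file descriptor instead of allocating a string array.

```cpp
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>
#include <cassert>
#include <cstring>

// Write the current call stack to fd and return the number of frames captured.
// backtrace_symbols_fd() formats the frames without returning malloc'ed strings.
int dump_backtrace(int fd) {
    void *frames[64];
    int count = backtrace(frames, 64);
    backtrace_symbols_fd(frames, count, fd);
    return count;
}

// Handler: announce the crash with write(2), dump the stack, then re-raise.
extern "C" void crash_handler(int signum) {
    const char msg[] = "caught fatal signal, backtrace:\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    dump_backtrace(STDERR_FILENO);
    // SA_RESETHAND has already restored the default action, so re-raising
    // lets that default action terminate the process once the handler returns.
    raise(signum);
}

// Install crash_handler for SIGSEGV; returns true on success.
bool install_crash_handler() {
    struct sigaction action;
    std::memset(&action, 0, sizeof(action));
    action.sa_handler = crash_handler;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_RESETHAND;
    return sigaction(SIGSEGV, &action, nullptr) == 0;
}
```

Calling install_crash_handler() at the top of main would enable it. The frames are only readable as function names when the binary is linked with -rdynamic, as in the build command used in this article; otherwise you get raw addresses, which can still be resolved with the tools shown later.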
Once it is written, compile it (your compiler must support at least C++11):
g++ -Wall -g -rdynamic main.cpp -o main -lpthread -std=c++11
When the build finishes, run it: ./main
The output looks like this:
thread_index[9]:id[139681072420608]:[2020-07-28 18:24:26]:4:random_value=274558334
thread_index[6]:id[139681097598720]:[2020-07-28 18:24:26]:4:random_value=1614694654
thread_index[5]:id[139681105991424]:[2020-07-28 18:24:26]:4:random_value=629750996
thread_index[8]:id[139681080813312]:[2020-07-28 18:24:26]:4:random_value=1437098323
thread_index[7]:id[139681089206016]:[2020-07-28 18:24:26]:4:random_value=452154665
thread_index[0]:id[139681147954944]:[2020-07-28 18:24:27]:5:random_value=1144108930
thread_index[1]:id[139681139562240]:[2020-07-28 18:24:27]:5:random_value=1144108930
thread_index[4]:id[139681114384128]:[2020-07-28 18:24:27]:5:random_value=281468426
thread_index[2]:id[139681131169536]:[2020-07-28 18:24:27]:5:random_value=140734213
thread_index[3]:id[139681122776832]:[2020-07-28 18:24:27]:5:random_value=1284843143
thread_index[9]:id[139681072420608]:[2020-07-28 18:24:27]:5:random_value=1707045782
thread_index[6]:id[139681097598720]:[2020-07-28 18:24:27]:5:random_value=422202639
thread_index[5]:id[139681105991424]:[2020-07-28 18:24:27]:5:random_value=1425577356
thread_index[8]:id[139681080813312]:[2020-07-28 18:24:27]:5:random_value=562936852
thread_index[7]:id[139681089206016]:[2020-07-28 18:24:27]:5:random_value=1566311569
thread_index[0]:id[139681147954944]:[2020-07-28 18:24:28]:6:random_value=470211272
thread_index[1]:id[139681139562240]:[2020-07-28 18:24:28]:6:random_value=470211272
thread_index[4]:id[139681114384128]:[2020-07-28 18:24:28]:6:random_value=1880845088
thread_index[2]:id[139681131169536]:[2020-07-28 18:24:28]:6:random_value=940422544
thread_index[3]:id[139681122776832]:[2020-07-28 18:24:28]:6:random_value=1410633816
thread_index[5]:id[139681105991424]:[2020-07-28 18:24:28]:6:random_value=203572713
thread_index[9]:id[139681072420608]:[2020-07-28 18:24:28]:6:random_value=2084417801
thread_index[8]:id[139681080813312]:[2020-07-28 18:24:28]:6:random_value=1614206529
thread_index[6]:id[139681097598720]:[2020-07-28 18:24:28]:6:random_value=673783985
thread_index[7]:id[139681089206016]:[2020-07-28 18:24:28]:6:random_value=1143995257
thread_index[0]:id[139681147954944]:[2020-07-28 18:24:29]:7:random_value=101027544
thread_index[2]:id[139681131169536]:[2020-07-28 18:24:29]:7:random_value=202055088
thread_index[3]:id[139681122776832]:[2020-07-28 18:24:29]:7:random_value=303082632
thread_index[5]:id[139681105991424]:[2020-07-28 18:24:29]:7:random_value=505137720
thread_index[9]:id[139681072420608]:[2020-07-28 18:24:29]:7:random_value=909247896
thread_index[1]:id[139681139562240]:[2020-07-28 18:24:29]:7:random_value=101027544
thread_index[4]:id[139681114384128]:[2020-07-28 18:24:29]:7:random_value=404110176
thread_index[8]:id[139681080813312]:[2020-07-28 18:24:29]:7:random_value=808220352
thread_index[6]:id[139681097598720]:[2020-07-28 18:24:29]:7:random_value=606165264
thread_index[7]:id[139681089206016]:[2020-07-28 18:24:29]:7:random_value=707192808
thread_index[0]:id[139681147954944]:[2020-07-28 18:24:30]:8:random_value=1457850878
thread_index[2]:id[139681131169536]:[2020-07-28 18:24:30]:8:random_value=768218109
thread_index[3]:id[139681122776832]:[2020-07-28 18:24:30]:8:random_value=78585340
thread_index[6]:id[139681097598720]:[2020-07-28 18:24:30]:8:random_value=157170680
thread_index[9]:id[139681072420608]:[2020-07-28 18:24:30]:8:random_value=235756020
thread_index[7]:id[139681089206016]:[2020-07-28 18:24:30]:8:random_value=1615021558
thread_index[4]:id[139681114384128]:[2020-07-28 18:24:30]:8:random_value=1536436218
thread_index[5]:id[139681105991424]:[2020-07-28 18:24:30]:8:random_value=846803449
thread_index[8]:id[139681080813312]:[2020-07-28 18:24:30]:8:random_value=925388789
thread_index[1]:id[139681139562240]:[2020-07-28 18:24:30]:8:random_value=1457850878
thread_index[0]:id[139681147954944]:[2020-07-28 18:24:31]:9:random_value=1458777923
thread_index[2]:id[139681131169536]:[2020-07-28 18:24:31]:9:random_value=770072199
thread_index[3]:id[139681122776832]:[2020-07-28 18:24:31]:9:random_value=81366475
thread_index[6]:id[139681097598720]:[2020-07-28 18:24:31]:9:random_value=162732950
thread_index[5]:id[139681105991424]:[2020-07-28 18:24:31]:9:random_value=851438674
thread_index[7]:id[139681089206016]:[2020-07-28 18:24:31]:9:random_value=1621510873
thread_index[4]:id[139681114384128]:[2020-07-28 18:24:31]:9:random_value=1540144398
thread_index[9]:id[139681072420608]:[2020-07-28 18:24:31]:9:random_value=244099425
thread_index[8]:id[139681080813312]:[2020-07-28 18:24:31]:9:random_value=932805149
thread_index[1]:id[139681139562240]:[2020-07-28 18:24:31]:9:random_value=1458777923
thread_index[0]:id[139681147954944]:[2020-07-28 18:24:32]:10:random_value=2007237709
thread_index[2]:id[139681131169536]:[2020-07-28 18:24:32]:10:random_value=1866991771
thread_index[3]:id[139681122776832]:[2020-07-28 18:24:32]:10:random_value=1726745833
thread_index[6]:id[139681097598720]:[2020-07-28 18:24:32]:10:random_value=1306008019
thread_index[5]:id[139681105991424]:[2020-07-28 18:24:32]:10:random_value=1446253957
thread_index[9]:id[139681072420608]:[2020-07-28 18:24:32]:10:random_value=885270205
thread_index[4]:id[139681114384128]:[2020-07-28 18:24:32]:10:random_value=1586499895
thread_index[7]:id[139681089206016]:[2020-07-28 18:24:32]:10:random_value=1165762081
thread_index[8]:id[139681080813312]:[2020-07-28 18:24:32]:10:random_value=1025516143
thread_index[1]:id[139681139562240]:[2020-07-28 18:24:32]:10:random_value=2007237709
thread_index[0]:id[139681147954944]:[2020-07-28 18:24:33]:11:random_value=823564440
thread_index[2]:id[139681131169536]:[2020-07-28 18:24:33]:11:random_value=1647128880
thread_index[3]:id[139681122776832]:[2020-07-28 18:24:33]:11:random_value=323209673
thread_index[6]:id[139681097598720]:[2020-07-28 18:24:33]:11:random_value=646419346
thread_index[5]:id[139681105991424]:[2020-07-28 18:24:33]:11:random_value=1970338553
thread_index[9]:id[139681072420608]:[2020-07-28 18:24:33]:11:random_value=969629019
thread_index[4]:id[139681114384128]:[2020-07-28 18:24:33]:11:random_value=1146774113
thread_index[1]:id[139681139562240]:[2020-07-28 18:24:33]:11:random_value=823564440
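Depending on the run, it will crash before long. If it still has not crashed after a long time, adjust the 123 on line 65 of main.cpp (the divisor in the crash condition inside thread_entry). The smaller the value, the sooner it crashes, and a single-digit value crashes almost immediately.
Step 2: assume no core file was generated (if one was, you can delete it)
Use the dmesg command to view the kernel's record of the crash. The output looks like this (it differs somewhat from machine to machine):
[85277.854606] CPU7: Core temperature/speed normal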
[90749.125541] CPU7: Core temperature above threshold, cpu clock throttled (total events = 2266885)
[90749.125542] CPU15: Core temperature above threshold, cpu clock throttled (total events = 2266885)
[90749.125544] CPU0: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125546] CPU12: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125547] CPU8: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125549] CPU6: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125551] CPU1: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125553] CPU5: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125554] CPU13: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125556] CPU2: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125557] CPU4: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125559] CPU3: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125560] CPU11: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125561] CPU10: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125562] CPU9: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125563] CPU14: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125564] CPU15: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.125580] CPU7: Package temperature above threshold, cpu clock throttled (total events = 4718542)
[90749.126527] CPU7: Core temperature/speed normal
[90749.126528] CPU5: Package temperature/speed normal
[90749.126529] CPU11: Package temperature/speed normal
[90749.126530] CPU3: Package temperature/speed normal
[90749.126531] CPU10: Package temperature/speed normal
[90749.126531] CPU2: Package temperature/speed normal
[90749.126533] CPU6: Package temperature/speed normal
[90749.126534] CPU1: Package temperature/speed normal
[90749.126535] CPU15: Core temperature/speed normal
[90749.126535] CPU9: Package temperature/speed normal
[90749.126537] CPU12: Package temperature/speed normal
[90749.126538] CPU14: Package temperature/speed normal
[90749.126539] CPU8: Package temperature/speed normal
[90749.126540] CPU4: Package temperature/speed normal
[90749.126541] CPU13: Package temperature/speed normal
[90749.126542] CPU0: Package temperature/speed normal
[90749.126542] CPU15: Package temperature/speed normal
[90749.126555] CPU7: Package temperature/speed normal
[93203.608134] main[32241]: segfault at 0 ip 000000000040749a sp 00007fc3c8f13c90 error 6 in main[400000+c000]
[95130.640597] main[9295]: segfault at 0 ip 000000000040742a sp 00007ff8bff35c90 error 6 in main[400000+c000]
[95130.640616] main[9296]: segfault at 0 ip 000000000040742a sp 00007ff8bf734c90 error 6 in main[400000+c000]
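Most of the output above is irrelevant. What matters are the segfault lines, which here are the last three. The fields of a segfault line break down as follows:
1. There are three segfault lines that relate to the main program.
2. segfault at 0: 0 is the memory address whose access faulted. An address of 0 suggests an invalid access such as dereferencing nullptr.
3. ip 000000000040749a / ip 000000000040742a: ip here is not a network IP but the instruction pointer. For background, consult an assembly reference; it is not explained here. The address after ip is the most important piece of information: it is the instruction the CPU was executing when the program crashed. (Sometimes the address after ip is null; that case will be analyzed in a later installment, and there is a way to handle it too.)
4. sp 00007fc3c8f13c90: sp is the stack pointer, which points to the top of the stack. It pairs with bp, the base pointer. If that is unfamiliar, brush up on assembly; I can't help you there.
5. error 6: as you might guess, this is an error code. The codes follow rules that are documented in fault.c in the Linux kernel source:
The error code depends on the operating system (and on the CPU architecture), so always interpret the value after error in the context of your own system.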
 * Page fault error code bits:
 *
 *   bit 0 ==  0: no page found        1: protection fault
 *   bit 1 ==  0: read access          1: write access
 *   bit 2 ==  0: kernel-mode access   1: user-mode access
 *   bit 3 ==                          1: use of reserved bit detected
 *   bit 4 ==                          1: fault was an instruction fetch
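Error code 6 is 110 in binary: bit 0 is 0 (no page found), bit 1 is 1 (write access), and bit 2 is 1 (user-mode access). In other words, user-mode code wrote to an address that has no mapped page. At this point the preliminary diagnosis is that an assignment caused the crash.
6. in main[400000+c000]: 400000 is the address at which the main binary is mapped, and c000 is the size of that mapping.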
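The error-code bits from item 5 can also be decoded mechanically. The sketch below is only an illustration of the bit table quoted above, assuming the x86 layout shown there; the function name decode_page_fault_error is made up.

```cpp
#include <cassert>
#include <string>

// Turn the number printed after "error" in a segfault line into text,
// following the five bits of the x86 page-fault error code listed above.
// (Hypothetical helper for illustration.)
std::string decode_page_fault_error(unsigned code) {
    std::string out;
    out += (code & 0x1) ? "protection fault" : "no page found";
    out += (code & 0x2) ? ", write access" : ", read access";
    out += (code & 0x4) ? ", user-mode access" : ", kernel-mode access";
    if (code & 0x8) {
        out += ", use of reserved bit detected";
    }
    if (code & 0x10) {
        out += ", instruction fetch";
    }
    return out;
}
```

For the value in this article, decode_page_fault_error(6) yields "no page found, write access, user-mode access", which matches the manual reading of binary 110.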
With that, all the information we need has been collected.
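If you have many such lines to go through, the fields can be pulled out programmatically. The following is a rough sketch that assumes the line layout shown in the dmesg output above (segfault at, ip, sp, error, then name[base+size]); SegfaultInfo and parse_segfault_line are hypothetical names, and other kernel versions may format the line differently.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Fields of one kernel "segfault at ..." line (hypothetical struct).
struct SegfaultInfo {
    unsigned long long fault_addr = 0;  // address after "segfault at"
    unsigned long long ip = 0;          // instruction pointer
    unsigned long long sp = 0;          // stack pointer
    unsigned error = 0;                 // page-fault error code
    std::string module;                 // binary or library containing ip
    unsigned long long map_base = 0;    // start of the mapping
    unsigned long long map_size = 0;    // size of the mapping
    // Offset of the faulting instruction inside the mapping.
    unsigned long long offset() const { return ip - map_base; }
};

// Parse one dmesg line; returns false if it is not a segfault line.
// All numbers in the line are hexadecimal. The separator between base
// and size is accepted as either '+' or a space.
bool parse_segfault_line(const std::string &line, SegfaultInfo &info) {
    std::size_t at = line.find("segfault at ");
    if (at == std::string::npos) {
        return false;
    }
    char module[256] = {0};
    int fields = std::sscanf(
        line.c_str() + at,
        "segfault at %llx ip %llx sp %llx error %x in %255[^[][%llx%*[+ ]%llx]",
        &info.fault_addr, &info.ip, &info.sp, &info.error,
        module, &info.map_base, &info.map_size);
    if (fields != 7) {
        return false;
    }
    info.module = module;
    return true;
}
```

The offset() value is included because, for a position-independent executable or a shared library, it is usually ip minus the mapping base that has to be looked up in the disassembly. For the binary in this article, which is mapped at the fixed address 400000, the absolute ip can be searched for directly, as the next step does.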
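Step 3: the moment of truth
1. Disassemble the compiled main: objdump -d main >main.od. While you are at it, dump the symbols as well: nm main >main.nm
2. Open main.od in vim and search for the addresses that follow ip in the segfault lines, here 000000000040749a and 000000000040742a. There is no match for 40749a, although that address does lie within the program. 40742a is found, at the instructions described below:
Line 627 of main.od reads mov -0x28(%rbp),%rax: it loads the value stored at -0x28(%rbp), a slot in the current function's stack frame that holds a local variable, into %rax.
Line 628 of main.od reads movl $0xa,(%rax): $0xa is an immediate value (10) and (%rax) is register-indirect addressing (if this is unfamiliar, look up the addressing modes in an assembly reference: direct, indirect, and so on; I believe there are seven or eight in total). The instruction stores 10 at the memory address held in %rax, that is, it writes through the pointer loaded on the previous line.
At this point the problem is essentially located. Compare it with the source code and everything becomes clear.
All in all, this is dizzying, especially for anyone who has never studied assembly or whose assembly is weak. So is there an easier way? The answer is: yes.
3. Use the addr2line tool
Run addr2line -f -e main 40749a (the -f option prints the function name); the result is:
_Z12thread_entryi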
/home/lian.shao.hua/work/demo/catch_segv/main.cpp:73 (discriminator 3)
Run addr2line -f -e main 40742a; the result is:
_Z12thread_entryi
/home/lian.shao.hua/work/demo/catch_segv/main.cpp:68
With that, the faulty lines are obvious: lines 73 and 68 of main.cpp.
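The function name printed by addr2line, _Z12thread_entryi, is a mangled C++ symbol. As a small sketch, assuming a GCC or Clang toolchain that ships cxxabi.h, it can be turned back into a readable signature with abi::__cxa_demangle; the wrapper name demangle is made up for illustration.

```cpp
#include <cxxabi.h>
#include <cassert>
#include <cstdlib>
#include <string>

// Demangle a C++ symbol name; returns the input unchanged if demangling fails.
// (Hypothetical wrapper around abi::__cxa_demangle.)
std::string demangle(const std::string &mangled) {
    int status = 0;
    char *raw = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr) {
        return mangled;
    }
    std::string result(raw);
    std::free(raw);  // the buffer returned by __cxa_demangle is malloc'ed
    return result;
}
```

Here demangle("_Z12thread_entryi") gives thread_entry(int), confirming that both crashes are inside the thread function. On the command line, passing -C to addr2line or piping the name through c++filt does the same job.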
Of course, if the program was compiled with -O1, -O2, or -O3, the optimizations will make it harder to pinpoint the problem.
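This article originates from ztenv and was compiled and edited by javajgs_com; copyright belongs to ztenv. The content reflects the author's personal views and does not imply endorsement by Java架构师必看. If you repost it, please credit the source.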