背景:CPU空闲时间<10%
排查过程
- 第一步:找出耗CPU的进程
使用top命令查看内存、cpu及各进程的信息。
代码语言:javascript复制[logdev@feed1 ~]$ top
top - 14:36:58 up 802 days, 11:55, 1 user, load average: 11.33, 11.28, 11.30
Tasks: 179 total, 1 running, 177 sleeping, 1 stopped, 0 zombie
Cpu(s): 96.9%us, 0.5%sy, 0.0%ni, 2.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8058904k total, 7524380k used, 534524k free, 105400k buffers
Swap: 2097148k total, 179096k used, 1918052k free, 1803916k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME COMMAND
26999 root 20 0 10.0g 4.4g 6648 S 774.3 57.9 159022:35 java
960 root 20 0 5769m 338m 6732 S 2.0 4.3 1473:09 java
8571 root 20 0 2305m 64m 2340 T 1.0 0.8 3514:09 titanagent
18467 root 20 0 24568 1456 1156 S 0.3 0.0 25:34.90 zabbix_agentd
18471 root 20 0 24568 1456 1156 S 0.3 0.0 25:38.52 zabbix_agentd
1 root 20 0 19232 964 784 S 0.0 0.0 1733:41 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 10:45.41 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 11:04.29 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/0
6 root RT 0 0 0 0 S 0.0 0.0 1:36.32 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 11:36.72 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/1
9 root 20 0 0 0 0 S 0.0 0.0 11:07.75 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 1:24.52 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 6:03.33 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/2
可以发现CPU使用率用户为96.9,PID=26999的Java进程CPU使用率高达774。
- 第二步:判断Java进程服务
使用ps命令查看对应Java进程服务。
代码语言:javascript复制[logdev@feed1 ~]$ ps aux | grep java
root 960 2.2 4.3 5907692 347072 ? Sl Mar25 1473:16 /Data/apps/jdk/bin/java -Xms128m -Xmx128m -Dcom.sun.management.jmxremote -Dflume.monitoring.type=http -Dflume.monitoring.port=34545 -cp /Data/apps/flume-ng/conf:/Data/apps/flume-ng/lib/*:/lib/* -Djava.library.path= org.apache.flume.node.Application --name flumeLog --conf-file /Data/apps/flume-ng/conf/flume.conf
logdev 14288 0.0 0.0 103256 844 pts/0 S 14:40 0:00 grep java
root 26999 95.5 57.8 10458436 4662740 ? Sl Jan14 159052:18 /Data/apps/tomcat7/jdk1.8.0_51/jre/bin/java -Djava.util.logging.config.file=/Data/apps/tomcat7/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -server -Xms4096m -Xmx4096m -XX: UseParNewGC -XX: CMSParallelRemarkEnabled -XX: UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -Djdk.tls.ephemeralDHKeySize=2048 -Djava.endorsed.dirs=/Data/apps/tomcat7/endorsed -classpath /Data/apps/tomcat7/bin/bootstrap.jar:/Data/apps/tomcat7/bin/tomcat-juli.jar -Dcatalina.base=/Data/apps/tomcat7 -Dcatalina.home=/Data/apps/tomcat7 -Djava.io.tmpdir=/Data/apps/tomcat7/temp org.apache.catalina.startup.Bootstrap start
通过结果可以确定26999进程就是我们自己运行的Java服务。
- 第三步:找出耗CPU的线程
在上一步中找到了耗CPU的进程,下面要找到耗CPU的线程。我们知道一个进程中可以有多个线程,进程是线程的集合。使用下面的命令找到耗CPU的线程,意思就是找出该进程下的线程信息。
代码语言:javascript复制[logdev@feed1 ~]$ top -Hp 26999
top - 14:44:39 up 802 days, 12:03, 1 user, load average: 11.44, 11.44, 11.37
Tasks: 483 total, 11 running, 472 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.5%us, 0.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8058904k total, 7593116k used, 465788k free, 105964k buffers
Swap: 2097148k total, 179096k used, 1918052k free, 1872120k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME COMMAND
9137 root 20 0 10.0g 4.4g 6648 R 98.9 57.9 13526:45 java
9141 root 20 0 10.0g 4.4g 6648 R 92.3 57.9 12563:04 java
9132 root 20 0 10.0g 4.4g 6648 R 91.6 57.9 13260:51 java
9133 root 20 0 10.0g 4.4g 6648 R 88.0 57.9 13784:41 java
9138 root 20 0 10.0g 4.4g 6648 R 75.0 57.9 15449:16 java
9136 root 20 0 10.0g 4.4g 6648 R 67.1 57.9 15177:20 java
9142 root 20 0 10.0g 4.4g 6648 R 62.1 57.9 13996:03 java
9135 root 20 0 10.0g 4.4g 6648 R 61.8 57.9 14697:15 java
9134 root 20 0 10.0g 4.4g 6648 R 59.8 57.9 14952:27 java
9140 root 20 0 10.0g 4.4g 6648 R 51.1 57.9 12343:54 java
27011 root 20 0 10.0g 4.4g 6648 R 35.2 57.9 564:22.75 java
11027 root 20 0 10.0g 4.4g 6648 S 0.7 57.9 351:45.79 java
11028 root 20 0 10.0g 4.4g 6648 S 0.7 57.9 351:52.08 java
11059 root 20 0 10.0g 4.4g 6648 S 0.7 57.9 351:53.85 java
11024 root 20 0 10.0g 4.4g 6648 S 0.3 57.9 351:53.96 java
11025 root 20 0 10.0g 4.4g 6648 S 0.3 57.9 352:32.27 java
通过结果可以明显看出前几个线程CPU的使用率一直都很高。
- 第四步:从JVM堆栈中查找线程信息
我们获得了耗时较高的线程ID,下面通过JVM的堆栈信息找到线程信息,那么如何获得JVM的堆栈信息那,使用下面的命令下载该进程的运行中堆栈信息。
代码语言:javascript复制jstack -l 26999 >> 26999.txt
- 第五步:分析堆栈信息
从堆栈信息中发现大量的线程RUNABLE状态,可以初步确定问题方向是正确的。
代码语言:javascript复制"pool-2-thread-10" #419 prio=5 os_prio=0 tid=0x00007f39d80da000 nid=0x23b6 runnable [0x00007f39932d9000]
java.lang.Thread.State: RUNNABLE
at com.haodf.search.index.entity.Doctor.count(Doctor.java:177)
at com.haodf.search.index.entity.Doctor.toMap(Doctor.java:97)
at com.haodf.search.index.builder.DoctorIndexBuilder.bulidIndex(DoctorIndexBuilder.java:126)
at com.haodf.search.index.builder.DoctorIndexBuilder.lambda$dealMesageList$2(DoctorIndexBuilder.java:102)
at com.haodf.search.index.builder.DoctorIndexBuilder$$Lambda$34/431482904.accept(Unknown Source)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at com.haodf.search.index.builder.DoctorIndexBuilder.dealMesageList(DoctorIndexBuilder.java:101)
at com.haodf.search.index.builder.DoctorIndexBuilder.buildIndices(DoctorIndexBuilder.java:68)
at com.haodf.search.common.elastic.ElasticClient$1.run(ElasticClient.java:116)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x00000006ed3a47b0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
通过https://fastthread.io分析日志数据。重点关注CPU consuming threads的日志情况:
- 第六步:定位代码问题
public static double count(String text, String sub) {
double count = 0;
int start = 0;
while ((start = text.indexOf(sub, start)) >= 0) {
start = sub.length();
count ; // 177 line
}
return count;
}
分析:问题出现在这个while中text.indexOf(sub, start),当sub为空字符时,会持续循环。所以就导致线程陷入死循环。
- 第七步:改进
在while循环前,校验sub字符是否为空。