Woke up at 7 a.m. to a DingTalk alert on my phone reporting a ZooKeeper anomaly in the production cluster. I got up and backed up the logs of the important nodes (NameNode, ZooKeeper, etc.). By then many roles had already gone down: the ResourceManager was dead, the ZKFC was dead, and ZooKeeper itself was dead. Since both of those services depend on ZooKeeper, I simply restarted ZooKeeper and the cluster recovered. Below is the investigation into the root cause of this incident:
1. Timeline of the failure symptoms:
【0:19】ZooKeeper raised an alert: unable to determine whether this server is a leader or a follower (a manual check of each server's role is sketched after this timeline)
The health test result for ZOOKEEPER_SERVER_QUORUM_MEMBERSHIP has become concerning: Quorum membership status of this ZooKeeper server could not be determined. We last detected its status 1 minute(s), 4 second(s) ago. The result at that time was: This ZooKeeper server is a member of a quorum as a Follower.
【0:20:53】ZooKeeper triggered the maximum request latency health test
The health test result for ZOOKEEPER_SERVER_MAX_LATENCY has become unknown: Not enough data to test: Test of whether the ZooKeeper server's maximum request latency is too high.
【0:22】The ZooKeeper ensemble began reporting excessively long GC durations
The health test result for ZOOKEEPER_SERVER_GC_DURATION has become concerning: Average time spent in garbage collection was 22.3 second(s) (37.10%) per minute over the previous 5 minute(s). Warning threshold: 30.00%.
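(For reference, each server's quorum role can also be checked by hand with ZooKeeper's four-letter srvr command, assuming it is enabled and nc is installed; the hostnames and port below are placeholders for this cluster:)

# Ask each ZooKeeper server for its role and latency figures.
# zk01/zk02/zk03 and port 2181 are placeholders.
for host in zk01 zk02 zk03; do
    echo "== $host =="
    echo srvr | nc "$host" 2181 | grep -E 'Mode|Latency'
done
# Or run zkServer.sh status on each node.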
Checking the ZooKeeper server logs, we found disk write latency warnings at the time of the failure:
Looking at the ZooKeeper data directory, one transaction log file was noticeably larger than the others (normally around 50 MB; this file was 2.4 GB):
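(A minimal sketch of this check; the path below is an assumption, substitute the dataDir configured in zoo.cfg:)

# List transaction logs and snapshots in the ZooKeeper dataDir, largest first.
# /var/lib/zookeeper/version-2 is a placeholder path.
ls -lhS /var/lib/zookeeper/version-2/
du -sh /var/lib/zookeeper/version-2/log.*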
This transaction log is a binary file, but it can be inspected with the strings command. Parsing it revealed extremely long SQL queries against one particular table: a single SQL statement contained roughly 80,000 eleven-digit numbers, and because the query was issued repeatedly, about 160 million such values (roughly 2.4 GB) were written in a short period. This data rapidly filled the ZooKeeper heap; the severe memory pressure triggered full GCs, which eventually brought down the ZooKeeper ensemble. With the ensemble unable to serve requests, the ResourceManager, which depends on this ZooKeeper, entered an abnormal state and could no longer provide service.
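(A rough sketch of how the oversized log was inspected; the file name below is a placeholder for the 2.4 GB transaction log found above:)

# Dump printable strings from the binary txn log and surface unusually long entries.
# log.xxxxxxxx is a placeholder file name.
strings /var/lib/zookeeper/version-2/log.xxxxxxxx | awk 'length > 10000' | head
# Size sanity check: ~160,000,000 eleven-digit values at roughly 12-15 bytes
# each (digits plus delimiters) comes to about 1.9-2.4 GB, matching the file size.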
The follow-up remediation still comes down to regulating the people who build the production jobs, so that oversized requests like this are not written to ZooKeeper again.
We also tuned ZooKeeper's GC settings.
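(For reference, a minimal sketch of the kind of JVM tuning involved; the heap size and collector flags are illustrative assumptions, not the exact values we deployed. In Cloudera Manager they correspond to the ZooKeeper Server Java heap size and Java configuration options:)

# Example ZooKeeper Server JVM options (illustrative values only):
# a larger heap plus a low-pause collector to avoid long full-GC stalls.
JVMFLAGS="-Xms4g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"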