时隔五个月(点击阅读前文),如标题所示的问题再次发生,本次由于我们大数据监控系统的完善,让我对该问题进行了更深一步的研究。以下是整个排查过程和解决方案:
一、问题说明
从8月8日早上8点12
收到第一条ResourceManager
服务异常报警,截止到8月11日早上8点
,每天早上8点到8点12
之间频繁出现ResourceManager
服务异常问题,晚上8点和下午1-3点
偶尔出现该问题。以下是SpaceX
统计出的ResourceManager
状态异常次数数据:
二、异常原因
1、异常信息
以下截取的是8月8日20点至20点12
之间的日志,其他时间段出现问题时的异常信息与此信息一样:
2019-08-08 20:12:18,681 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 544
2019-08-08 20:12:18,886 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.204.245.44/10.204.245.44:5181. Will not attempt to authenticate using SASL (unknown error)
2019-08-08 20:12:18,887 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.204.245.44/10.204.245.44:5181, initiating session
2019-08-08 20:12:18,887 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.204.245.44/10.204.245.44:5181, sessionid = 0x26c00dfd48e9068, negotiated timeout = 60000
2019-08-08 20:12:20,850 WARN org.apache.zookeeper.ClientCnxn: Session 0x26c00dfd48e9068 for server 10.204.245.44/10.204.245.44:5181, unexpected error, closing socket connection and attempting reconnect
java.lang.OutOfMemoryError: Java heap space
2019-08-08 20:12:20,951 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:989)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:986)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1128)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1161)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:986)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:1000)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1017)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:713)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:243)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:226)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:812)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:872)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:867)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
2、异常原因
主要是由于ZK
服务端限制单个节点数据量大小不能超过1M
导致,客户端提交的数据超过1M
后ZK
服务端会抛出如下异常:
Exception causing close of session 0x2690d678e98ae8b due to java.io.IOException: Len error 1788046
抛出异常后,YARN
会不断地对ZK
进行重试操作,重试间隔短,重试次数多,使YARN
内存溢出,不能正常提供服务。
3、YARN异常代码
以下是org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
中发生异常的代码的方法:
/**
* 更新任务重试信息
*
* @param appAttemptId
* @param attemptStateDataPB
* @throws Exception
*/
@Override
public synchronized void updateApplicationAttemptStateInternal(
ApplicationAttemptId appAttemptId,
ApplicationAttemptStateData attemptStateDataPB)
throws Exception {
String appIdStr = appAttemptId.getApplicationId().toString();
String appAttemptIdStr = appAttemptId.toString();
String appDirPath = getNodePath(rmAppRoot, appIdStr);
String nodeUpdatePath = getNodePath(appDirPath, appAttemptIdStr);
if (LOG.isDebugEnabled()) {
LOG.debug("Storing final state info for attempt: " appAttemptIdStr
" at: " nodeUpdatePath);
}
byte[] attemptStateData = attemptStateDataPB.getProto().toByteArray();
if (existsWithRetries(nodeUpdatePath, true) != null) {
setDataWithRetries(nodeUpdatePath, attemptStateData, -1);
} else {
createWithRetries(nodeUpdatePath, attemptStateData, zkAcl,
CreateMode.PERSISTENT);
LOG.debug(appAttemptId " znode didn't exist. Created a new znode to"
" update the application attempt state.");
}
}
这段代码主要是执行更新或添加任务重试状态信息到ZK
中的操作,YARN
在调度任务过程中,可能会对任务进行多次重试,主要受网络、硬件、资源等因素影响。如果任务重试信息保存ZK
失败,会调用org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.ZKAction.runWithRetries
方法重试。默认重试1000
次,每次重试间隔受是否启用YARN
高可用影响,也就是yarn-site.xml
中的yarn.resourcemanager.ha.enabled
参数是否为true
。该重试间隔官方解释如下:
Retry interval in milliseconds when connecting to ZooKeeper. When HA is enabled, the value here is NOT used. It is generated automatically from yarn.resourcemanager.zk-timeout-ms and yarn.resourcemanager.zk-num-retries.
在是否启用YARN
高可用条件下,重试间隔机制如下:
(1)未启用YARN
高可用:
受yarn.resourcemanager.zk-retry-interval-ms
控制,该参数在BI
生产环境使用默认值1000
,单位为毫秒。
(2)启用YARN
高可用:
受yarn.resourcemanager.zk-timeout-ms
(ZK
会话超时时间)和yarn.resourcemanager.zk-num-retries
(操作失败后重试次数)参数控制,计算公式为:
重试时间间隔(yarn.resourcemanager.zk-retry-interval-ms )=yarn.resourcemanager.zk-timeout-ms(ZK session超时时间)/yarn.resourcemanager.zk-num-retries(重试次数)
重试间隔确定过程在org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.initInternal
方法源码为:
// 计算重试连接ZK的时间间隔,以毫秒表示
if (HAUtil.isHAEnabled(conf)) { // 高可用情况下是:重试时间间隔=session超时时间/重试ZK的次数
zkRetryInterval = zkSessionTimeout / numRetries;
} else {
zkRetryInterval =
conf.getLong(YarnConfiguration.RM_ZK_RETRY_INTERVAL_MS,
YarnConfiguration.DEFAULT_RM_ZK_RETRY_INTERVAL_MS);
}
BI
生产环境的配置:
yarn.resourcemanager.zk-timeout-ms
:60000
,单位毫秒yarn.resourcemanager.zk-num-retries
:使用默认值1000
,单位次
因此,BI
生产环境的重试间隔为60000/1000=60
,在保存任务状态不成功的条件下,会重试1000
次,每次间隔60
毫秒。很可怕,最终会导致YARN
堆内存(10G=4G[新生代] 6G[老年代]
)溢出。以下是SpaceX
监控到的使用以上2
个参数执行高频重试操作时JVM
的监控数据:
(1)堆内存使用量:
(2)GC
次数:
(3)Full GC
时间:
三、解决办法
1、调整YARN
在ZK
中保存的已完成任务数量参数,解决ZK
中保存太多已完成任务信息(默认值为10000
)使YARN
在ZK
中注册过多无用的watcher
,导致ZK
内存紧张,负载加大的问题。主要调整yarn.resourcemanager.state-store.max-completed-applications
和yarn.resourcemanager.max-completed-applications
参数,以下是调整后的参数值:
<!--ZK保存的已完成任务的最大数量-->
<property>
<name>yarn.resourcemanager.state-store.max-completed-applications</name>
<value>2000</value>
</property>
<!--RM内存中保存的已完成任务的最大数量,调整该参数主要是为了RM内存与ZK中保存的任务信息和数量一致-->
<property>
<name>yarn.resourcemanager.max-completed-applications</name>
<value>2000</value>
</property>
YARN
在ZK
中保存的任务状态信息(RM_APP_ROOT
)结构如下:
ROOT_DIR_PATH
|--- VERSION_INFO
|--- EPOCH_NODE
|--- RM_ZK_FENCING_LOCK
|--- RM_APP_ROOT
| |----- (#ApplicationId1)
| | |----- (#ApplicationAttemptIds)
| |
| |----- (#ApplicationId2)
| | |----- (#ApplicationAttemptIds)
| ....
|
|--- RM_DT_SECRET_MANAGER_ROOT
|----- RM_DT_SEQUENTIAL_NUMBER_ZNODE_NAME
|----- RM_DELEGATION_TOKENS_ROOT_ZNODE_NAME
| |----- Token_1
| |----- Token_2
| ....
|
|----- RM_DT_MASTER_KEYS_ROOT_ZNODE_NAME
| |----- Key_1
| |----- Key_2
....
|--- AMRMTOKEN_SECRET_MANAGER_ROOT
|----- currentMasterKey
|----- nextMasterKey
数据结构决定算法实现。从以上结构可以看出,一个任务ID
(ApplicationId
)会对应多个任务重试信息ID
(ApplicationAttemptId
),ZKRMStateStore
中对这些节点都注册了watcher
,因此节点太多会导致watcher
数量增加,消耗过多ZK
堆内存。BI
生产环境YARN
每天运行任务7000
左右,因此这里将以上两个参数调小为2000
,调整不会对运行时的任务状态信息产生影响。具体原因如下:
(1)从org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
类中与成员变量completedAppsInStateStore
和completedApps
相关的操作可以看出,以上两个配置保存的是已完成任务的信息。相关代码如下:
protected int completedAppsInStateStore = 0; //记录已完成任务的信息,任务完成自动加1
private LinkedList<ApplicationId> completedApps = new LinkedList<ApplicationId>();// 记录已完成任务的任务ID,任务完成执行remove
/**
* 保存已完成任务信息
* @param applicationId
*/
protected synchronized void finishApplication(ApplicationId applicationId) {
if (applicationId == null) {
LOG.error("RMAppManager received completed appId of null, skipping");
} else {
// Inform the DelegationTokenRenewer
if (UserGroupInformation.isSecurityEnabled()) {
rmContext.getDelegationTokenRenewer().applicationFinished(applicationId);
}
completedApps.add(applicationId);
completedAppsInStateStore ;
writeAuditLog(applicationId);
}
}
/*
* check to see if hit the limit for max # completed apps kept
*
* 检查存储在内存和ZK中已完成应用的数量是否超过最大限制,超过限制就执行移除已完成任务信息操作
*/
protected synchronized void checkAppNumCompletedLimit() {
// check apps kept in state store.
while (completedAppsInStateStore > this.maxCompletedAppsInStateStore) {
ApplicationId removeId =
completedApps.get(completedApps.size() - completedAppsInStateStore);
RMApp removeApp = rmContext.getRMApps().get(removeId);
LOG.info("Max number of completed apps kept in state store met:"
" maxCompletedAppsInStateStore = " maxCompletedAppsInStateStore
", removing app " removeApp.getApplicationId()
" from state store.");
rmContext.getStateStore().removeApplication(removeApp);
completedAppsInStateStore--;
}
// check apps kept in memorty.
while (completedApps.size() > this.maxCompletedAppsInMemory) {
ApplicationId removeId = completedApps.remove();
LOG.info("Application should be expired, max number of completed apps"
" kept in memory met: maxCompletedAppsInMemory = "
this.maxCompletedAppsInMemory ", removing app " removeId
" from memory: ");
rmContext.getRMApps().remove(removeId);
this.applicationACLsManager.removeApplication(removeId);
}
}
(2)修改前,YARN
在ZK
中保存的最大已完成任务信息数量使用默认值10000
,在zkdoctor
中查看/bi-rmstore-20190811-1/ZKRMStateRoot/RMAppRoot
子节点个数为10000
。调小后,在zkdoctor
中查看/bi-rmstore-20190811-1/ZKRMStateRoot/RMAppRoot
子节点个数为2015
,YARN
监控页面的实时数据显示当时运行15
个任务,那么也就是说,YARN
在该节点下保存的是运行中的任务和已完成任务的状态信息。zkdoctor
监控数据如下:
由此可以总结出,YARN
保存和移除任务状态的机制:
- 有新任务时,
YARN
使用ZKRMStateStore
的storeApplicationStateInternal
方法保存新任务的状态 - 当超过
yarn.resourcemanager.state-store.max-completed-applications
参数限制时,YARN
使用RMStateStore
的removeApplication
方法删除已完成任务的状态
RMStateStore
是ZKRMStateStore
的父类,以上两个方法都加了synchronized
同步关键字,两种操作相互独立,互不干扰,因此不会对YARN
中运行的任务产生影响。
2、解决重试间隔太短,导致YARN
堆内存紧张、GC
频繁问题:
<!--默认1000,这里设置成100是为了控制重试连接ZK的频率,高可用情况下,重试频率(yarn.resourcemanager.zk-retry-interval-ms )=yarn.resourcemanager.zk-timeout-ms(ZK session超时时间)/yarn.resourcemanager.zk-num-retries(重试次数)-->
<property>
<name>yarn.resourcemanager.zk-num-retries</name>
<value>100</value>
</property>
调整后,BI
生产环境YARN
连接ZK
的重试间隔是:60000/100=600
毫秒。SpaceX
监控到发生问题时的JVM
数据如下:
(1)堆内存使用量:
(2)GC
次数:
(3)Full GC
时间:
从监控数据可以看出,发生问题时,由于调大了重试间隔,JVM
堆内存使用、GC
次数以及时间消耗情况有所好转。
3、解决任务重试状态数据超过1M
的问题:
修改YARN
相关的逻辑会影响YARN
任务恢复机制,因此只能修改ZK
的服务端的配置和客户端的配置来解决此问题,修改方式如下:
(1)ZK
服务端jute.maxbuffer
参数大小调大至3M
(2)修改yarn-env.sh
,在YARN_OPTS
和YARN_RESOURCEMANAGER_OPTS
配置-Djute.maxbuffer=3145728
参数,该配置表示ZK
客户端提交给ZK
服务端的数据量最大为3M
。修改后的配置如下:
YARN_OPTS="$YARN_OPTS -Dyarn.policy.file=$YARN_POLICYFILE -Djute.maxbuffer=3145728"
YARN_RESOURCEMANAGER_OPTS="-server -Xms10240m -Xmx10240m -Xmn4048m -Xss512k -verbose:gc -Xloggc:$YARN_LOG_DIR/gc_resourcemanager.log-`date '%Y%m%d%H%M'` -XX: PrintGCDateStamps -XX: PrintGCDetails -XX:SurvivorRatio=8 -XX: UseParNewGC -XX: UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX: UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0 -XX: CMSClassUnloadingEnabled -XX: CMSParallelRemarkEnabled -XX: UseCMSInitiatingOccupancyOnly -XX: DisableExplicitGC -XX: HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$YARN_LOG_DIR -Djute.maxbuffer=3145728 $YARN_RESOURCEMANAGER_OPTS"
修改完成后,重启ResourceManager
服务和ZK
服务,使配置生效。
四、总结
1、Hadoop
的日志机制很完善,整个日志信息就是一个完整的事件流,因此遇到问题,一定要仔细阅读Hadoop
的日志信息,从中找到蛛丝马迹。
2、现在YARN
使用的这套ZK
集群,有HBase
和其他服务也在使用,随着集群规模的扩大和数据量的增长,会对ZK
产生一定的性能影响,因此建议给YARN
单独搭建一套ZK
使用,不要和会对ZK
产生高负载的应用共用一套ZK
集群。
3、调整ZK
的节点数据量最大为3M
,会对ZK
产生一定的性能影响,比如集群同步、请求处理,因此一定要完善ZK
这种基础服务的监控,保障高可用。
五、参考资料
yarn ResourceManager Active频繁易主问题排查
YARN源码分析(三)-----ResourceManager HA之应用状态存储与恢复
YARN
官方issue
:
(1)关于优化保存ZK
中的节点数据结构的issue
:Limit application resource reservation on nodes for non-node/rack specific requests
(2)ZKRMStateStore
更新数据超过1MB
引发ResourceManager
异常的issue
: ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
我的博客即将同步至腾讯云 社区,邀请大家一同入驻:https://cloud.tencent.com/developer/support-plan?invite_code=3t9oatmkekmc4