Analysis of a Hadoop YARN ResourceManager Crash Caused by ZooKeeper's Znode Data Size Limit (Part 2)

2019-08-21 10:32:07

Five months on (see the previous article), the problem in the title struck again. This time, thanks to improvements in our big-data monitoring, I was able to study it in more depth. Below is the full troubleshooting process and the fix:

I. Problem Description

We received the first ResourceManager alert at 8:12 AM on August 8. Through 8:00 AM on August 11, the ResourceManager failed repeatedly between 8:00 and 8:12 every morning, and occasionally around 8:00 PM and between 1:00 and 3:00 PM. Below are the ResourceManager abnormal-state counts collected by SpaceX (our monitoring platform):

II. Cause Analysis

1. Exception details

The excerpt below is from the logs between 8:00 PM and 8:12 PM on August 8; the exceptions from the other incidents are identical:

2019-08-08 20:12:18,681 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 544
2019-08-08 20:12:18,886 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.204.245.44/10.204.245.44:5181. Will not attempt to authenticate using SASL (unknown error)
2019-08-08 20:12:18,887 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.204.245.44/10.204.245.44:5181, initiating session
2019-08-08 20:12:18,887 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.204.245.44/10.204.245.44:5181, sessionid = 0x26c00dfd48e9068, negotiated timeout = 60000
2019-08-08 20:12:20,850 WARN org.apache.zookeeper.ClientCnxn: Session 0x26c00dfd48e9068 for server 10.204.245.44/10.204.245.44:5181, unexpected error, closing socket connection and attempting reconnect
java.lang.OutOfMemoryError: Java heap space
2019-08-08 20:12:20,951 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:989)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:986)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1128)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1161)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:986)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:1000)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1017)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:713)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:243)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:226)
	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:812)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:872)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:867)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
	at java.lang.Thread.run(Thread.java:745)

2. Root cause

The root cause is ZooKeeper's server-side limit on the data size of a single znode: it must not exceed 1 MB. When a client submits data larger than 1 MB, the ZK server throws the following exception:

Exception causing close of session 0x2690d678e98ae8b due to java.io.IOException: Len error 1788046
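
This 1 MB cap comes from ZooKeeper's jute.maxbuffer setting (default 0xfffff bytes, i.e. 1048575). A minimal sketch of the server-side length check, condensed from ZooKeeper 3.4.x's NIOServerCnxn.readLength (details vary by version):

// Condensed sketch of ZooKeeper's server-side request length check
// (after org.apache.zookeeper.server.NIOServerCnxn in 3.4.x); the server
// rejects any packet larger than jute.maxbuffer before deserializing it.
public class LenErrorSketch {
    // Mirrors org.apache.jute.BinaryInputArchive.maxBuffer
    static final int MAX_BUFFER = Integer.getInteger("jute.maxbuffer", 0xfffff);

    static void readLength(int len) throws java.io.IOException {
        if (len < 0 || len > MAX_BUFFER) {
            // This is the "Len error" seen in the ZK server log above;
            // the server then closes the client's connection.
            throw new java.io.IOException("Len error " + len);
        }
    }
}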

After the exception is thrown, YARN keeps retrying the ZK operation. The retry interval is short and the retry count is high, which eventually exhausts YARN's heap and takes the service down.
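
The retry loop lives in ZKRMStateStore.ZKAction.runWithRetries. A minimal standalone model of its behavior (the real method also handles fencing and session loss, and only retries connection-related KeeperExceptions):

import java.util.concurrent.Callable;

// Minimal standalone model of ZKRMStateStore.ZKAction.runWithRetries
// (Hadoop 2.x), shown only to illustrate the retry pressure on the heap.
public class ZkRetryLoopSketch {
    static final int NUM_RETRIES = 1000;       // yarn.resourcemanager.zk-num-retries default
    static final long RETRY_INTERVAL_MS = 60;  // zk-timeout-ms / zk-num-retries under HA (see below)

    static <T> T runWithRetries(Callable<T> zkOp) throws Exception {
        int retry = 0;
        while (true) {
            try {
                return zkOp.call();            // e.g. the ZooKeeper multi()/setData() call
            } catch (Exception e) {
                if (++retry >= NUM_RETRIES) {
                    throw e;                   // give up after the configured retry budget
                }
                // Produces the "Retrying operation on ZK. Retry no. N" lines in the log above
                System.out.println("Retrying operation on ZK. Retry no. " + retry);
                Thread.sleep(RETRY_INTERVAL_MS);
            }
        }
    }
}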

3. The failing YARN code

The method in org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore where the exception occurs:

    /**
     * Updates the stored state of an application attempt.
     *
     * @param appAttemptId
     * @param attemptStateDataPB
     * @throws Exception
     */
    @Override
    public synchronized void updateApplicationAttemptStateInternal(
            ApplicationAttemptId appAttemptId,
            ApplicationAttemptStateData attemptStateDataPB)
            throws Exception {
        String appIdStr = appAttemptId.getApplicationId().toString();
        String appAttemptIdStr = appAttemptId.toString();
        String appDirPath = getNodePath(rmAppRoot, appIdStr);
        String nodeUpdatePath = getNodePath(appDirPath, appAttemptIdStr);
        if (LOG.isDebugEnabled()) {
            LOG.debug("Storing final state info for attempt: " + appAttemptIdStr
                    + " at: " + nodeUpdatePath);
        }
        byte[] attemptStateData = attemptStateDataPB.getProto().toByteArray();

        if (existsWithRetries(nodeUpdatePath, true) != null) {
            setDataWithRetries(nodeUpdatePath, attemptStateData, -1);
        } else {
            createWithRetries(nodeUpdatePath, attemptStateData, zkAcl,
                    CreateMode.PERSISTENT);
            LOG.debug(appAttemptId + " znode didn't exist. Created a new znode to"
                    + " update the application attempt state.");
        }
    }
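
Note that nothing in this method guards against oversized payloads; the serialized ApplicationAttemptStateData is handed to ZK as-is. A hedged sketch of a defensive check one could bolt on before the write, reusing the variables from the method above (hypothetical, not part of stock ZKRMStateStore; the threshold mirrors jute.maxbuffer):

// Hypothetical guard, not in stock ZKRMStateStore: measure the serialized
// attempt state before writing, since the ZK server rejects anything over
// jute.maxbuffer with a "Len error" and closes the connection.
byte[] attemptStateData = attemptStateDataPB.getProto().toByteArray();
int maxZnodeBytes = Integer.getInteger("jute.maxbuffer", 0xfffff);
if (attemptStateData.length > maxZnodeBytes) {
    LOG.warn("Attempt state for " + appAttemptIdStr + " is "
            + attemptStateData.length + " bytes, which exceeds the ZK limit of "
            + maxZnodeBytes + " bytes; the write would fail with a Len error.");
}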

This method stores (or updates) an application attempt's retry state in ZK. While scheduling, YARN may retry an application attempt several times, depending on network, hardware, and resource conditions. If persisting the attempt state to ZK fails, org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.ZKAction.runWithRetries retries the operation, 1000 times by default, roughly as modeled in the standalone sketch earlier. The interval between retries depends on whether YARN HA is enabled, i.e. whether yarn.resourcemanager.ha.enabled is set to true in yarn-site.xml. The official documentation describes the retry interval as follows:

Retry interval in milliseconds when connecting to ZooKeeper. When HA is enabled, the value here is NOT used. It is generated automatically from yarn.resourcemanager.zk-timeout-ms and yarn.resourcemanager.zk-num-retries.

Depending on whether YARN HA is enabled, the retry interval is determined as follows:

(1) HA disabled:

Controlled by yarn.resourcemanager.zk-retry-interval-ms; our BI production environment uses the default of 1000, in milliseconds.

(2) HA enabled:

Derived from yarn.resourcemanager.zk-timeout-ms (the ZK session timeout) and yarn.resourcemanager.zk-num-retries (the number of retries after a failed operation), using the formula:

retry interval (yarn.resourcemanager.zk-retry-interval-ms) = yarn.resourcemanager.zk-timeout-ms (ZK session timeout) / yarn.resourcemanager.zk-num-retries (number of retries)

The interval is derived in org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.initInternal:

// Compute the interval, in milliseconds, between retries of ZK operations
if (HAUtil.isHAEnabled(conf)) { // under HA: retry interval = session timeout / number of retries
    zkRetryInterval = zkSessionTimeout / numRetries;
} else {
    zkRetryInterval =
            conf.getLong(YarnConfiguration.RM_ZK_RETRY_INTERVAL_MS,
                    YarnConfiguration.DEFAULT_RM_ZK_RETRY_INTERVAL_MS);
}

The BI production settings:

  • yarn.resourcemanager.zk-timeout-ms: 60000, in milliseconds
  • yarn.resourcemanager.zk-num-retries: default value of 1000 retries

The retry interval in BI production is therefore 60000 / 1000 = 60 ms: when persisting attempt state fails, YARN retries 1000 times, 60 ms apart, re-submitting the oversized payload on every attempt. Frighteningly, this eventually overflows the ResourceManager's 10 GB heap (4 GB young generation + 6 GB old generation). Below is the JVM data SpaceX captured while these high-frequency retries were running with the two parameters above:

(1) Heap usage:

(2) GC count:

(3) Full GC time:

III. Fixes

1. Cap the number of completed applications YARN keeps in ZK. The default of 10000 completed applications leaves YARN with a mass of useless watchers registered in ZK, straining ZK's memory and driving up its load. The two knobs are yarn.resourcemanager.state-store.max-completed-applications and yarn.resourcemanager.max-completed-applications; the adjusted values:

<!-- Maximum number of completed applications kept in ZK -->
<property>
  <name>yarn.resourcemanager.state-store.max-completed-applications</name>
  <value>2000</value>
</property>

<!-- Maximum number of completed applications kept in RM memory; set equal to the ZK value so the two stay consistent -->
<property>
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>2000</value>
</property>

The application state YARN keeps in ZK (under RM_APP_ROOT) is structured as follows:

    ROOT_DIR_PATH
      |--- VERSION_INFO
      |--- EPOCH_NODE
      |--- RM_ZK_FENCING_LOCK
      |--- RM_APP_ROOT
      |     |----- (#ApplicationId1)
      |     |        |----- (#ApplicationAttemptIds)
      |     |
      |     |----- (#ApplicationId2)
      |     |       |----- (#ApplicationAttemptIds)
      |     ....
      |
      |--- RM_DT_SECRET_MANAGER_ROOT
      |----- RM_DT_SEQUENTIAL_NUMBER_ZNODE_NAME
      |----- RM_DELEGATION_TOKENS_ROOT_ZNODE_NAME
      |       |----- Token_1
      |       |----- Token_2
      |       ....
      |
      |----- RM_DT_MASTER_KEYS_ROOT_ZNODE_NAME
      |      |----- Key_1
      |      |----- Key_2
      ....
      |--- AMRMTOKEN_SECRET_MANAGER_ROOT
      |----- currentMasterKey
      |----- nextMasterKey

Data structures determine algorithms. As the layout shows, one application ID (ApplicationId) maps to multiple attempt IDs (ApplicationAttemptId), and ZKRMStateStore registers a watcher on each of these znodes, so the more znodes there are, the more watchers, and the more ZK heap they consume. BI production runs roughly 7000 applications a day, so we lowered both parameters to 2000. The change has no effect on the state of running applications, for the following reasons:

(1) The operations on the org.apache.hadoop.yarn.server.resourcemanager.RMAppManager fields completedAppsInStateStore and completedApps show that the two settings only govern completed applications. The relevant code:

protected int completedAppsInStateStore = 0; // count of completed apps kept in the state store; incremented when an app finishes
private LinkedList<ApplicationId> completedApps = new LinkedList<ApplicationId>(); // IDs of completed apps; entries are removed on eviction

 /**
   * Records a completed application.
   * @param applicationId
   */
  protected synchronized void finishApplication(ApplicationId applicationId) {
    if (applicationId == null) {
      LOG.error("RMAppManager received completed appId of null, skipping");
    } else {
      // Inform the DelegationTokenRenewer
      if (UserGroupInformation.isSecurityEnabled()) {
        rmContext.getDelegationTokenRenewer().applicationFinished(applicationId);
      }
      
      completedApps.add(applicationId);
      completedAppsInStateStore++;
      writeAuditLog(applicationId);
    }
  }

  /*
   * check to see if hit the limit for max # completed apps kept
   *
   * Checks whether the number of completed apps kept in memory and in ZK
   * exceeds the limits and, if so, evicts completed apps' state.
   */
  protected synchronized void checkAppNumCompletedLimit() {
    // check apps kept in state store.
    while (completedAppsInStateStore > this.maxCompletedAppsInStateStore) {
      ApplicationId removeId =
          completedApps.get(completedApps.size() - completedAppsInStateStore);
      RMApp removeApp = rmContext.getRMApps().get(removeId);
      LOG.info("Max number of completed apps kept in state store met:"
            " maxCompletedAppsInStateStore = "   maxCompletedAppsInStateStore
            ", removing app "   removeApp.getApplicationId()
            " from state store.");
      rmContext.getStateStore().removeApplication(removeApp);
      completedAppsInStateStore--;
    }

    // check apps kept in memory.
    while (completedApps.size() > this.maxCompletedAppsInMemory) {
      ApplicationId removeId = completedApps.remove();
      LOG.info("Application should be expired, max number of completed apps"
            " kept in memory met: maxCompletedAppsInMemory = "
            this.maxCompletedAppsInMemory   ", removing app "   removeId
            " from memory: ");
      rmContext.getRMApps().remove(removeId);
      this.applicationACLsManager.removeApplication(removeId);
    }
  }

(2) Before the change, YARN kept the default maximum of 10000 completed applications in ZK, and zkdoctor showed exactly 10000 children under /bi-rmstore-20190811-1/ZKRMStateRoot/RMAppRoot. After lowering the limit, zkdoctor showed 2015 children while the YARN UI reported 15 running applications at that moment; in other words, the node holds the state of running applications plus the capped set of completed ones. The zkdoctor data:
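
Without zkdoctor, the same check can be done with a plain ZooKeeper client. A minimal sketch, assuming our cluster's connection string and state-store path:

import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: count the application znodes under the RM state store.
// The connection string and root path below are from our environment.
public class CountRmAppZnodes {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("10.204.245.44:5181", 60000, event -> { });
        String appRoot = "/bi-rmstore-20190811-1/ZKRMStateRoot/RMAppRoot";
        // false = do not register a watcher; we only want the count
        int apps = zk.getChildren(appRoot, false).size();
        System.out.println("App znodes under RMAppRoot: " + apps);
        zk.close();
    }
}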

From this we can summarize YARN's mechanism for storing and removing application state:

  • When a new application arrives, YARN stores its state via ZKRMStateStore's storeApplicationStateInternal method
  • When the yarn.resourcemanager.state-store.max-completed-applications limit is exceeded, YARN removes a completed application's state via RMStateStore's removeApplication method

RMStateStore is the parent class of ZKRMStateStore, and both methods are marked synchronized, so the two operations are serialized and independent of each other; neither interferes with running applications.

2. Fix the short retry interval that strained the YARN heap and triggered frequent GC:

<!-- Default is 1000; set to 100 to throttle reconnection attempts to ZK. Under HA, retry interval (yarn.resourcemanager.zk-retry-interval-ms) = yarn.resourcemanager.zk-timeout-ms (ZK session timeout) / yarn.resourcemanager.zk-num-retries (retry count) -->
<property>
  <name>yarn.resourcemanager.zk-num-retries</name>
  <value>100</value>
</property>

After the change, YARN's ZK retry interval in BI production is 60000 / 100 = 600 ms. The JVM data SpaceX captured the next time the problem occurred:

(1) Heap usage:

(2) GC count:

(3) Full GC time:

The monitoring data shows that, with the larger retry interval, heap usage, GC counts, and GC time all looked healthier when the problem struck.

3. Fix the attempt state data exceeding 1 MB:

Changing the relevant YARN logic would interfere with YARN's application recovery mechanism, so the fix is confined to ZK server-side and client-side configuration:

(1) Raise jute.maxbuffer on the ZK servers to 3 MB (for example, by adding -Djute.maxbuffer=3145728 to the ZK server's JVM flags);

(2) Edit yarn-env.sh and add -Djute.maxbuffer=3145728 to both YARN_OPTS and YARN_RESOURCEMANAGER_OPTS, which raises the maximum payload the ZK client will send to the server to 3 MB. The configuration after the change:

YARN_OPTS="$YARN_OPTS -Dyarn.policy.file=$YARN_POLICYFILE -Djute.maxbuffer=3145728"

YARN_RESOURCEMANAGER_OPTS="-server -Xms10240m -Xmx10240m -Xmn4048m -Xss512k -verbose:gc -Xloggc:$YARN_LOG_DIR/gc_resourcemanager.log-`date  '%Y%m%d%H%M'` -XX: PrintGCDateStamps -XX: PrintGCDetails -XX:SurvivorRatio=8 -XX: UseParNewGC -XX: UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX: UseCMSCompactAtFullCollection 
-XX:CMSFullGCsBeforeCompaction=0 -XX: CMSClassUnloadingEnabled -XX: CMSParallelRemarkEnabled -XX: UseCMSInitiatingOccupancyOnly -XX: DisableExplicitGC -XX: HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$YARN_LOG_DIR -Djute.maxbuffer=3145728 $YARN_RESOURCEMANAGER_OPTS"

After the changes, restart the ResourceManager service and the ZK ensemble for the configuration to take effect.
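
To confirm the new limit end to end, a client started with the same jute.maxbuffer override can write an oversized payload; a minimal sketch (the znode path and payload size are illustrative):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: verify that a ~2 MB znode is now accepted. jute.maxbuffer is read
// as a JVM system property, so it must be set before the client classes load
// (here via System.setProperty; -Djute.maxbuffer=3145728 also works).
public class JuteMaxBufferCheck {
    public static void main(String[] args) throws Exception {
        System.setProperty("jute.maxbuffer", "3145728");
        ZooKeeper zk = new ZooKeeper("10.204.245.44:5181", 60000, event -> { });
        byte[] payload = new byte[2 * 1024 * 1024]; // 2 MB, over the old 1 MB cap
        zk.create("/jute-maxbuffer-check", payload,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("2 MB write accepted");
        zk.close();
    }
}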

IV. Takeaways

1. Hadoop's logging is thorough; the log is a complete stream of events. When a problem strikes, read it carefully and follow the clues.

2. The ZK ensemble YARN uses today is shared with HBase and other services. As the cluster and its data volume grow, this will weigh on ZK's performance, so we recommend giving YARN its own dedicated ZK ensemble rather than sharing one with other ZK-heavy applications.

3. Raising ZK's maximum znode size to 3 MB has a performance cost, for example in cluster synchronization and request processing, so monitoring of foundational services like ZK must be thorough to keep them highly available.

V. References

Troubleshooting frequent active/standby failover of the YARN ResourceManager

YARN source code analysis (3): application state storage and recovery in ResourceManager HA

Official YARN issues:

(1) On optimizing the structure of the node data stored in ZK: Limit application resource reservation on nodes for non-node/rack specific requests

(2) The issue for ZKRMStateStore crashing the ResourceManager when updating znode data larger than 1 MB: ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

