HBase Client Retry Mechanism

Contents

- Background
- Code walkthrough
- Important parameters
- Recommended settings

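Background

When an HBase cluster fails and the HBase client cannot connect to a region server, a misconfigured retry policy means the program does not throw an exception right away. It keeps retrying instead, so the failure alert never fires. This article describes the client retry mechanism and the parameters that control it.

Code walkthrough

The callWithRetries function in RpcRetryingCaller.java implements the RPC retry mechanism. The source below is from HBase 1.2.1.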

/**
* Retries if invocation fails.
* @param callTimeout Timeout for this call
* @param callable The {@link RetryingCallable} to run.
* @return an object of type T
* @throws IOException if a remote or network exception occurs
* @throws RuntimeException other unspecified error
*/
public T callWithRetries(RetryingCallable<T> callable, int callTimeout)
throws IOException, RuntimeException {
List<RetriesExhaustedException.ThrowableWithExtraContext> exceptions =
  new ArrayList<RetriesExhaustedException.ThrowableWithExtraContext>();
this.globalStartTime = EnvironmentEdgeManager.currentTime();
context.clear();
for (int tries = 0;; tries++) {
  long expectedSleep;
  try {
    callable.prepare(tries != 0); // if called with false, check table status on ZK
    interceptor.intercept(context.prepare(callable, tries));
    return callable.call(getRemainingTime(callTimeout));
  } catch (PreemptiveFastFailException e) {
    throw e;
  } catch (Throwable t) {
    ExceptionUtil.rethrowIfInterrupt(t);
    if (tries > startLogErrorsCnt) {
      LOG.info("Call exception, tries=" + tries + ", retries=" + retries + ", started=" +
          (EnvironmentEdgeManager.currentTime() - this.globalStartTime) + " ms ago, "
          + "cancelled=" + cancelled.get() + ", msg="
          + callable.getExceptionMessageAdditionalDetail());
    }

    // translateException throws exception when should not retry: i.e. when request is bad.
    interceptor.handleFailure(context, t);
    t = translateException(t);
    callable.throwable(t, retries != 1);
    RetriesExhaustedException.ThrowableWithExtraContext qt =
        new RetriesExhaustedException.ThrowableWithExtraContext(t,
            EnvironmentEdgeManager.currentTime(), toString());
    exceptions.add(qt);
    if (tries >= retries - 1) {
      throw new RetriesExhaustedException(tries, exceptions);
    }
    // If the server is dead, we need to wait a little before retrying, to give
    //  a chance to the regions to be
    // tries hasn't been bumped up yet so we use "tries + 1" to get right pause time
    expectedSleep = callable.sleep(pause, tries + 1);

    // If, after the planned sleep, there won't be enough time left, we stop now.
    long duration = singleCallDuration(expectedSleep);
    if (duration > callTimeout) {
      String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
          ": " + callable.getExceptionMessageAdditionalDetail();
      throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
    }
  } finally {
    interceptor.updateFailureInfo(context);
  }
  try {
    if (expectedSleep > 0) {
      synchronized (cancelled) {
        if (cancelled.get()) return null;
        cancelled.wait(expectedSleep);
      }
    }
    if (cancelled.get()) return null;
  } catch (InterruptedException e) {
    throw new InterruptedIOException("Interrupted after " + tries + " tries  on " + retries);
  }
}
}

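If the network misbehaves while an HBase client request is in flight and the RPC fails, the request enters the retry logic. Under HBase's retry (backoff) mechanism, the thread sleeps between two consecutive attempts, which is the cancelled.wait(expectedSleep) call. When that sleep is long, the thread sits in the TIMED_WAITING state for the whole time. The sleep time is set by expectedSleep = callable.sleep(pause, tries + 1). With the default settings the largest expectedSleep is 20s (100ms × 200) and the whole retry sequence lasts about 8min, which means the global lock is also held for about 8min.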

Important parameters

hbase.client.pause

The wait time before a failed request is retried. The wait grows with the number of retries and is computed as follows:

public static int RETRY_BACKOFF[] = { 1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200 };

long normalPause = pause * HConstants.RETRY_BACKOFF[ntries];
long jitter = (long)(normalPause * RANDOM.nextFloat() * 0.01f);

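So with 10 retries and hbase.client.pause=50ms, the successive waits are {50, 100, 150, 250, 500, 1000, 2000, 5000, 5000, 5000} ms, before the jitter of at most 1% is added.

The default is 100ms. It can be lowered to 50ms or even less.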
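
The calculation above can be checked with a small standalone sketch. It is not HBase code: the class and method names are made up for illustration, the jitter term is left out, and the index is assumed to be clamped to the last table entry once the retry count exceeds the table length (which is what the repeated 10000 ms values in the recommendations at the end of this article imply).

```java
import java.util.Arrays;

// Illustrative helper, not part of the HBase API.
class BackoffSchedule {
    // Multipliers copied from the RETRY_BACKOFF table quoted above.
    static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    // Nominal pause in ms before retry number ntries (0-based), ignoring jitter.
    // Assumption: indexes past the end of the table reuse the last multiplier.
    static long pauseMillis(long pause, int ntries) {
        int idx = Math.min(ntries, RETRY_BACKOFF.length - 1);
        return pause * RETRY_BACKOFF[idx];
    }

    // Nominal pauses for the first count retries.
    static long[] schedule(long pause, int count) {
        long[] out = new long[count];
        for (int i = 0; i < count; i++) {
            out[i] = pauseMillis(pause, i);
        }
        return out;
    }

    public static void main(String[] args) {
        // hbase.client.pause = 50 ms, 10 retries
        System.out.println(Arrays.toString(schedule(50, 10)));
        // prints [50, 100, 150, 250, 500, 1000, 2000, 5000, 5000, 5000]
    }
}
```

With pause set to 100 the last entries become 20000 ms, which matches the 20s maximum sleep mentioned in the code walkthrough.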

hbase.client.retries.number

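The number of retries on failure; the default is 31. It can be lowered to suit the application. For example, if the application's overall timeout is 3s, then by the wait-time calculation above the retry count can be reduced to 3.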

hbase.rpc.timeout

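The timeout for a single RPC request. If an RPC takes longer than this value, the client closes the socket.

The default is 1min. For an online service, set it according to the application's own timeout: if the application's overall timeout is 3s, this value should be 3s or less.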

hbase.client.operation.timeout

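The total timeout for one data operation issued by the HBase client, from sending the request to receiving the response. Data operations include get, append, increment, delete, put and so on. The difference from hbase.rpc.timeout is that hbase.rpc.timeout bounds a single RPC call, while hbase.client.operation.timeout bounds the whole operation (from the first call until it fails after n retries).

An example: for a Put request, the client first wraps the request in a caller object, which sends the RPC to the server. If the server happens to be in a severe Full GC at that moment and the RPC times out with a SocketTimeoutException, the limit that was hit is hbase.rpc.timeout. If instead a network glitch occurs right after the caller sends the RPC and a network exception is thrown, the HBase client retries. If, after several retries, the total operation time runs out and a SocketTimeoutException is raised, the limit that was hit is hbase.client.operation.timeout.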
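
How the retry count and the operation timeout interact can be seen in a simplified simulation of the loop quoted in the code walkthrough. This is only a sketch under stated assumptions: it uses a virtual clock instead of sleeping, every attempt is assumed to fail after a fixed attemptMillis (which in practice is at most hbase.rpc.timeout), the first pause is taken as 1 × pause to match the schedule listed in this article, and all class, enum and method names are hypothetical.

```java
// Illustrative simulation only; none of these names exist in HBase.
class RetryLoopSketch {
    static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    enum Outcome { RETRIES_EXHAUSTED, OPERATION_TIMEOUT }

    static final class Result {
        final Outcome outcome;
        final int attempts;
        final long elapsedMillis;

        Result(Outcome outcome, int attempts, long elapsedMillis) {
            this.outcome = outcome;
            this.attempts = attempts;
            this.elapsedMillis = elapsedMillis;
        }
    }

    // Nominal pause without jitter; the index is clamped to the table length (assumption).
    static long pauseMillis(long pause, int tries) {
        return pause * RETRY_BACKOFF[Math.min(tries, RETRY_BACKOFF.length - 1)];
    }

    static Result simulate(long pause, int retries, long operationTimeout, long attemptMillis) {
        long elapsed = 0; // virtual clock, nothing actually sleeps
        for (int tries = 0; ; tries++) {
            elapsed += attemptMillis; // the attempt runs and fails
            if (tries >= retries - 1) {
                // Same condition that raises RetriesExhaustedException in the quoted loop.
                return new Result(Outcome.RETRIES_EXHAUSTED, tries + 1, elapsed);
            }
            long expectedSleep = pauseMillis(pause, tries);
            if (elapsed + expectedSleep > operationTimeout) {
                // Same check that raises SocketTimeoutException in the quoted loop.
                return new Result(Outcome.OPERATION_TIMEOUT, tries + 1, elapsed);
            }
            elapsed += expectedSleep;
        }
    }

    public static void main(String[] args) {
        Result a = simulate(50, 3, 3000, 10);
        System.out.println(a.outcome + " after " + a.attempts + " attempts, " + a.elapsedMillis + " ms");
        Result b = simulate(100, 31, 3000, 10);
        System.out.println(b.outcome + " after " + b.attempts + " attempts, " + b.elapsedMillis + " ms");
    }
}
```

In the first call the retry count is the limit that ends the operation; in the second, with the default pause and retry count but a 3s operation timeout, the loop stops early because the next planned sleep would overrun the budget.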

hbase.client.scanner.timeout.period

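The timeout for each RPC call issued by a scan, measured from sending the call to receiving the response. A single scan operation here means one RPC to a region server: depending on the caching and batch settings of the scan, HBase splits a scan into multiple RPCs. For example, if 10000 rowkeys match the scan and caching=200, fetching all results takes 50 RPC calls. This parameter is the maximum response time of each of those 50 calls, not of the scan as a whole.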
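
The arithmetic in this example can be written out as a short sketch. The names are hypothetical, the count covers only the RPCs that carry rows (as in the example above), and the 60000 ms timeout passed in main is just an illustrative value.

```java
// Illustrative arithmetic only, not an HBase API.
class ScanRpcEstimate {
    // Number of row-carrying RPCs needed to return rows rows at caching rows per RPC.
    static long rpcCount(long rows, int caching) {
        return (rows + caching - 1) / caching; // ceiling division
    }

    // Upper bound on total wait if every RPC took right up to the per-RPC timeout.
    static long worstCaseMillis(long rows, int caching, long scannerTimeoutMillis) {
        return rpcCount(rows, caching) * scannerTimeoutMillis;
    }

    public static void main(String[] args) {
        System.out.println(rpcCount(10000, 200));               // prints 50
        System.out.println(worstCaseMillis(10000, 200, 60000)); // prints 3000000
    }
}
```

Because the timeout applies per RPC, a full scan can legitimately take far longer than hbase.client.scanner.timeout.period without ever timing out.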

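Recommended settings

When the network is flaky, in the default worst case a single thread can spend about 8min retrying, and every other thread then blocks on the global lock regionLockObject. To build a more stable, lower-latency HBase system, the client parameters need tuning as well as the server-side ones:

hbase.client.pause: default 100, can be reduced to 50
hbase.client.retries.number: default 31, can be reduced to 21

With these changes, the algorithm above gives the following pauses, in ms, between successive attempts to connect to the cluster:

[50, 100, 150, 250, 500, 1000, 2000, 5000, 5000, 5000, 5000, 10000, 10000, …, 10000]

The client retries 20 times within about 2min and then gives up connecting to the cluster, which releases the global lock to other threads so they can run their requests.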
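
To compare the default and the tuned settings, the pauses can simply be summed. The sketch below is an approximation with hypothetical names: it ignores jitter and the time spent inside each RPC, assumes the backoff index is clamped at the end of the table, and counts one pause between each pair of attempts (so 31 attempts give 30 pauses and 21 attempts give 20).

```java
// Illustrative arithmetic only, not HBase code.
class RetryTimeTotal {
    static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    // Sum of the nominal pauses (ms) over the given number of sleeps, jitter ignored.
    static long totalPauseMillis(long pause, int sleeps) {
        long total = 0;
        for (int i = 0; i < sleeps; i++) {
            total += pause * RETRY_BACKOFF[Math.min(i, RETRY_BACKOFF.length - 1)];
        }
        return total;
    }

    public static void main(String[] args) {
        long defaults = totalPauseMillis(100, 30); // pause=100, retries=31
        long tuned = totalPauseMillis(50, 20);     // pause=50, retries=21
        System.out.println("default: " + defaults + " ms"); // prints default: 428100 ms
        System.out.println("tuned:   " + tuned + " ms");    // prints tuned:   114050 ms
    }
}
```

The sleep time alone is a little over 7min with the defaults and just under 2min with the tuned values; adding the time each failed RPC takes brings the totals in line with the rounded 8min and 2min figures used in this article.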
