Code: https://github.com/sgp2004/JavaTools
The HBase client's row lock has a big impact on reads and writes against the same rowkey: concurrent counter updates on one rowkey within a single process can block (typical scenarios: incrementing clicks on a hot short link, or the comment count of a hot Weibo post).
For example, a production issue. A deleted Weibo renders as:
Repost
Sorry, this Weibo has been deleted by the author. See help: http://t.cn/zWSudZc
| Repost | Favorite | Comment
Every deleted Weibo requires decrementing the reference count of each short link it contained. Because the post body is gone, the only short link left is the help link, so all the decrements piled onto that single help short link and the servers slowed to a crawl.
Key takeaways from reading the row-lock code:
Client side:
1. In the HTable class, the lockRow and unlockRow methods are never called from anywhere. A 0.96 JIRA talks about removing the lock from the client side; it is not clear what that buys beyond deleting dead code.
2. HRegionServer's lockRow method is only invoked from HTable, and in my tests it was never executed.
So when an application just uses the client jar, there appears to be no locking problem on the client side.
Server side:
HRegion generates a lockId itself and blocks other operations on the same row; 0.96 removes the lockId passed from the client, adds MVCC, and optimizes the request path.
In other words, only the explicit lock calls are being removed.
Without further ado, the test code:
@Test
public void testMultiAdd() throws InterruptedException {
    for (int i = 0; i < 100; i++) {
        final int finalI = i;
        new Thread(new Runnable() {
            @Override
            public void run() {
                System.out.println("start thread " + finalI);
                long time = System.currentTimeMillis();
                int loop = 0;
                while (loop++ < 1000) {
                    //commonDao.insert("test1", "f1", "key", "value" + finalI); // same key test1: 1000 loops, 100 threads cost 53s; 10000 loops cost 530s and may time out
                    //commonDao.insert("test1" + finalI + loop, "f1", "key", "value" + finalI); // different keys: 1000 loops, 100 threads cost 14s; 10000 loops cost 113s
                    //commonDao.delete("test1"); // same key test1: 1000 loops, 100 threads cost 53s
                    //commonDao.delete("test1" + finalI + loop); // different keys: 1000 loops, 100 threads cost 13s
                    //commonDao.incr("test2", "f1", "key", 1L); // same key test2: 1000 loops, 100 threads cost 59s
                    //commonDao.incr("test2" + finalI + loop, "f1", "key", 1L); // different keys: 1000 loops, 100 threads cost 15s
                    commonDao.getStrValue("test1", "f1", "key"); // same key test1: 100 loops, 100 threads cost 59s ??? why is it so slow?
                    //commonDao.getStrValue("test1" + finalI + loop, "f1", "key"); // different keys: 1000 loops, 100 threads cost 12s
                }
                System.out.println(finalI + " thread stop, use time: " + (System.currentTimeMillis() - time));
            }
        }).start();
    }
    TimeUnit.DAYS.sleep(3L);
}
commonDao is a thin wrapper around the raw HBase client: it hides the table name and wraps the byte[] conversions for common String/int/long values.
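commonDao itself is not reproduced in this post. A minimal sketch of what such a wrapper could look like on the 0.94-era client; the class below is a hypothetical reconstruction (names and signatures chosen to match the test above), not the actual commonDao, which presumably also swallows or wraps IOException since the test does not declare it:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SimpleHBaseDao {
    private final HTable table;

    // The table name is fixed at construction time, so callers never see it.
    public SimpleHBaseDao(Configuration conf, String tableName) throws IOException {
        this.table = new HTable(conf, tableName);
    }

    public void insert(String row, String family, String qualifier, String value) throws IOException {
        Put put = new Put(Bytes.toBytes(row));
        put.add(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value));
        table.put(put);
    }

    public void delete(String row) throws IOException {
        table.delete(new Delete(Bytes.toBytes(row)));
    }

    // Counter increment; atomic on the server, which is where the row-lock contention shows up.
    public long incr(String row, String family, String qualifier, long amount) throws IOException {
        return table.incrementColumnValue(Bytes.toBytes(row), Bytes.toBytes(family),
                Bytes.toBytes(qualifier), amount);
    }

    public String getStrValue(String row, String family, String qualifier) throws IOException {
        Get get = new Get(Bytes.toBytes(row));
        get.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes(family), Bytes.toBytes(qualifier));
        return value == null ? null : Bytes.toString(value);
    }
}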
Timing results
100 threads, 1000 loops each, time in seconds:
Operation | Single rowkey | Varying rowkeys |
---|---|---|
insert | 53 | 14 |
delete | 53 | 13 |
incr (counter add) | 59 | 15 |
get | 600 | 12 |
Writes against a single rowkey can time out, and get is roughly 10x slower than the other operations. Also, the slow get numbers only appear when the test is run after a delete; run right after the inserts, it finishes in about 10 seconds.
HBASE-7263 (https://issues.apache.org/jira/browse/HBASE-7263) describes HBase's read/update flow:
(1) Acquire RowLock
(1a) BeginMVCC + Finish MVCC
(2) Begin MVCC
(3) Do work
(4) Release RowLock
(5) Append to WAL
(6) Finish MVCC
Write-only operations (e.g. puts) follow the same path, minus step 1a.
Question: what exactly distinguishes an update from a write here? (Per the JIRA, "read/updates" means increment, append and checkAnd* operations, which have to read the current value before writing.)
Remove explicit RowLocks in 0.96
I. insert analysis
Start with insert. The key steps are in HConnectionManager's processBatchCallback method, which loops for up to the configured number of retries.
1. Locate the region for the row: HRegionLocation loc = locateRegion(tableName, row.getRow());
Step 1: locateRegion first synchronizes on regionLockObject. The comment there reads: "This block guards against two threads trying to load the meta region at the same time. The first will load the meta region and the second will use the value that the first one found."
Step 2: build a meta key:
byte[] metaKey = HRegionInfo.createRegionName(tableName, row, HConstants.NINES, false);
Step 3: query for the region that holds the meta key:
// Query the root or meta region for the location of the meta region
regionInfoRow = server.getClosestRowBefore(metaLocation.getRegionInfo().getRegionName(), metaKey, HConstants.CATALOG_FAMILY);
The returned regionInfoRow is a Result; printed as key-values it looks like:
keyvalues={.META.,,1/info:regioninfo/1353046230286/Put/vlen=34/ts=0, .META.,,1/info:server/1353046237800/Put/vlen=40/ts=0, .META.,,1/info:serverstartcode/1353046237800/Put/vlen=8/ts=0, .META.,,1/info:v/1353046230286/Put/vlen=2/ts=0}
The server cell is then converted into the region server's ip and port:
value = regionInfoRow.getValue(HConstants.CATALOG_FAMILY, HConstants.SERVER_QUALIFIER);
ipAndPort: 75-25-171-yf-core.jpool.sinaimg.cn:60020
That gives us the address of the region server that serves the row.
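As an aside, the meta key built in step 2 is just a region-name string of the form "table,row,99999999999999" (HConstants.NINES), which is why getClosestRowBefore on .META. lands on the row of the region containing our rowkey. A tiny standalone sketch, with made-up table and row values:

import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaKeyDemo {
    public static void main(String[] args) {
        // Same call as in locateRegion, but with hypothetical table "test1" and row "abc".
        byte[] metaKey = HRegionInfo.createRegionName(
                Bytes.toBytes("test1"), Bytes.toBytes("abc"), HConstants.NINES, false);
        // Prints "test1,abc,99999999999999"; getClosestRowBefore on .META. with this key
        // returns the meta row of the region whose start key is closest below "abc".
        System.out.println(Bytes.toString(metaKey));
    }
}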
Back inside locateRegion, as called from processBatchCallback:
    if (useCache) {
      location = getCachedLocation(tableName, row);
      if (location != null) {
        return location;
      }
    }
On subsequent lookups the location is served from this cache, so the locking described above no longer applies. The second call therefore returns to processBatchCallback, which continues:
1. Build the actions:
    Action<R> action = new Action<R>(row, i);
    lastServers[i] = loc;
    actions.add(regionName, action);
2. Send the requests:
    Map<HRegionLocation, Future<MultiResponse>> futures =
        new HashMap<HRegionLocation, Future<MultiResponse>>(actionsByServer.size());
    for (Entry<HRegionLocation, MultiAction<R>> e : actionsByServer.entrySet()) {
      futures.put(e.getKey(), pool.submit(createCallable(e.getKey(), e.getValue(), tableName)));
    }
3. Collect the results.
No use of RowLock shows up anywhere along this path.
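Seen from application code, the path above is what an ordinary (batched) put goes through. A minimal usage sketch against the 0.94 API; the table name "test1" and family "f1" are placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test1");
        List<Put> puts = new ArrayList<Put>();
        for (int i = 0; i < 10; i++) {
            Put put = new Put(Bytes.toBytes("row" + i));
            put.add(Bytes.toBytes("f1"), Bytes.toBytes("key"), Bytes.toBytes("value" + i));
            puts.add(put);
        }
        // put(List) feeds processBatchCallback: locate each row's region (cached after
        // the first lookup), group the actions by region server, then submit the futures.
        table.put(puts);
        table.flushCommits();
        table.close();
    }
}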
HTablePool code:
class PooledHTable implements HTableInterface {
  private HTableInterface table; // actual table implementation
  ...
  @Override
  public RowLock lockRow(byte[] row) throws IOException {
    return table.lockRow(row);
  }
  ...
}
Searching for lockRow only turns up the call in "return table.lockRow(row);", and searching for HTable's lockRow likewise only finds it used from PooledHTable, never from outside the client. Puzzling.
That lockRow in turn calls HRegionServer's lockRow. Both methods are removed in "1. Remove rowlocks as a client side API" (https://issues.apache.org/jira/browse/HBASE-7315). Commenting this code out in tests causes no errors, and no debug output is printed, confirming it is never reached.
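For completeness, explicit row locks were only ever taken when application code called this API directly; nothing inside the client does it for you. A hedged sketch of what that looked like on the 0.94 API (table name and rowkey are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RowLock;
import org.apache.hadoop.hbase.util.Bytes;

public class ExplicitRowLockDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test1");
        byte[] row = Bytes.toBytes("hotRow");
        // The only code path that reaches HTable.lockRow / HRegionServer.lockRow.
        RowLock lock = table.lockRow(row);
        try {
            Put put = new Put(row, lock);  // carries lock.getLockId() to the server
            put.add(Bytes.toBytes("f1"), Bytes.toBytes("key"), Bytes.toBytes("value"));
            table.put(put);
        } finally {
            table.unlockRow(lock);         // forgetting this blocks other writers to the row
        }
        table.close();
    }
}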
The client-side Put constructor:
public Put(byte[] row, long ts, RowLock rowLock) {
  if (row == null || row.length > HConstants.MAX_ROW_LENGTH) {
    throw new IllegalArgumentException("Row key is invalid");
  }
  this.row = Arrays.copyOf(row, row.length);
  this.ts = ts;
  if (rowLock != null) {
    this.lockId = rowLock.getLockId();
  }
}
Removing the rowLock.getLockId() assignment has no effect either.
So the client side takes no lock at all: it only records a lockId, and only when a RowLock is explicitly passed in. Our calling code also builds Get objects without locks by default. 0.96 plans to strip out this dead code.
Does setting a lockId on the client do anything at all, then?
Server-side code, HRegion.java:
public Integer getLock(Integer lockid, byte[] row, boolean waitForLock)
    throws IOException {
  Integer lid = null;
  if (lockid == null) {
    lid = internalObtainRowLock(row, waitForLock);
  } else {
    if (!isRowLocked(lockid)) {
      throw new IOException("Invalid row lock");
    }
    lid = lockid;
  }
  return lid;
}
A lockid passed in from the client must already be registered in the server-side lockIds map. When null is passed, the server generates an id itself and stores it; a client-supplied lockid has no code path that ever put it into lockIds, so the server throws. Verified by experiment with:
Get get = new Get(row, new RowLock(row, 1L));
Put put = new Put(Bytes.toBytes(rowkey), new RowLock(Bytes.toBytes(rowkey), 1L));
Both indeed throw, which makes these public constructors quite a trap. The exception is not raised in HRegion, though, but in HRegionServer. So when would a lockid ever be initialized on the server side?
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.UnknownRowLockException: Invalid row lock
at org.apache.hadoop.hbase.regionserver.HRegionServer.getLockFromId(HRegionServer.java:2349)
at org.apache.hadoop.hbase.regionserver.HRegionServer.delete(HRegionServer.java:2259)
at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1326)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1021)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
at $Proxy6.delete(Unknown Source)
at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:714)
at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:712)
at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:163)
HRegionServer.java:
Integer getLockFromId(long lockId) throws IOException {
  if (lockId == -1L) {
    return null;
  }
  String lockName = String.valueOf(lockId);
  Integer rl = rowlocks.get(lockName);
  if (rl == null) {
    throw new UnknownRowLockException("Invalid row lock");
  }
  this.leases.renewLease(lockName);
  return rl;
}
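Putting the two sides together: a default Put (or Get) carries lockId -1, so getLockFromId returns null and HRegion.getLock falls through to internalObtainRowLock, meaning the server takes and releases its own row lock for every mutation. A small sketch of that default path; table name, family and rowkey are placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DefaultLockIdDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test1");
        Put put = new Put(Bytes.toBytes("hotRow"));     // no RowLock passed in
        put.add(Bytes.toBytes("f1"), Bytes.toBytes("key"), Bytes.toBytes("v"));
        System.out.println(put.getLockId());            // -1: getLockFromId(-1) returns null
        table.put(put);  // HRegion then calls internalObtainRowLock(row) itself for this mutation
        table.flushCommits();
        table.close();
    }
}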
Next up: the server-side HRegion.put method (to be continued).
HBase 0.96 changes a great deal: RPC calls go through the hbase-protocol module, and the lock handling is rewritten there.
Over in HBASE-7263 there has been some discussion about removing support
for explicit RowLocks in 0.96. This would involve the following:
- Remove lockRow/unlockRow functions in HTable and similar; replace instances of RowLock with NullType.
- Remove constructors for Put/Delete/Increment/Get that take RowLocks
- functions in HRegion no longer take lockIds (checkAndPut, append,
increment, etc). This would affect coprocessors that call directly into
those functions.
1. Remove rowlocks as a client side API (https://issues.apache.org/jira/browse/HBASE-7315)
2. Remove rowlocks from server code and replace it with better mechanism (https://issues.apache.org/jira/browse/HBASE-7263 )
The reasoning is as follows:
1) RowLocks are broken
They are only kept in the memory associated with the region, so on a
split, region move, RS crash, they just disappear
2) 0.96 is special
Now seems like a good time to clean things up since we've made some
incompatible changes already (e.g. protobufing) and we could have a cleaner
client implementation
3) RowLocks have been deprecated "in spirit" for awhile
Here's a post from 2009 cautioning against their use:
http://bb10.com/java-hadoop-hbase-user/2009-09/msg00239.html
and a more recent example:
http://permalink.gmane.org/gmane.comp.java.hadoop.hbase.user/23488
4) RowLocks are hard to use effectively
Clients can deadlock or starve themselves, either by forgetting to release
the RowLocks or by starving other non-contending row operations by
occupying server handlers stuck waiting to acquire the locks.
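To illustrate point 4, here is a hedged sketch (old 0.94 API, made-up table and rowkeys) of how two clients that take explicit row locks in opposite order can tie up server handlers until the lock wait or the RPC times out:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.RowLock;
import org.apache.hadoop.hbase.util.Bytes;

public class RowLockDeadlockDemo {
    // Lock two rows in the given order; with another thread locking them in the
    // opposite order, both can end up parked in lockRow, occupying server handlers.
    static void lockBoth(Configuration conf, String first, String second) throws IOException {
        HTable table = new HTable(conf, "test1");
        RowLock a = table.lockRow(Bytes.toBytes(first));
        try {
            RowLock b = table.lockRow(Bytes.toBytes(second)); // may block until a timeout fires
            table.unlockRow(b);
        } finally {
            table.unlockRow(a);
            table.close();
        }
    }

    public static void main(String[] args) {
        final Configuration conf = HBaseConfiguration.create();
        new Thread(new Runnable() {
            public void run() {
                try { lockBoth(conf, "rowA", "rowB"); } catch (IOException e) { e.printStackTrace(); }
            }
        }).start();
        new Thread(new Runnable() {
            public void run() {
                try { lockBoth(conf, "rowB", "rowA"); } catch (IOException e) { e.printStackTrace(); }
            }
        }).start();
    }
}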