Hands-On: Analyzing the HBase ReversedScan Problem from the Source Code

Posted 2021-04-19 金融电子化

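Our company recently launched a new HBase-based application that issues a large number of scans. After running for a while, the application began to show intermittent periods in which scans were extremely slow. The company runs many other HBase-based applications, yet only the new one was ever affected. The one difference from the earlier applications is that the new one uses ReversedScan, so could ReversedScan be the cause? We added jstack monitoring to the application and captured thread dumps while slow scans were occurring, which gave us a first clue: a large number of threads were in the BLOCKED state.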

java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1319)
    - waiting to lock <0x00000000876a58d8> (a java.lang.Object)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1177)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:294)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:130)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:55)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:201)
    at org.apache.hadoop.hbase.client.ReversedClientScanner.nextScanner(ReversedClientScanner.java:124)
    at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:140)
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:135)
    at org.apache.hadoop.hbase.client.ReversedClientScanner.<init>(ReversedClientScanner.java:62)
    at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:791)
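
Spotting this pattern by eye is tedious when a dump contains hundreds of threads. The following is a minimal sketch, assuming the dump is available as text in the usual jstack layout, that counts BLOCKED threads by their top stack frame. The names BlockedThreadSummary and countBlockedByTopFrame are illustrative only and are not part of any HBase or JDK tool.

```java
import java.util.Map;
import java.util.TreeMap;

class BlockedThreadSummary {

    /** Counts BLOCKED threads in a jstack dump, grouped by their top stack frame. */
    static Map<String, Integer> countBlockedByTopFrame(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean blocked = false;
        for (String raw : dump.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("java.lang.Thread.State:")) {
                // A new thread section starts; remember whether it is BLOCKED.
                blocked = line.contains("BLOCKED");
            } else if (blocked && line.startsWith("at ")) {
                // The first "at" line after the state line is the top frame.
                counts.merge(line.substring(3), 1, Integer::sum);
                blocked = false; // ignore the remaining frames of this thread
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Tiny made-up dump used only to exercise the method.
        String dump =
            "\"worker-1\"\n"
            + "   java.lang.Thread.State: BLOCKED (on object monitor)\n"
            + "\tat demo.Locator.locate(Locator.java:10)\n"
            + "\tat demo.Scanner.next(Scanner.java:20)\n"
            + "\n"
            + "\"worker-2\"\n"
            + "   java.lang.Thread.State: RUNNABLE\n"
            + "\tat demo.Scanner.next(Scanner.java:25)\n";
        countBlockedByTopFrame(dump)
            .forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

If most of the BLOCKED threads share one top frame, as they do with locateRegionInMeta in the dump above, that frame is the place to start reading the source.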

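Comparing against the HBase monitoring logs, CPU usage on the regionserver hosting the meta table became very high at the same moments.

So it seems that while hbase-client was querying, the CPU spike on the regionserver hosting meta slowed down region location lookups. But why did that regionserver's CPU spike, and why did all the other applications stay healthy?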

Investigation

hbase-client caches RegionLocation on the client side, so in theory it should rarely need to query a regionserver for a region location.
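
To make the caching idea concrete, here is a simplified sketch of a client-side location cache keyed by region start key. It is not HBase's actual implementation: row keys are plain strings, the meta table is an in-memory map, and the names RegionLocationCacheSketch, Region, locate and metaLookups are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

class RegionLocationCacheSketch {

    /** A region covering [startKey, endKey); a null endKey means no upper bound. */
    static final class Region {
        final String startKey;
        final String endKey;
        final String server;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }

        boolean contains(String row) {
            return row.compareTo(startKey) >= 0
                && (endKey == null || row.compareTo(endKey) < 0);
        }
    }

    // Stand-in for the remote hbase:meta table.
    private final TreeMap<String, Region> meta = new TreeMap<>();
    // Client-side cache, sorted by region start key.
    private final ConcurrentSkipListMap<String, Region> cache = new ConcurrentSkipListMap<>();
    int metaLookups = 0;

    RegionLocationCacheSketch() {
        meta.put("", new Region("", "g", "rs1"));
        meta.put("g", new Region("g", "p", "rs2"));
        meta.put("p", new Region("p", null, "rs3"));
    }

    /** Returns the server for a row, consulting the cache only when useCache is true. */
    String locate(String row, boolean useCache) {
        if (useCache) {
            Map.Entry<String, Region> hit = cache.floorEntry(row);
            if (hit != null && hit.getValue().contains(row)) {
                return hit.getValue().server;
            }
        }
        metaLookups++; // in a real client this would be an RPC to the meta regionserver
        Region region = meta.floorEntry(row).getValue();
        cache.put(region.startKey, region);
        return region.server;
    }

    public static void main(String[] args) {
        RegionLocationCacheSketch client = new RegionLocationCacheSketch();
        client.locate("h", true);  // cache miss: one meta lookup
        client.locate("k", true);  // same region ["g", "p"): served from the cache
        client.locate("k", false); // cache bypassed: a second meta lookup
        System.out.println("meta lookups: " + client.metaLookups); // prints "meta lookups: 2"
    }
}
```

As long as callers pass useCache = true, the meta table is queried once per region rather than once per request, which is why meta traffic is normally low.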

We went through the source code to trace exactly what happens during a ReversedScan.

ReversedClientScanner.java

class ReversedClientScanner extends ClientScanner {
  ......
  @Override
  protected boolean nextScanner(int nbRows, final boolean done)
      throws IOException {
    ......
    try {
      byte[] locateStartRow = locateTheClosestFrontRow ? createClosestRowBefore(localStartKey)
          : null;
      callable = getScannerCallable(localStartKey, nbRows, locateStartRow);
      // Open a scanner on the region server starting at the
      // beginning of the region
      this.caller.callWithRetries(callable);
      this.currentRegion = callable.getHRegionInfo();
      if (this.scanMetrics != null) {
        this.scanMetrics.countOfRegions.incrementAndGet();
      }
    } catch (IOException e) {
      ExceptionUtil.rethrowIfInterrupt(e);
      close();
      throw e;
    }
    ......
  }

  protected ScannerCallable getScannerCallable(byte[] localStartKey,
      int nbRows, byte[] locateStartRow) {
    scan.setStartRow(localStartKey);
    ScannerCallable s =
        new ReversedScannerCallable(getConnection(), getTable(), scan, this.scanMetrics,
            locateStartRow, rpcControllerFactory.newController());
    s.setCaching(nbRows);
    return s;
  }

A reversed scan calls the ReversedClientScanner.nextScanner method.

In nextScanner, ReversedClientScanner creates an RpcRetryingCaller to invoke a ReversedScannerCallable.

RpcRetryingCaller.callWithRetries calls the prepare method and then the call method of ReversedScannerCallable, in that order.

RpcRetryingCaller.java

public class RpcRetryingCaller<T> {
  ......
  public synchronized T callWithRetries(RetryingCallable<T> callable, int callTimeout)
      throws IOException, RuntimeException {
    this.callTimeout = callTimeout;
    List<RetriesExhaustedException.ThrowableWithExtraContext> exceptions =
        new ArrayList<RetriesExhaustedException.ThrowableWithExtraContext>();
    this.globalStartTime = EnvironmentEdgeManager.currentTimeMillis();
    for (int tries = 0;; tries++) {
      long expectedSleep = 0;
      try {
        beforeCall();
        callable.prepare(tries != 0); // if called with false, check table status on ZK
        return callable.call();
      } catch (Throwable t) {
        if (tries > startLogErrorsCnt) {
          LOG.info("Call exception, tries=" + tries + ", retries=" + retries + ", retryTime=" +
              (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime) + "ms, msg="
              + callable.getExceptionMessageAdditionalDetail());
        }
        ......
}

The key to the problem lies in the prepare method of ReversedScannerCallable.

ReversedScannerCallable.java

public class ReversedScannerCallable extends ScannerCallable {
  ......
  @Override
  public void prepare(boolean reload) throws IOException {
    if (!instantiated || reload) {
      if (locateStartRow == null) {
        // Just locate the region with the row
        this.location = connection.getRegionLocation(tableName, row, reload);
        if (this.location == null) {
          throw new IOException("Failed to find location, tableName="
              + tableName + ", row=" + Bytes.toStringBinary(row) + ", reload="
              + reload);
        }
      } else {
        // Need to locate the regions with the range, and the target location is
        // the last one which is the previous region of last region scanner
        List<HRegionLocation> locatedRegions = locateRegionsInRange(
            locateStartRow, row, reload);
        if (locatedRegions.isEmpty()) {
          throw new DoNotRetryIOException(
              "Does hbase:meta exist hole? Couldn't get regions for the range from "
                  + Bytes.toStringBinary(locateStartRow) + " to "
                  + Bytes.toStringBinary(row));
        }
        this.location = locatedRegions.get(locatedRegions.size() - 1);
      }
      setStub(getConnection().getClient(getLocation().getServerName()));
      checkIfRegionServerIsRemote();
      instantiated = true;
    }
    ......
  }

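The prepare method of ReversedScannerCallable locates the region, and the value it passes as useCache is reload.

Now go back to the callWithRetries method of RpcRetryingCaller.

callWithRetries passes false on the first attempt and true on any later attempt made after a failure.

As a result, during a reversed scan useCache is false in the vast majority of cases when the region is located! The cache never takes effect, and every scan has to query the meta table to find the regionserver that holds the scan's rowkey. That is why CPU usage on the regionserver hosting meta was so high.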

ScannerCallable.java

public class ScannerCallable extends RegionServerCallable<Result[]> {
  ......
  @Override
  public void prepare(boolean reload) throws IOException {
    if (Thread.interrupted()) {
      throw new InterruptedIOException();
    }
    RegionLocations rl = RpcRetryingCallerWithReadReplicas.getRegionLocations(!reload,
        id, getConnection(), getTableName(), getRow());
    location = id < rl.size() ? rl.getRegionLocation(id) : null;
    ......
  }

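By contrast, the prepare method of ScannerCallable passes !reload as useCache, so the other applications were not significantly affected when CPU usage on the regionserver hosting meta was high.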
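
The effect of the two prepare variants can be reproduced with a small stand-alone simulation. This is a sketch under stated assumptions, not HBase code: Locator, PreparableCall and metaLookupsAfter are hypothetical names, the cache is a plain map keyed by row, and every scan is assumed to succeed on its first attempt, so the retry loop always calls prepare(false).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class UseCacheFlagSimulation {

    /** Hypothetical locator: uses a local cache when useCache is true, otherwise asks "meta". */
    static final class Locator {
        private final Map<String, String> cache = new HashMap<>();
        int metaLookups = 0;

        String locate(String row, boolean useCache) {
            if (useCache && cache.containsKey(row)) {
                return cache.get(row);
            }
            metaLookups++; // stands in for an RPC to the regionserver hosting meta
            String server = "rs-for-" + row;
            cache.put(row, server);
            return server;
        }
    }

    /** Minimal stand-in for the prepare step of a scanner callable. */
    interface PreparableCall {
        void prepare(boolean reload);
    }

    /** Runs the given number of scans and returns how many meta lookups they caused. */
    static int metaLookupsAfter(int scans, Function<Locator, PreparableCall> factory) {
        Locator locator = new Locator();
        for (int i = 0; i < scans; i++) {
            int tries = 0; // first attempt, as in callable.prepare(tries != 0)
            factory.apply(locator).prepare(tries != 0);
        }
        return locator.metaLookups;
    }

    public static void main(String[] args) {
        // Variant 1: reload is handed over where useCache is expected.
        int reversed = metaLookupsAfter(1000, l -> reload -> l.locate("row-1", reload));
        // Variant 2: the flag is negated first, as in ScannerCallable.
        int forward = metaLookupsAfter(1000, l -> reload -> l.locate("row-1", !reload));
        System.out.println("reload passed as useCache:  " + reversed + " meta lookups");
        System.out.println("!reload passed as useCache: " + forward + " meta lookups");
    }
}
```

In this model the first variant queries meta on every one of the 1000 scans, while the second queries it once and then serves the remaining scans from the cache.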

Resolution

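Looking through the change history of ReversedScannerCallable on GitHub, we found that HBASE-18665 fixed this problem, but the fix had not been applied to the code on the 0.98 branch. So we had to fix it ourselves.

To minimize the risk introduced by rebuilding, we recompiled only ReversedScannerCallable and packed the resulting class file into hbase-client.jar. After hbase-client was replaced, everything worked normally in the test environment, and monitoring of calls to the meta table showed a clear drop in call volume.

After the change was deployed to production, slow scans no longer occurred there either.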

金融电子化 New Media Department: Director / 邝源; Editor / 潘婧
