Hands-On: Analyzing the HBase ReversedScan Problem from the Source Code
Posted by 金融电子化
Recently, our company launched a new HBase-based application that performs a large volume of HBase scans. After it had been running for a while, scans intermittently became extremely slow. The company runs plenty of other HBase-based applications, yet each time only the new one was affected. Its only notable difference from earlier applications was its use of ReversedScan; could ReversedScan be the cause? We added jstack monitoring to the application and captured a dump while a slow scan was underway, which offered a first clue: a large number of threads were blocked.
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1319)
- waiting to lock <0x00000000876a58d8> (a java.lang.Object)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1177)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:294)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:130)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:55)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:201)
at org.apache.hadoop.hbase.client.ReversedClientScanner.nextScanner(ReversedClientScanner.java:124)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:140)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:135)
at org.apache.hadoop.hbase.client.ReversedClientScanner.<init>(ReversedClientScanner.java:62)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:791)
Cross-referencing the HBase monitoring logs showed that, at those same moments, CPU usage on the regionserver hosting hbase:meta climbed very high.
So when hbase-client issues queries, the meta regionserver's CPU spikes and region location lookups slow down. But why does that CPU spike at all? And why do all the other applications stay healthy?
Investigating the Problem
hbase-client caches RegionLocation entries on the client side, so in theory the number of lookups that actually have to go to a regionserver should be small.
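To make the cache semantics concrete, here is a minimal sketch against the 0.98-era HConnection API (the table name and row key are made up for illustration). Passing reload = false lets the lookup be answered from the client-side location cache when possible; reload = true forces a fresh round trip to hbase:meta.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLocationCacheDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HConnection connection = HConnectionManager.createConnection(conf);
    TableName table = TableName.valueOf("demo_table"); // hypothetical table
    byte[] row = Bytes.toBytes("row-0001");            // hypothetical row key
    // reload = false: answered from the client-side location cache when possible
    HRegionLocation cached = connection.getRegionLocation(table, row, false);
    // reload = true: bypasses the cache and queries hbase:meta again
    HRegionLocation fresh = connection.getRegionLocation(table, row, true);
    System.out.println(cached + " / " + fresh);
    connection.close();
  }
}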
Let's read through the source code and trace exactly what happens during a ReversedScan.
ReversedClientScanner.java
class ReversedClientScanner extends ClientScanner {
  ......
  @Override
  protected boolean nextScanner(int nbRows, final boolean done)
      throws IOException {
    ......
    try {
      byte[] locateStartRow = locateTheClosestFrontRow ? createClosestRowBefore(localStartKey)
          : null;
      callable = getScannerCallable(localStartKey, nbRows, locateStartRow);
      // Open a scanner on the region server starting at the
      // beginning of the region
      this.caller.callWithRetries(callable);
      this.currentRegion = callable.getHRegionInfo();
      if (this.scanMetrics != null) {
        this.scanMetrics.countOfRegions.incrementAndGet();
      }
    } catch (IOException e) {
      ExceptionUtil.rethrowIfInterrupt(e);
      close();
      throw e;
    }
    ......
  }

  protected ScannerCallable getScannerCallable(byte[] localStartKey,
      int nbRows, byte[] locateStartRow) {
    scan.setStartRow(localStartKey);
    ScannerCallable s =
        new ReversedScannerCallable(getConnection(), getTable(), scan, this.scanMetrics,
            locateStartRow, rpcControllerFactory.newController());
    s.setCaching(nbRows);
    return s;
  }
}
During a reversed scan, ReversedClientScanner.nextScanner is invoked.
In nextScanner, the ReversedClientScanner creates an RpcRetryingCaller to drive a ReversedScannerCallable.
RpcRetryingCaller.callWithRetries then invokes the ReversedScannerCallable's prepare method followed by its call method.
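For reference, this is the kind of client code that enters the path above; the HTable.getScanner call is what constructs the ReversedClientScanner (a minimal sketch against the 0.98-era API, with a made-up table name and row key):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ReversedScanDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (HTable table = new HTable(conf, "demo_table")) {
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes("row-9999")); // reversed scan walks backwards from here
      scan.setReversed(true); // triggers the ReversedClientScanner path above
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toStringBinary(r.getRow()));
        }
      }
    }
  }
}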
RpcRetryingCaller.java
public class RpcRetryingCaller<T> {
  ......
  public synchronized T callWithRetries(RetryingCallable<T> callable, int callTimeout)
      throws IOException, RuntimeException {
    this.callTimeout = callTimeout;
    List<RetriesExhaustedException.ThrowableWithExtraContext> exceptions =
        new ArrayList<RetriesExhaustedException.ThrowableWithExtraContext>();
    this.globalStartTime = EnvironmentEdgeManager.currentTimeMillis();
    for (int tries = 0;; tries++) {
      long expectedSleep = 0;
      try {
        beforeCall();
        callable.prepare(tries != 0); // if called with false, check table status on ZK
        return callable.call();
      } catch (Throwable t) {
        if (tries > startLogErrorsCnt) {
          LOG.info("Call exception, tries=" + tries + ", retries=" + retries + ", retryTime=" +
              (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime) + "ms, msg="
              + callable.getExceptionMessageAdditionalDetail());
        }
        ......
      }
    }
  }
}
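The flag handed to prepare is worth pausing on. A self-contained toy (not HBase code) mimicking how callWithRetries derives it:

public class RetryFlagDemo {
  public static void main(String[] args) {
    for (int tries = 0; tries < 3; tries++) {
      boolean reload = (tries != 0); // false on the first attempt, true on retries
      System.out.println("attempt " + tries + " -> prepare(reload=" + reload + ")");
    }
  }
}

In the common case a scan succeeds on the first attempt, so prepare almost always receives reload = false.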
The key to the problem lies in ReversedScannerCallable's prepare method.
ReversedScannerCallable.java
public class ReversedScannerCallable extends ScannerCallable {
  ......
  @Override
  public void prepare(boolean reload) throws IOException {
    if (!instantiated || reload) {
      if (locateStartRow == null) {
        // Just locate the region with the row
        this.location = connection.getRegionLocation(tableName, row, reload);
        if (this.location == null) {
          throw new IOException("Failed to find location, tableName="
              + tableName + ", row=" + Bytes.toStringBinary(row) + ", reload="
              + reload);
        }
      } else {
        // Need to locate the regions with the range, and the target location is
        // the last one which is the previous region of last region scanner
        List<HRegionLocation> locatedRegions = locateRegionsInRange(
            locateStartRow, row, reload);
        if (locatedRegions.isEmpty()) {
          throw new DoNotRetryIOException(
              "Does hbase:meta exist hole? Couldn't get regions for the range from "
                  + Bytes.toStringBinary(locateStartRow) + " to "
                  + Bytes.toStringBinary(row));
        }
        this.location = locatedRegions.get(locatedRegions.size() - 1);
      }
      setStub(getConnection().getClient(getLocation().getServerName()));
      checkIfRegionServerIsRemote();
      instantiated = true;
    }
    ......
  }
}
In ReversedScannerCallable's prepare method, the region location is resolved, and the useCache flag ends up taking the value of reload.
Now go back to RpcRetryingCaller's callWithRetries method.
callWithRetries passes false on the first attempt and true only on retries after a failure.
As a result, during a reversed scan useCache is false for the region location lookup in the vast majority of cases! The cache never takes effect: every scan has to query the hbase:meta table to find the regionserver hosting the scan's rowkey, which is exactly why the CPU on the regionserver hosting meta runs so hot.
ScannerCallable.java
public class ScannerCallable extends RegionServerCallable<Result[]> {
  ......
  @Override
  public void prepare(boolean reload) throws IOException {
    if (Thread.interrupted()) {
      throw new InterruptedIOException();
    }
    RegionLocations rl = RpcRetryingCallerWithReadReplicas.getRegionLocations(!reload,
        id, getConnection(), getTableName(), getRow());
    location = id < rl.size() ? rl.getRegionLocation(id) : null;
    ......
  }
}
Compare ScannerCallable: its prepare method derives useCache as !reload, so ordinary forward scans are served from the cache. That is why the other applications felt little impact even while the CPU of the regionserver hosting meta was running high.
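The asymmetry is easy to see side by side in a self-contained toy (hypothetical names, not HBase code):

public class CacheFlagDemo {
  // Stand-in for a lookup that either consults the client-side cache or
  // makes an RPC to the hbase:meta table.
  static String locate(boolean useCache) {
    return useCache ? "client cache" : "hbase:meta RPC";
  }

  public static void main(String[] args) {
    boolean reload = false; // the first attempt in callWithRetries
    System.out.println("ReversedScannerCallable (useCache = reload):  " + locate(reload));
    System.out.println("ScannerCallable         (useCache = !reload): " + locate(!reload));
  }
}

On the first attempt, the reversed-scan path goes straight to hbase:meta while the forward-scan path is served from the cache.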
Fixing the Problem
A search of ReversedScannerCallable's change history on GitHub turned up HBASE-18665, which fixes exactly this issue, but the fix was never applied to the 0.98 branch. So we had to do it ourselves.
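The essence of the change is a one-line inversion. This is a sketch based on the analysis above and on ScannerCallable's handling of the flag; the actual HBASE-18665 patch may differ in form:

// Before: the useCache flag took the value of reload, so the first attempt
// (reload == false) bypassed the client-side location cache:
//   boolean useCache = reload;
// After: mirror ScannerCallable and invert the flag, so the first attempt is
// served from the cache and only retries force a fresh hbase:meta lookup:
boolean useCache = !reload;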
To minimize the risk of recompiling the whole project, we recompiled ReversedScannerCallable on its own and patched the resulting class file into hbase-client.jar. With the patched hbase-client in place, everything ran normally in the test environment, and the monitoring on hbase:meta showed a clear drop in call volume.
Once the change reached production, the slow scans never recurred.