Incident: HBase Service Failure

Posted by 一木呈


1 Problem

        The HBase service is unusable: the HMaster and the RegionServers are both failing;

        the logs keep growing.

2 Exception logs

2.1 Key HMaster exception logs

2018-05-25 10:19:12,737 DEBUG [hadoop001:60000.activeMasterManager] wal.WALProcedureStore: Opening state-log: FileStatus{path=hdfs://beh/hbase/MasterProcWALs/state-00000000000000036689.log; isDirectory=false; length=45760804; replication=3; blocksize=536870912; modification_time=1527123981127; access_time=1527165673882; owner=hadoop; group=hadoop; permission=rw-rw-r--; isSymlink=false}

2018-05-25 10:19:12,742 INFO  [hadoop001:60000.activeMasterManager] util.FSHDFSUtils: Recover lease on dfs file hdfs://beh/hbase/MasterProcWALs/state-00000000000000036690.log

2018-05-25 10:19:12,742 INFO  [hadoop001:60000.activeMasterManager] util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://beh/hbase/MasterProcWALs/state-00000000000000036690.log after 0ms

2018-05-25 10:19:12,742 DEBUG [hadoop001:60000.activeMasterManager] wal.WALProcedureStore: Opening state-log: FileStatus{path=hdfs://beh/hbase/MasterProcWALs/state-00000000000000036690.log; isDirectory=false; length=45761668; replication=3; blocksize=536870912; modification_time=1527123982242; access_time=1527165673883; owner=hadoop; group=hadoop; permission=rw-rw-r--; isSymlink=false}

2018-05-25 10:19:12,767 INFO  [hadoop001:60000.activeMasterManager] util.FSHDFSUtils: Recover lease on dfs file hdfs://beh/hbase/MasterProcWALs/state-00000000000000036691.log

2018-05-25 10:19:12,768 INFO  [hadoop001:60000.activeMasterManager] util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://beh/hbase/MasterProcWALs/state-00000000000000036691.log after 1ms

.

.

.

2018-05-25 10:29:29,656 DEBUG [B.defaultRpcServer.handler=31,queue=13,port=60000] ipc.RpcServer: B.defaultRpcServer.handler=31,queue=13,port=60000: callId: 301 service: RegionServerStatusService methodName: RegionServerStartup size: 46 connection: 172.33.2.22:38698

org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet

    at org.apache.hadoop.hbase.master.HMaster.checkServiceStarted(HMaster.java:2296)

    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:361)

    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8615)

    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)

    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)

    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)

    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)

    at java.lang.Thread.run(Thread.java:745)

2.2 Key RegionServer exception logs

2018-05-25 10:31:14,446 WARN  [regionserver/hadoop003/...] regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2018-05-25 10:31:17,446 INFO  [regionserver/hadoop003/...] regionserver.HRegionServer: reportForDuty to master=hadoop001,60000,1527214458906 with port=60025, startcode=1527214459823
2018-05-25 10:31:17,447 DEBUG [regionserver/hadoop003/...] regionserver.HRegionServer: Master is not running yet
2018-05-25 10:31:17,447 WARN  [regionserver/hadoop003/...] regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2018-05-25 10:31:20,447 INFO  [regionserver/hadoop003/...] regionserver.HRegionServer: reportForDuty to master=hadoop001,60000,1527214458906 with port=60025, startcode=1527214459823
2018-05-25 10:31:20,448 DEBUG [regionserver/hadoop003/...] regionserver.HRegionServer: Master is not running yet
2018-05-25 10:31:20,448 WARN  [regionserver/hadoop003/...] regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2018-05-25 10:31:23,448 INFO  [regionserver/hadoop003/...] regionserver.HRegionServer: reportForDuty to master=hadoop001,60000,1527214458906 with port=60025, startcode=1527214459823
2018-05-25 10:31:23,449 DEBUG [regionserver/hadoop003/...] regionserver.HRegionServer: Master is not running yet

2.3 Key DataNode exception logs

2018-05-25 11:04:20,540 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Likely the client has stopped reading, disconnecting it (hadoop028:50010:DataXceiver error processing READ_BLOCK operation  src: /172.33.2.17:39882 dst: /172.33.2.44:50010); java.net.SocketTimeoutException: 600000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.33.2.44:50010 remote=/172.33.2.17:39882]

2018-05-25 11:04:20,652 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Likely the client has stopped reading, disconnecting it (hadoop028:50010:DataXceiver error processing READ_BLOCK operation  src: /172.33.2.17:39930 dst: /172.33.2.44:50010); java.net.SocketTimeoutException: 600000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.33.2.44:50010 remote=/172.33.2.17:39930]

2018-05-25 11:04:21,088 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Likely the client has stopped reading, disconnecting it (hadoop028:50010:DataXceiver error processing READ_BLOCK operation  src: /172.33.2.17:40038 dst: /172.33.2.44:50010); java.net.SocketTimeoutException: 600000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.33.2.44:50010 remote=/172.33.2.17:40038]

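3 Analysis

  • HDFS itself was ruled out as the cause before the fix. The DataNode errors are a side effect of the HBase HMaster failing to start: 172.33.2.17, the client in those errors, is the active HMaster node (confirmed via ZooKeeper);

  • The HMaster log shows org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet, and the RegionServer log shows Master is not running yet. Together these confirm that the failure comes from the HMaster service being unable to start;

  • The HMaster log also contains: 2018-05-25 10:19:59,868 WARN [hadoop001:60000.activeMasterManager] wal.WALProcedureStore: Unable to read tracker for hdfs://beh/hbase/MasterProcWALs/state-00000000000000040786.log - Missing trailer: size=11 startPos=11. Inspecting the directory hdfs://beh/hbase/MasterProcWALs showed that its total size had reached 1.3 TB;

  • Cause: once the HMaster becomes active, it has to recover the lease on and read each of these many log files. Because the log volume was enormous, this put heavy pressure on the NameNode and exhausted the TCP buffer space, so service recovery took an extremely long time.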
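The directory size check described above can be scripted. The following is a minimal sketch, assuming the hdfs CLI is on the PATH; it parses the output of hdfs dfs -count, whose columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME. The function and class names are illustrative and are not part of Hadoop or HBase.

```python
import subprocess
from typing import NamedTuple

# Path taken from the HMaster logs in this post.
PROC_WAL_DIR = "hdfs://beh/hbase/MasterProcWALs"


class DirUsage(NamedTuple):
    dir_count: int
    file_count: int
    content_size: int  # bytes, before replication
    path: str


def parse_count_line(line: str) -> DirUsage:
    """Parse one line of `hdfs dfs -count` output."""
    dirs, files, size, path = line.split(None, 3)
    return DirUsage(int(dirs), int(files), int(size), path.strip())


def human_size(num_bytes: int) -> str:
    """Format a byte count with binary units (1T = 1024**4 bytes)."""
    value = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if value < 1024 or unit == "T":
            return f"{value:.1f}{unit}"
        value /= 1024
    return f"{value:.1f}T"  # not reached; keeps the return type explicit


def proc_wal_usage(path: str = PROC_WAL_DIR) -> DirUsage:
    """Run `hdfs dfs -count` on the directory (assumes the hdfs CLI exists)."""
    out = subprocess.run(
        ["hdfs", "dfs", "-count", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_count_line(out.strip().splitlines()[-1])
```

On a cluster node, human_size(proc_wal_usage().content_size) would give a figure comparable to the 1.3T reported here, and file_count shows how many state logs the master would have to replay.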

 

4 Solution

        Delete the log files under the hdfs://beh/hbase/MasterProcWALs directory.
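The cleanup could be scripted roughly as follows. This is a sketch under stated assumptions, not the exact procedure used in this incident: it builds an hdfs dfs -rm command only for files that match the state-*.log naming seen in the logs, defaults to a dry run, and uses -skipTrash (an assumption, chosen so the space is released immediately). The helper names are hypothetical.

```python
import re
import subprocess
from typing import List

# Directory and file naming taken from the HMaster logs in this post.
PROC_WAL_DIR = "hdfs://beh/hbase/MasterProcWALs"
STATE_LOG = re.compile(r"/state-\d+\.log$")


def is_state_log(path: str) -> bool:
    """True only for state-<digits>.log files directly under PROC_WAL_DIR."""
    if not path.startswith(PROC_WAL_DIR + "/"):
        return False
    return STATE_LOG.fullmatch(path[len(PROC_WAL_DIR):]) is not None


def build_rm_command(paths: List[str], skip_trash: bool = True) -> List[str]:
    """Build the argv for `hdfs dfs -rm`; refuse anything unexpected."""
    bad = [p for p in paths if not is_state_log(p)]
    if bad:
        raise ValueError(f"refusing to delete unexpected paths: {bad}")
    cmd = ["hdfs", "dfs", "-rm"]
    if skip_trash:
        cmd.append("-skipTrash")  # assumption: bypass the HDFS trash
    return cmd + list(paths)


def delete_state_logs(paths: List[str], dry_run: bool = True) -> List[str]:
    """Print the command in dry-run mode, otherwise execute it."""
    cmd = build_rm_command(paths)
    if dry_run:
        print("DRY RUN:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd
```

Running delete_state_logs(paths) first with the default dry_run=True makes it easy to review exactly which files would be removed before committing to the deletion.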

5 Recommendation

        Check the HBase master service regularly. While the master is in an abnormal state, hdfs://beh/hbase/MasterProcWALs grows without bound.
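A periodic check along these lines might look like the sketch below. The thresholds are arbitrary placeholders, and the inputs (directory size, file count, whether a master is active) are assumed to be collected by whatever tooling is already in place, for example hdfs dfs -count and the cluster's existing health checks. All names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    max_bytes: int = 10 * 1024**3  # placeholder: 10 GiB
    max_files: int = 1000          # placeholder


def check_proc_wals(content_size: int, file_count: int, master_active: bool,
                    limits: Thresholds = Thresholds()) -> List[str]:
    """Return a list of alert messages; an empty list means healthy."""
    alerts = []
    if not master_active:
        alerts.append("no active HMaster")
    if content_size > limits.max_bytes:
        alerts.append(
            f"MasterProcWALs size {content_size} bytes exceeds {limits.max_bytes}")
    if file_count > limits.max_files:
        alerts.append(
            f"MasterProcWALs holds {file_count} files, limit {limits.max_files}")
    return alerts
```

Scheduling this from cron or an existing monitoring agent and alerting on a non-empty result would surface the growth long before the directory reaches the terabyte range.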

6 References

https://issues.apache.org/jira/browse/HDFS-3342

https://issues.apache.org/jira/browse/HBASE-14712

https://stackoverflow.com/questions/40223422/hbase-master-wont-start

