Resolving frequent restarts of a Flink-to-HBase job caused by an abnormal HMaster component
Project background:
The project analyzes how WeChat users interact with WeChat official accounts and mini programs. Concretely, Flink consumes the WeChat event logs collected in Kafka, processes them in real time with a 30 s tumbling window, and writes the results to ClickHouse, which the front-end UI queries directly. For data safety and to support future requirements, a full backup to HBase is also maintained.
Problem description
Level 0 problem: the Flink job that writes the historical backup to HBase fails
Level 1 problem: the HMaster component of the HBase service is abnormal
Level 2 problem: resource utilization on the HDFS DataNodes is too high
Level 3 problem: the SparkHistory rolling log-cleanup policy is not working
Error log:
HMaster runtime log:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/MasterData/WALs/10.132.138.214,16000,1675157499749/10.132.138.214%2C16000%2C1675157499749.1675157503665 could only be written to 0 of the 1 minReplication nodes. There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
Root cause analysis:
The SparkHistory job logs were not being cleaned up as scheduled, so they gradually filled up the HDFS disks.
As a result, the HMaster failed its self-checks and could no longer handle the data-processing requests sent by the RegionServers. The sketch below shows one way to confirm where the disk space is going.
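A hedged diagnostic sketch (paths follow this cluster's layout; adjust for your environment) for confirming that /spark-history is what is filling the DataNode disks:
# Overall and per-DataNode capacity usage
hdfs dfsadmin -report | grep -E 'Name:|DFS Used%'
# Size of each top-level HDFS directory
hadoop fs -du -h /
# Size of the Spark history event logs specifically
hadoop fs -du -s -h /spark-history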
Solution:
1. Manually delete the stale log data under /spark-history (see the cleanup sketch after this list).
2. Average disk usage on the DataNodes drops back to roughly 20%.
3. A new problem appears when the HMaster service is restarted.
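A minimal cleanup sketch for steps 1 and 2, assuming that event logs older than the retention window are safe to delete (the application id below is a placeholder; verify before removing anything):
# List the oldest entries under /spark-history (size is column 5, date column 6)
hadoop fs -ls /spark-history | sort -k6,7 | head -n 20
# Remove one stale application log, bypassing the trash
hadoop fs -rm -r -skipTrash /spark-history/application_1641264890087_0001
# Confirm DataNode usage has dropped back to a healthy level (~20% here)
hdfs dfsadmin -report | grep 'DFS Used%'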
New problem:
The recovered WAL file has size 0: hadoop fs -ls hdfs://emr-header-1.cluster-215469:9000/hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/recovered.wals/10.132.138.214%2C16000%2C1641264890087.1674868727653
These are the HMaster's metadata write-ahead logs, used when the HMaster recovers from a failure. Because the WALs were abnormal empty files, replaying them during recovery failed repeatedly and the HMaster could not come back up; on inspection, such empty files accounted for more than 2000 of the roughly 3000 files in recovered.wals (the check below shows one way to count them).
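A hedged one-liner (path copied from above; column positions assume standard hadoop fs -ls output) for counting the zero-length files under recovered.wals:
# Count files whose size (column 5) is 0 in the master store's recovered.wals
hadoop fs -ls hdfs://emr-header-1.cluster-215469:9000/hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/recovered.wals | awk 'NF >= 8 && $5 == 0 {print $8}' | wc -l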
HMaster runtime log during the restart:
2023-02-01 16:21:04,722 INFO [StoreOpener-1595e783b53d99cd5eef43b6debb2682-1] regionserver.HStore: 1595e783b53d99cd5eef43b6debb2682/proc created, memstore type=DefaultMemStore, storagePolicy=HOT, verifyBulkLoads=false, parallelPutCountPrintThreshold=50, encoding=NONE, compression=NONE
2023-02-01 16:21:04,832 INFO [master/emr-header-1:16000:becomeActiveMaster] regionserver.HRegion: Replaying edits from hdfs://emr-header-1.cluster-215469:9000/hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/recovered.wals/10.132.138.214%2C16000%2C1641264890087.1674868727653
2023-02-01 16:21:04,836 WARN [master/emr-header-1:16000:becomeActiveMaster] regionserver.HRegion: Failed initialize of region= master:store,,1.1595e783b53d99cd5eef43b6debb2682., starting to roll back memstore
java.io.EOFException: Cannot seek after EOF
at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1447)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initInternal(ProtobufLogReader.java:211)
at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initReader(ProtobufLogReader.java:173)
at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:64)
at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:168)
at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:323)
at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:305)
at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:293)
at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:429)
at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4863)
at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:4769)
at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:1013)
at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:955)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7500)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegionFromTableDir(HRegion.java:7458)
at org.apache.hadoop.hbase.master.region.MasterRegion.open(MasterRegion.java:269)
at org.apache.hadoop.hbase.master.region.MasterRegion.create(MasterRegion.java:309)
at org.apache.hadoop.hbase.master.region.MasterRegionFactory.create(MasterRegionFactory.java:104)
at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:948)
at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2237)
at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:621)
at java.lang.Thread.run(Thread.java:748)
2023-02-01 16:21:04,852 INFO [master/emr-header-1:16000:becomeActiveMaster] regionserver.HRegion: Drop memstore for Store proc in region master:store,,1.1595e783b53d99cd5eef43b6debb2682. , dropped memstoresize: [dataSize=0, getHeapSize=256, getOffHeapSize=0, getCellsCount=0
2023-02-01 16:21:04,852 INFO [master/emr-header-1:16000:becomeActiveMaster] regionserver.HRegion: Closing region master:store,,1.1595e783b53d99cd5eef43b6debb2682.
2023-02-01 16:21:04,878 INFO [master/emr-header-1:16000:becomeActiveMaster] regionserver.HRegion: Closed master:store,,1.1595e783b53d99cd5eef43b6debb2682.
Fix for the new problem:
1. Move the old metadata write-ahead logs aside so that fresh ones are generated (2023-02-01 21:28:01):
hadoop fs -mv /hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/recovered.wals /hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/recovered.wals.bak
2. The HMaster then comes back up automatically.
3. As a result, the Flink job writing the historical backup to HBase recovers on its own as well (a verification sketch follows this list).
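A hedged verification sketch (assumes the hbase shell is on the PATH and the paths shown above) for confirming that the master is back and the empty WALs are safely parked:
# Active master and live regionservers
echo "status 'simple'" | hbase shell
# A fresh WAL directory should appear for the restarted master
hadoop fs -ls /hbase/MasterData/WALs/
# The parked zero-length recovered WALs, kept as a backup until final cleanup
hadoop fs -ls /hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/recovered.wals.bak | head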
Open question:
Why, with the SparkHistory log-cleanup policy correctly configured, does /spark-history still keep the app-logs of every historical run?
(2023-02-01 22:03:53)
Note 1: the SparkHistoryServer log-cleanup settings live in spark-defaults.conf; a hedged sketch of the corresponding entries follows this list.
spark.history.fs.cleaner.enabled = true # whether cleanup is enabled; defaults to false and must be set to true, otherwise the disk fills up
spark.history.fs.cleaner.interval = 1d # how often the cleaner runs
spark.history.fs.cleaner.maxAge = 7d # how long event logs are retained
spark.history.retainedApplications 50 # defaults to 50; number of cached application entries, do not set it too high or it consumes a lot of memory
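A minimal sketch of how these entries might be written into spark-defaults.conf (the conf path below is an assumption; adjust for your deployment). The file is loaded as a Java properties file, so keys and values are conventionally whitespace-separated and '#' starts a comment only at the beginning of a line:
# Hypothetical conf path; on EMR the Spark conf directory may differ
cat >> /etc/spark/conf/spark-defaults.conf <<'EOF'
# enable periodic cleanup of event logs (off by default)
spark.history.fs.cleaner.enabled    true
# how often the cleaner runs
spark.history.fs.cleaner.interval   1d
# delete event logs older than this
spark.history.fs.cleaner.maxAge     7d
# number of application UIs cached by the History Server (default 50)
spark.history.retainedApplications  50
EOF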
Note 2: the explanation offered by Alibaba Cloud, "logs of jobs that ran normally can be deleted automatically, but logs of failed runs, run directories and the like cannot be deleted automatically and must be removed by hand", was tested and turned out not to apply. (2023-02-09 07:01:57)
Resolution of the open question:
After the configuration change, only the Spark ThriftServer component had been restarted; all Spark components, including the Spark Client and SparkHistory, need to be restarted.
After the full restart the problem was confirmed resolved: /spark-history now retains only the last 7 days of app logs (see the check below).
(2023-02-10 15:42:35)
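A hedged check (GNU date assumed; same /spark-history layout as above) that the 7-day retention is actually being enforced:
# Entries under /spark-history older than 7 days (date is column 6);
# this should print nothing once the cleaner has caught up
CUTOFF=$(date -d '7 days ago' +%F)
hadoop fs -ls /spark-history | awk -v c="$CUTOFF" 'NF >= 8 && $6 < c {print $6, $8}'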