Block loss after a cluster restart: org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block length for LocatedBlock

Posted by 扫地增

The Hive query

On 2022-06-10 I ran the following query:

 select 
    date_key ,
    count(1) 
 from test.ods_base_log_inc_test 
 group by date_key 
 order by date_key;

Error log

Error: java.io.IOException: org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block length for LocatedBlockBP-377577966-x.x.x.x-1654672531515:blk_1073741877_1064; getBlockSize()=8630; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.23.147.41:9866,DS-8a2a0685-0c29-4001-b047-a0c1d8821d83,DISK], DatanodeInfoWithStorage[10.23.47.190:9866,DS-b8b148bb-9418-4d84-897f-5e5904afcb56,DISK], DatanodeInfoWithStorage[10.23.214.139:9866,DS-9a41b9ff-20e3-49d7-a752-7ef248e85aa0,DISK]]
	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:420)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:175)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block length for LocatedBlockBP-377577966-10.23.202.0-1654672531515:blk_1073741877_1064; getBlockSize()=8630; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.23.147.41:9866,DS-8a2a0685-0c29-4001-b047-a0c1d8821d83,DISK], DatanodeInfoWithStorage[10.23.47.190:9866,DS-b8b148bb-9418-4d84-897f-5e5904afcb56,DISK], DatanodeInfoWithStorage[10.23.214.139:9866,DS-9a41b9ff-20e3-49d7-a752-7ef248e85aa0,DISK]]
	at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:362)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:269)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:200)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:184)
	at org.apache.hadoop.hdfs.DFSClient.openInternal(DFSClient.java:1043)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1006)
	at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:320)
	at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:316)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:328)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
	at com.hadoop.mapred.DeprecatedLzoLineRecordReader.<init>(DeprecatedLzoLineRecordReader.java:57)
	at com.hadoop.mapred.DeprecatedLzoTextInputFormat.getRecordReader(DeprecatedLzoTextInputFormat.java:156)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:417)
	... 8 more


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 49  Reduce: 1   Cumulative CPU: 93.14 sec   HDFS Read: 2365415 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 1 minutes 33 seconds 140 msec

Error analysis

The error was thrown while the Hive query was running. The stack trace shows it happened in the map phase of the MapReduce job, when a record reader tried to open an input file and the client could not determine the length of the file's last block. In other words, a data block is damaged or in an indeterminate state.
The root cause: the cluster had earlier been shut down for an extended period during an HA upgrade. The front end collects data with Flume, and shutting the cluster down meant the files Flume was writing to HDFS were never closed properly, leaving them holding unreleased write leases. We therefore need to find the files that were never closed and repair them.
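
Before repairing anything in bulk, you can confirm the diagnosis on a single suspect file: hdfs fsck accepts a file path, -files -blocks -locations prints per-block detail, and -openforwrite flags files whose write lease was never released. A minimal check (the path is one of the files found by the directory scan in the next section):

# Inspect one suspect file; -openforwrite marks files that still
# hold an unreleased write lease.
hdfs fsck /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654758627048.lzo -files -blocks -locations -openforwrite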

Fixing the problem

1. Run hdfs fsck /user/hive -openforwrite to find files that were opened for write and never closed (adjust the path to whatever directory you actually need to check; the narrower the path, the more precise the scan). The listing will also include files that are legitimately being written right now. Only the stale files from previous days are abnormal and need repair; today's in-flight files are normal and must be left alone.

hdfs fsck /big-data/ods/common/ods_sc_base_log/ -openforwrite

which returned:

[xxx@xxxx ~]$ hdfs fsck /big-data/ods/common/ods_sc_base_log/ -openforwrite
2022-06-10 15:04:25,074 INFO  [main] Configuration.deprecation (Configuration.java:logDeprecation(1395)) - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
2022-06-10 15:04:25,096 INFO  [main] Configuration.deprecation (Configuration.java:logDeprecation(1395)) - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
Connecting to namenode via http://dw-name-1:9870/fsck?ugi=xpc&openforwrite=1&path=%2Fbig-data%2Fods%2Fcommon%2Fods_sc_base_log
FSCK started by xpc (auth:SIMPLE) from /10.23.202.0 for path /big-data/ods/common/ods_sc_base_log at Fri Jun 10 15:04:25 CST 2022
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654758627048.lzo 8630 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766055769.lzo 1866 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766578649.lzo 1154 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767187931.lzo 830 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767521725.lzo 2223 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768048999.lzo 1323 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768633070.lzo 10487 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-10/action_log_.1654842041420.lzo.tmp 863 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:

Status: HEALTHY
 Number of data-nodes:	4
 Number of racks:		1
 Total dirs:			5
 Total symlinks:		0

Replicated Blocks:
 Total size:	2070105 B
 Total files:	50
 Total blocks (validated):	50 (avg. block size 41402 B)
 Minimally replicated blocks:	42 (84.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	2.52
 Missing blocks:		0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)

Erasure Coded Block Groups:
 Total size:	0 B
 Total files:	0
 Total block groups (validated):	0
 Minimally erasure-coded block groups:	0
 Over-erasure-coded block groups:	0
 Under-erasure-coded block groups:	0
 Unsatisfactory placement block groups:	0
 Average block group size:	0.0
 Missing block groups:		0
 Corrupt block groups:		0
 Missing internal blocks:	0
FSCK ended at Fri Jun 10 15:04:25 CST 2022 in 2 milliseconds


The filesystem under path '/big-data/ods/common/ods_sc_base_log' is HEALTHY

The following files are stuck in an abnormal open-for-write state:

/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654758627048.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766055769.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766578649.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767187931.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767521725.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768048999.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768633070.lzo

2. Recover the lease on each affected file with hdfs debug recoverLease:

hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654758627048.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766055769.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766578649.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767187931.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767521725.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768048999.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768633070.lzo -retries 3
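
With only seven files, running the commands by hand is fine. For a larger batch, the commands can be generated from the fsck output itself; a sketch, assuming the paths contain no spaces and that today's in-flight .tmp files should be skipped:

# Parse the -openforwrite listing, keep only the flagged paths,
# skip files still being written (.tmp), and recover each lease.
hdfs fsck /big-data/ods/common/ods_sc_base_log/ -openforwrite 2>/dev/null \
  | grep 'OPENFORWRITE' \
  | grep -v '\.tmp' \
  | awk '{print $1}' \
  | while read -r f; do
      hdfs debug recoverLease -path "$f" -retries 3
    done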

3. Re-run the fsck check with -openforwrite against the same directory to verify the repair:
[xxx@xxx ~]$ hdfs fsck /big-data/ods/common/ods_sc_base_log/ -openforwrite
2022-06-10 15:33:40,877 INFO  [main] Configuration.deprecation (Configuration.java:logDeprecation(1395)) - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
2022-06-10 15:33:40,899 INFO  [main] Configuration.deprecation (Configuration.java:logDeprecation(1395)) - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
Connecting to namenode via http://dw-name-1:9870/fsck?ugi=xpc&openforwrite=1&path=%2Fbig-data%2Fods%2Fcommon%2Fods_sc_base_log
FSCK started by xpc (auth:SIMPLE) from /10.23.202.0 for path /big-data/ods/common/ods_sc_base_log at Fri Jun 10 15:33:41 CST 2022
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-10/action_log_.1654845647673.lzo.tmp 625 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
Status: HEALTHY
 Number of data-nodes:	4
 Number of racks:		1
 Total dirs:			5
 Total symlinks:		0

Replicated Blocks:
 Total size:	2536981 B
 Total files:	51
 Total blocks (validated):	51 (avg. block size 49744 B)
 Minimally replicated blocks:	50 (98.039215 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	2.9411764
 Missing blocks:		0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)

Erasure Coded Block Groups:
 Total size:	0 B
 Total files:	0
 Total block groups (validated):	0
 Minimally erasure-coded block groups:	0
 Over-erasure-coded block groups:	0
 Under-erasure-coded block groups:	0
 Unsatisfactory placement block groups:	0
 Average block group size:	0.0
 Missing block groups:		0
 Corrupt block groups:		0
 Missing internal blocks:	0
FSCK ended at Fri Jun 10 15:33:41 CST 2022 in 1 milliseconds


The filesystem under path '/big-data/ods/common/ods_sc_base_log' is HEALTHY

The stale files no longer appear; only today's in-flight .tmp file is still open, which is expected. The repair succeeded.
4. Re-run the query; the job now completes and returns results:

hive (test)> select date_key ,count(1) from test.ods_base_log_inc_test group by date_key order by date_key;
Query ID = xpc_20220610154222_f383d2c4-c31c-40c6-bbaf-f748b8fe37df
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1654770980875_0010, Tracking URL = http://dw-name-1:8088/proxy/application_1654770980875_0010/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1654770980875_0010
Hadoop job information for Stage-1: number of mappers: 50; number of reducers: 1
2022-06-10 15:42:29,416 Stage-1 map = 0%,  reduce = 0%
2022-06-10 15:42:36,573 Stage-1 map = 38%,  reduce = 0%, Cumulative CPU 39.12 sec
2022-06-10 15:42:37,594 Stage-1 map = 76%,  reduce = 0%, Cumulative CPU 78.68 sec
2022-06-10 15:42:38,615 Stage-1 map = 92%,  reduce = 0%, Cumulative CPU 98.9 sec
2022-06-10 15:42:39,634 Stage-1 map = 98%,  reduce = 0%, Cumulative CPU 104.24 sec
2022-06-10 15:42:40,654 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 106.04 sec
2022-06-10 15:42:41,678 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 108.53 sec
MapReduce Total cumulative CPU time: 1 minutes 48 seconds 530 msec
Ended Job = job_1654770980875_0010
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1654770980875_0011, Tracking URL = http://dw-name-1:8088/proxy/application_1654770980875_0011/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job  -kill job_1654770980875_0011
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2022-06-10 15:42:50,025 Stage-2 map = 0%,  reduce = 0%
2022-06-10 15:42:56,146 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.31 sec
2022-06-10 15:43:01,246 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3.63 sec
MapReduce Total cumulative CPU time: 3 seconds 630 msec
Ended Job = job_1654770980875_0011
MapReduce Jobs Launched:
Stage-Stage-1: Map: 50  Reduce: 1   Cumulative CPU: 108.53 sec   HDFS Read: 2926070 HDFS Write: 285 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.63 sec   HDFS Read: 7746 HDFS Write: 273 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 52 seconds 160 msec
OK
date_key	_c1
2022-06-07	1474
2022-06-08	4702
2022-06-09	6026
2022-06-10	3520
Time taken: 39.342 seconds, Fetched: 4 row(s)
