Blocks unreadable after a cluster restart: org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block length for LocatedBlock
Posted by 扫地增
The Hive query
On 2022-06-10 I ran the following query:
select
date_key ,
count(1)
from test.ods_base_log_inc_test
group by date_key
order by date_key;
The error log:
Error: java.io.IOException: org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block length for LocatedBlockBP-377577966-x.x.x.x-1654672531515:blk_1073741877_1064; getBlockSize()=8630; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.23.147.41:9866,DS-8a2a0685-0c29-4001-b047-a0c1d8821d83,DISK], DatanodeInfoWithStorage[10.23.47.190:9866,DS-b8b148bb-9418-4d84-897f-5e5904afcb56,DISK], DatanodeInfoWithStorage[10.23.214.139:9866,DS-9a41b9ff-20e3-49d7-a752-7ef248e85aa0,DISK]]
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:420)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:175)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block length for LocatedBlockBP-377577966-10.23.202.0-1654672531515:blk_1073741877_1064; getBlockSize()=8630; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.23.147.41:9866,DS-8a2a0685-0c29-4001-b047-a0c1d8821d83,DISK], DatanodeInfoWithStorage[10.23.47.190:9866,DS-b8b148bb-9418-4d84-897f-5e5904afcb56,DISK], DatanodeInfoWithStorage[10.23.214.139:9866,DS-9a41b9ff-20e3-49d7-a752-7ef248e85aa0,DISK]]
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:362)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:269)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:200)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:184)
at org.apache.hadoop.hdfs.DFSClient.openInternal(DFSClient.java:1043)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1006)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:320)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:316)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:328)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
at com.hadoop.mapred.DeprecatedLzoLineRecordReader.<init>(DeprecatedLzoLineRecordReader.java:57)
at com.hadoop.mapred.DeprecatedLzoTextInputFormat.getRecordReader(DeprecatedLzoTextInputFormat.java:156)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:417)
... 8 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 49 Reduce: 1 Cumulative CPU: 93.14 sec HDFS Read: 2365415 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 1 minutes 33 seconds 140 msec
Error analysis
The error was thrown while the Hive query was running. The stack trace shows it occurred in the map phase while reading input data: the client could not determine the length of a file's block, which usually means a data block is damaged, missing, or was never completely written.
In our case, the cluster had been shut down for a fairly long time for an HA upgrade. Data is collected on the front end by Flume, and shutting the cluster down meant the HDFS files Flume was writing were never properly closed: their last block was still under construction, so readers cannot obtain its length. We therefore need to find the files that were never closed and recover them.
Fixing the problem
1. Use `hdfs fsck <path> -openforwrite` to list files that are still open for write and were never closed (adjust the path to the directory you actually need to check; the narrower the path, the more precise the result). Note that some of the listed files are being written right now: only the stale files from previous days need fixing, while the current day's in-flight files are normal and must be left alone.
hdfs fsck /big-data/ods/common/ods_sc_base_log/ -openforwrite
This returned:
[xxx@xxxx ~]$ hdfs fsck /big-data/ods/common/ods_sc_base_log/ -openforwrite
2022-06-10 15:04:25,074 INFO [main] Configuration.deprecation (Configuration.java:logDeprecation(1395)) - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
2022-06-10 15:04:25,096 INFO [main] Configuration.deprecation (Configuration.java:logDeprecation(1395)) - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
Connecting to namenode via http://dw-name-1:9870/fsck?ugi=xpc&openforwrite=1&path=%2Fbig-data%2Fods%2Fcommon%2Fods_sc_base_log
FSCK started by xpc (auth:SIMPLE) from /10.23.202.0 for path /big-data/ods/common/ods_sc_base_log at Fri Jun 10 15:04:25 CST 2022
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654758627048.lzo 8630 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766055769.lzo 1866 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766578649.lzo 1154 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767187931.lzo 830 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767521725.lzo 2223 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768048999.lzo 1323 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768633070.lzo 10487 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-10/action_log_.1654842041420.lzo.tmp 863 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
Status: HEALTHY
Number of data-nodes: 4
Number of racks: 1
Total dirs: 5
Total symlinks: 0
Replicated Blocks:
Total size: 2070105 B
Total files: 50
Total blocks (validated): 50 (avg. block size 41402 B)
Minimally replicated blocks: 42 (84.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.52
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
FSCK ended at Fri Jun 10 15:04:25 CST 2022 in 2 milliseconds
The filesystem under path '/big-data/ods/common/ods_sc_base_log' is HEALTHY
The following files from 2022-06-09 are still open for write and need repair:
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654758627048.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766055769.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766578649.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767187931.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767521725.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768048999.lzo
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768633070.lzo
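Rather than copying the stale paths out of the fsck report by hand, they can be extracted automatically. The sketch below assumes the fsck output has been saved to a file; the function name and the rule of skipping `.tmp` files (which Flume is still actively writing) are our own conventions, not part of the HDFS CLI:

```shell
# Extract the paths of files still open for write from saved fsck output,
# skipping in-flight .tmp files that Flume is still writing.
# Usage: extract_open_files /path/to/fsck.out
extract_open_files() {
  grep 'OPENFORWRITE:' "$1" \
    | awk '{print $1}' \
    | grep -v '\.tmp$'
}
```

For example, `hdfs fsck /big-data/ods/common/ods_sc_base_log/ -openforwrite > fsck.out` followed by `extract_open_files fsck.out` prints one stale path per line.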
2. Recover the lease on each affected file with `hdfs debug recoverLease`:
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654758627048.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766055769.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654766578649.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767187931.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654767521725.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768048999.lzo -retries 3
hdfs debug recoverLease -path /big-data/ods/common/ods_sc_base_log/date_key=2022-06-09/action_log_.1654768633070.lzo -retries 3
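Issuing one `recoverLease` command per file gets tedious when many files are affected. A minimal loop sketch, assuming the stale paths have been collected into a text file, one per line (the file name, function name, and `DRY_RUN` switch are illustrative conventions, not HDFS features):

```shell
# Run "hdfs debug recoverLease" for every path listed in a file (one per
# line). Set DRY_RUN=1 to print the commands instead of executing them.
# Usage: recover_leases open_files.txt
recover_leases() {
  while IFS= read -r path; do
    [ -n "$path" ] || continue   # skip blank lines
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "hdfs debug recoverLease -path $path -retries 3"
    else
      hdfs debug recoverLease -path "$path" -retries 3
    fi
  done < "$1"
}
```

Running with `DRY_RUN=1` first lets you review exactly which files will be touched before recovering any leases.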
3. Run the fsck check again to confirm the files have been closed:
[xxx@xxx ~]$ hdfs fsck /big-data/ods/common/ods_sc_base_log/ -openforwrite
2022-06-10 15:33:40,877 INFO [main] Configuration.deprecation (Configuration.java:logDeprecation(1395)) - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
2022-06-10 15:33:40,899 INFO [main] Configuration.deprecation (Configuration.java:logDeprecation(1395)) - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
Connecting to namenode via http://dw-name-1:9870/fsck?ugi=xpc&openforwrite=1&path=%2Fbig-data%2Fods%2Fcommon%2Fods_sc_base_log
FSCK started by xpc (auth:SIMPLE) from /10.23.202.0 for path /big-data/ods/common/ods_sc_base_log at Fri Jun 10 15:33:41 CST 2022
/big-data/ods/common/ods_sc_base_log/date_key=2022-06-10/action_log_.1654845647673.lzo.tmp 625 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
Status: HEALTHY
Number of data-nodes: 4
Number of racks: 1
Total dirs: 5
Total symlinks: 0
Replicated Blocks:
Total size: 2536981 B
Total files: 51
Total blocks (validated): 51 (avg. block size 49744 B)
Minimally replicated blocks: 50 (98.039215 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.9411764
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
FSCK ended at Fri Jun 10 15:33:41 CST 2022 in 1 milliseconds
The filesystem under path '/big-data/ods/common/ods_sc_base_log' is HEALTHY
All the stale files have been closed successfully. The one remaining OPENFORWRITE entry is the current day's `.tmp` file that Flume is actively writing, which is expected.
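This re-check can be automated as a pass/fail step in a recovery script. A sketch, again assuming the fsck output has been saved to a file and treating `.tmp` files as acceptable in-flight writes (both are our own conventions):

```shell
# Fail (non-zero exit) if any non-.tmp file under the checked path is still
# open for write, so the recovery can be verified from a script.
# Usage: verify_no_stale_open_files /path/to/fsck.out
verify_no_stale_open_files() {
  if grep 'OPENFORWRITE:' "$1" | awk '{print $1}' | grep -qv '\.tmp$'; then
    echo "stale open files remain, run recoverLease again" >&2
    return 1
  fi
  echo "ok: no stale open files"
}
```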
4. Re-run the query; the job now completes normally and returns results:
hive (test)> select date_key ,count(1) from test.ods_base_log_inc_test group by date_key order by date_key;
Query ID = xpc_20220610154222_f383d2c4-c31c-40c6-bbaf-f748b8fe37df
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1654770980875_0010, Tracking URL = http://dw-name-1:8088/proxy/application_1654770980875_0010/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1654770980875_0010
Hadoop job information for Stage-1: number of mappers: 50; number of reducers: 1
2022-06-10 15:42:29,416 Stage-1 map = 0%, reduce = 0%
2022-06-10 15:42:36,573 Stage-1 map = 38%, reduce = 0%, Cumulative CPU 39.12 sec
2022-06-10 15:42:37,594 Stage-1 map = 76%, reduce = 0%, Cumulative CPU 78.68 sec
2022-06-10 15:42:38,615 Stage-1 map = 92%, reduce = 0%, Cumulative CPU 98.9 sec
2022-06-10 15:42:39,634 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 104.24 sec
2022-06-10 15:42:40,654 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 106.04 sec
2022-06-10 15:42:41,678 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 108.53 sec
MapReduce Total cumulative CPU time: 1 minutes 48 seconds 530 msec
Ended Job = job_1654770980875_0010
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1654770980875_0011, Tracking URL = http://dw-name-1:8088/proxy/application_1654770980875_0011/
Kill Command = /opt/module/hadoop-3.1.3/bin/mapred job -kill job_1654770980875_0011
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2022-06-10 15:42:50,025 Stage-2 map = 0%, reduce = 0%
2022-06-10 15:42:56,146 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.31 sec
2022-06-10 15:43:01,246 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.63 sec
MapReduce Total cumulative CPU time: 3 seconds 630 msec
Ended Job = job_1654770980875_0011
MapReduce Jobs Launched:
Stage-Stage-1: Map: 50 Reduce: 1 Cumulative CPU: 108.53 sec HDFS Read: 2926070 HDFS Write: 285 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 3.63 sec HDFS Read: 7746 HDFS Write: 273 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 52 seconds 160 msec
OK
date_key _c1
2022-06-07 1474
2022-06-08 4702
2022-06-09 6026
2022-06-10 3520
Time taken: 39.342 seconds, Fetched: 4 row(s)