Summary: Troubleshooting and Recovering a Hadoop Cluster Outage Caused by Full Disks
Posted by 常乐_smile
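The disks of a Hadoop cluster filled up, which brought services down and even crashed some machines.
After the machines were rebooted, starting the NameNode and the JournalNode produced errors.
1. NameNode startup error messages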
2023-03-27 10:23:23,484 INFO ipc.Server (Server.java:logException(2650)) - IPC Server handler 20 on 8020, call Call#26630 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.delete from 192.17.128.187:21662
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /livy2-covery/v1/batch/6213. Name node is in safe mode.
The reported blocks 229819 needs additional 35467 blocks to reach the threshold 0.9900 of total blocks 267966.
The number of live datanodes 8 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached. NamenodeHostName:ambari-d-146.te.td
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1404)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2850)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1065)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:641)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /livy2-covery/v1/batch/6213. Name node is in safe mode.
The reported blocks 229819 needs additional 35467 blocks to reach the threshold 0.9900 of total blocks 267966.
The number of live datanodes 8 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached. NamenodeHostName:ambari-d-146.te.td
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1412)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1400)
… 12 more
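The numbers in the safe-mode message above can be checked by hand: the NameNode stays in safe mode until the reported blocks reach the threshold fraction of the total blocks. The following is a minimal sketch of that arithmetic, not Hadoop's own code. The function name blocks_needed is hypothetical, and truncating the product to an integer is an assumption that happens to match the values in this log.

```shell
# blocks_needed REPORTED TOTAL THRESHOLD
# Prints how many more blocks must be reported before the safe-mode
# threshold is met. Truncation of TOTAL * THRESHOLD is an assumption.
blocks_needed() {
  awk -v reported="$1" -v total="$2" -v threshold="$3" 'BEGIN {
    target = int(total * threshold)   # blocks required to leave safe mode
    missing = target - reported
    if (missing < 0) missing = 0      # threshold already reached
    print missing
  }'
}

# Values taken from the log above: prints 35467
blocks_needed 229819 267966 0.99
```

With the logged values, 0.99 × 267966 truncates to 265286, and 265286 − 229819 = 35467, which agrees with the "needs additional 35467 blocks" text in the exception.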
2. JournalNode startup error messages
2023-03-27 10:18:17,443 WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1162)) - After resync, position is 1024000
2023-03-27 10:18:17,443 WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1157)) - Caught exception after scanning through 0 ops from /hadoop/hdfs/journal/ns74/current/edits_inprogress_0000000000075656572 while determining its valid length. Position was 1024000
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4913)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1153)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)
at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:155)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:97)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:106)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:201)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
2023-03-27 10:18:17,443 WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1162)) - After resync, position is 1024000
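3. Approach
First restore the failing JournalNode, then restore the NameNode. Only after both are repaired should the other services be restarted.
To restore the JournalNode, copy the complete journal directory from another JournalNode that reports no errors.
4. Restoring the JournalNode
1) Copy the complete journal directory from a healthy JournalNode.
The JournalNode on machine 146 is healthy, so package its journal directory and send it to machine 74.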
cd /hadoop/hdfs/
sudo tar -zcvf journal.tar.gz journal
scp journal.tar.gz admin@192.17.128.74:/home/admin/
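2) Log in to machine 74
cp /home/admin/journal.tar.gz /hadoop/hdfs/journal.tar.gz
cd /hadoop/hdfs/
Back up the original journal directory, then extract the copied one:
mv journal journal-bak
sudo tar -zxvf journal.tar.gz
3) Restart the JournalNode service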
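Before extracting the archive in step 2), it may be worth confirming that the copy arrived intact and contains only the journal directory, since a truncated transfer would leave the JournalNode with another broken edit log. The sketch below is one possible check under that assumption. The function name check_journal_archive is hypothetical and the check is not part of the original procedure.

```shell
# check_journal_archive ARCHIVE
# Succeeds only if ARCHIVE is a readable gzip tarball that is non-empty
# and whose entries all live under the top-level journal/ directory.
check_journal_archive() {
  archive="$1"
  # tar -tzf lists the entries; it fails on a corrupt or truncated archive.
  listing=$(tar -tzf "$archive" 2>/dev/null) || return 1
  [ -n "$listing" ] || return 1
  # Any entry that is not journal or journal/... makes the check fail.
  if printf '%s\n' "$listing" | grep -Eqv '^journal(/|$)'; then
    return 1
  fi
  return 0
}

# Usage on machine 74 (path taken from the steps above):
#   check_journal_archive /hadoop/hdfs/journal.tar.gz && echo "archive looks complete"
```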
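5. Restoring the NameNode
Make sure the JournalNode service is back to normal, then do the following:
1) Switch to the hdfs user
sudo su - hdfs
2) Take HDFS out of safe mode
hdfs dfsadmin -safemode leave
3) Restart the NameNode
Once the NameNode has restarted successfully, start the other cluster services, such as YARN, Hive, and HBase.
At this point the cluster usually returns to a healthy state.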
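Before starting the other services, the safe-mode state can be confirmed with hdfs dfsadmin -safemode get, which prints a line such as "Safe mode is OFF" (one line per NameNode in an HA setup). The following is a minimal sketch of a check on that output. The function name safemode_all_off is hypothetical, and the exact output wording is assumed to match the form just described.

```shell
# safemode_all_off
# Reads the output of `hdfs dfsadmin -safemode get` on stdin.
# Succeeds only if at least one NameNode reports OFF and none reports ON.
safemode_all_off() {
  awk '
    /Safe mode is ON/  { on++ }
    /Safe mode is OFF/ { off++ }
    END { exit !(off > 0 && on == 0) }
  '
}

# Usage (requires an HDFS client on the PATH):
#   hdfs dfsadmin -safemode get | safemode_all_off && echo "safe to start YARN, Hive, HBase"
```

As an alternative to polling, hdfs dfsadmin -safemode wait blocks until safe mode has been turned off.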
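Fixing a disk whose inodes are exhausted
On a Linux server, the logs showed that the program could no longer write files, yet df -h showed that there was still free disk capacity.
Troubleshooting idea: suspect that the machine's inodes are exhausted. Checking inode usage with df -i confirmed that the inodes were indeed used up.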
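The df -i check described above can also be scripted so that only the filesystems close to inode exhaustion are printed. The sketch below assumes the input comes from df -iP, where -P requests the POSIX format and keeps each filesystem on a single line. The function name inode_hogs and the 90 percent limit in the usage line are illustrative choices, not values from the original incident.

```shell
# inode_hogs LIMIT
# Reads `df -iP` output on stdin and prints "mountpoint IUse%" for every
# filesystem whose inode usage is at or above LIMIT percent.
inode_hogs() {
  awk -v limit="$1" 'NR > 1 {          # skip the header line
    pct = $5
    sub(/%/, "", pct)                  # "16%" -> "16"
    # Filesystems that do not report inodes show "-" and are skipped.
    if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6 " " $5
  }'
}

# Usage:
#   df -iP | inode_hogs 90
```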
Troubleshoot with the following steps:
1. Run df -i to check inode usage per filesystem; it showed that the inodes were full.
$ df -i
Filesystem                         Inodes      IUsed   IFree       IUse%  Mounted on
/dev/mapper/VolGroup-LogVol00      3000592     462749  2537843     16%    /
tmpfs                              1023253     5       1023248     1%     /dev/shm
/dev/xvda1                         51200       42      51158       1%     /boot
172.16.29.199:/backup100/SoftWare  4487457984  552476  4486905508  1%     /mnt
2. Change into a suspect directory and run for i in ./*; do echo "$i"; find "$i" | wc -l; done to count how many inodes each entry in the current directory uses.
3. This showed that
/usr/local/was/IBM/WebSphere/AppServer/profiles/aig_sg_automation_test/EBAO_ARCH_HOME/print_archive_data/document/pdf
receives more than 10,000 newly generated PDF files every day. Delete the historical files directly.
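The per-directory count in step 2 is easier to read when the results are sorted, so that the directory consuming the most inodes comes first. The following is a small sketch of that idea. The function name inode_usage_by_entry is hypothetical, and -xdev is added so that find does not cross into other mounted filesystems.

```shell
# inode_usage_by_entry [DIR]
# Prints "count<TAB>path" for each entry directly under DIR (default: .),
# sorted so that the entry holding the most files and directories is first.
inode_usage_by_entry() {
  dir="${1:-.}"
  for entry in "$dir"/*; do
    [ -e "$entry" ] || continue        # skip the literal glob in an empty directory
    # Every file and directory found under the entry occupies one inode.
    count=$(find "$entry" -xdev | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$count" "$entry"
  done | sort -rn
}

# Usage:
#   inode_usage_by_entry /usr/local | head -n 5
```

Like the original loop, this skips hidden entries because the shell glob does not match names that start with a dot.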