Troubleshooting and Fixing a Hadoop Cluster Outage Caused by Full Disks

Posted by 常乐_smile


The Hadoop cluster's disks filled up, which caused services to fail and even brought some machines down.
After the machines were rebooted, both the NameNode and the JournalNode reported errors on startup.
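Before recovering any services, it is worth confirming which mount filled up and what consumed it. A minimal check (the /hadoop path is illustrative; point du at whatever df flags as full):

df -h
du -sh /hadoop/* 2>/dev/null | sort -rh | head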

1. NameNode startup error

2023-03-27 10:23:23,484 INFO ipc.Server (Server.java:logException(2650)) - IPC Server handler 20 on 8020, call Call#26630 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.delete from 192.17.128.187:21662
org.apache.hadoop.ipc.RetriableException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /livy2-covery/v1/batch/6213. Name node is in safe mode.
The reported blocks 229819 needs additional 35467 blocks to reach the threshold 0.9900 of total blocks 267966.
The number of live datanodes 8 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached. NamenodeHostName:ambari-d-146.te.td
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1404)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2850)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1065)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:641)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /livy2-covery/v1/batch/6213. Name node is in safe mode.
The reported blocks 229819 needs additional 35467 blocks to reach the threshold 0.9900 of total blocks 267966.
The number of live datanodes 8 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached. NamenodeHostName:ambari-d-146.te.td
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1412)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1400)
… 12 more

2. JournalNode startup error

2023-03-27 10:18:17,443 WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1162)) - After resync, position is 1024000
2023-03-27 10:18:17,443 WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1157)) - Caught exception after scanning through 0 ops from /hadoop/hdfs/journal/ns74/current/edits_inprogress_0000000000075656572 while determining its valid length. Position was 1024000
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4913)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1153)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)
at org.apache.hadoop.hdfs.qjournal.server.Journal.&lt;init&gt;(Journal.java:155)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:97)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:106)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:201)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)
2023-03-27 10:18:17,443 WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1162)) - After resync, position is 1024000

3. Recovery approach

First recover the failed JournalNode, then recover the NameNode; only after both are fixed should the other services be restarted.
To recover a JournalNode, copy over the complete journal directory from a JournalNode that reported no errors (see the check below).
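One way to judge which JournalNodes are intact is to compare the newest edit log segments in each node's journal directory (the ns74 path comes from the log above; substitute your own nameservice):

ls -l /hadoop/hdfs/journal/ns74/current | tail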

4. Recovering the JournalNode

1) Copy over the journal directory from a JournalNode that has no errors.
The JournalNode on machine 146 is healthy, so pack up its journal directory on 146 and ship it to machine 74:

cd /hadoop/hdfs/
sudo tar -zcvf journal.tar.gz journal
scp journal.tar.gz admin@192.17.128.74:/home/admin/
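Optionally, verify the archive survived the transfer by comparing checksums on both machines (a simple sanity check; the sums must match):

md5sum journal.tar.gz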

2) Log in to machine 74 and move the archive into the HDFS data directory:
cp /home/admin/journal.tar.gz /hadoop/hdfs/journal.tar.gz
cd /hadoop/hdfs
Back up the original journal directory, then extract the copied one:
mv journal journal-bak

sudo tar -zxvf journal.tar.gz
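If the archive is extracted as root, the restored directory may end up root-owned, while the JournalNode normally runs as the hdfs user. A chown along these lines is usually needed (hdfs:hadoop is an assumption; match your cluster's owner and group):

sudo chown -R hdfs:hadoop /hadoop/hdfs/journal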

3) Restart the JournalNode service.
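On an Ambari-managed cluster (which the ambari-* hostnames above suggest), restart the JournalNode from the Ambari UI. On a manually managed cluster, the equivalent is roughly one of the following, depending on the Hadoop version (run as the hdfs user):

hadoop-daemon.sh start journalnode   # Hadoop 2.x
hdfs --daemon start journalnode      # Hadoop 3.x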

5. Recovering the NameNode

Make sure the JournalNode service has recovered, then perform the following steps:

1) Switch to the hdfs user
sudo su - hdfs

2) Force HDFS out of safe mode
hdfs dfsadmin -safemode leave
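The log above shows why the NameNode was stuck: only 229819 of 267966 blocks had been reported, below the 0.9900 threshold. Forcing safe mode off is the workaround when the cluster is known to be healthy; the state can be checked before and after with the standard dfsadmin subcommands:

hdfs dfsadmin -safemode get
hdfs dfsadmin -report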

3) Restart the NameNode.
Once the NameNode has restarted successfully, start the other cluster services, such as YARN, Hive, and HBase.
At this point the cluster should generally return to normal.
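Once everything is up, it can be worth confirming block health; hdfs fsck reports missing, corrupt, or under-replicated blocks (run as the hdfs user; the tail just keeps the summary readable):

hdfs fsck / | tail -n 20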

How to fix a disk whose inodes are exhausted

On a Linux server, the application logs showed it could no longer write files, even though df -h still reported free disk space.

Diagnosis: suspected the machine had run out of inodes; df -i confirmed that the inode table was indeed full.

The investigation proceeded as follows:

1. Ran df -i to check inode usage and confirmed the inodes were exhausted:

[user@host aig_sg_automation_test]$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/VolGroup-LogVol00
3000592 462749 2537843 16% /
tmpfs 1023253 5 1023248 1% /dev/shm
/dev/xvda1 51200 42 51158 1% /boot
172.16.29.199:/backup100/SoftWare
4487457984 552476 4486905508 1% /mnt

2. Changed into each suspect directory and ran for i in ./*; do echo $i; find $i | wc -l; done to count how many inodes (files plus directories) each entry consumes.
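A variant of the same loop that sorts the counts so the biggest inode consumers surface first (a minimal sketch of the same idea):

for i in ./*; do echo "$(find "$i" | wc -l) $i"; done | sort -rn | head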

3. This turned up the directory

/usr/local/was/IBM/WebSphere/AppServer/profiles/aig_sg_automation_test/EBAO_ARCH_HOME/print_archive_data/document/pdf

where 10,000+ PDFs were being generated every day. Deleting the historical files freed the inodes and resolved the problem.
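A hedged sketch for clearing out the history, here keeping the most recent 30 days (the retention window, and the assumption that old PDFs are safe to delete, are both yours to confirm):

cd /usr/local/was/IBM/WebSphere/AppServer/profiles/aig_sg_automation_test/EBAO_ARCH_HOME/print_archive_data/document/pdf
find . -type f -name '*.pdf' -mtime +30 | wc -l   # preview how many files would be removed
find . -type f -name '*.pdf' -mtime +30 -delete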

 
