Investigating YARN Job Delays on a Big Data Cluster
Posted by 轻尘
1. Background
The BI cluster has 60+ nodes holding more than 2 PB of data, and the machines have all been in service for over three years.
2. Symptoms
1. YARN jobs are severely delayed, and sometimes even fail with timeouts.
2. When a delayed YARN job is manually killed and rerun, it usually succeeds.
3. Investigation
At first we suspected a resource shortage at the time the jobs ran, and for a while treated the problem purely as one of insufficient resources.
1. Examined the logs of the delayed jobs.
2. Examined the node logs.
3. Analyzed the tasks and containers of the delayed jobs, and found that one particular node took exceptionally long whenever it executed a task.
4. Narrowed the problem down to that machine and focused the investigation on it:
Tracing the node's logs, the YARN logs looked basically normal, but the Hadoop DataNode log contained exceptions. Searching for the issues suggested by those exception messages led to no conclusion. An excerpt of the DataNode log:
2021-09-27 16:03:12,467 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475378156_3517112822, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected
local=/10.xx.xx.xx:50010 remote=/10.204.114.146:55280]. 447424 millis timeout left.
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:06,623 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475387788_3517121113 src: /10.ee.ee.ee:19545 dest: /10.xx.xx.xx:50010
2021-09-27 16:26:07,228 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.ee.ee.ee:19545, dest: /10.xx.xx.xx:50010, bytes: 5730, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-857996562_138, offset: 0, srvID: ff8d66b8-7176-4c2e-a530-6b5038d64e52, blockid: BP-1382344001-10.204.25.17-1458873906864:blk_4475387788_3517121113, duration: 604633141
2021-09-27 16:26:07,228 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475387788_3517121113, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2021-09-27 16:26:07,399 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2207)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1165)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714, type=HAS_DOWNSTREAM_IN_PIPELINE
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714 received exception java.io.IOException: Connection reset by peer
2021-09-27 16:26:07,406 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode88.bi:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.ee.ee.ee:31015 dst: /10.xx.xx.xx:50010
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,560 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,561 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022, type=HAS_DOWNSTREAM_IN_PIPELINE
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2021-09-27 16:30:39,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022 received exception java.io.IOException: Connection reset by peer
2021-09-27 16:30:39,561 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode88.bi:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.ee.ee.ee:2043 dst: /10.xx.xx.xx:50010
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475389081_3517122514 src: /10.ee.ee.ee:34679 dest: /10.xx.xx.xx:50010
2021-09-27 16:30:40,252 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022 src: /10.ee.ee.ee:6123 dest: /10.xx.xx.xx:50010
2021-09-27 16:30:40,252 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022
2021-09-27 16:30:40,252 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_4475378492_3517117022, RBW
getNumBytes() = 44179902
getBytesOnDisk() = 44179902
getVisibleLength()= 44179902
getVolume() = /data/dfs/data/current
getBlockFile() = /data/dfs/data/current/BP-1382344001-10.204.25.17-1458873906864/current/rbw/blk_4475378492
bytesAcked=44179902
bytesOnDisk=44179902
2021-09-27 18:27:02,533 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2207)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1165)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755 received exception java.io.IOException: Premature EOF from inputStream
2021-09-27 18:27:02,536 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode88.bi:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.ee.ee.ee:3268 dst: /10.xx.xx.xx:50010
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer: Transmitted BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755 (numBytes=6635939) to /10.216.5.16:50010
2021-09-27 18:27:02,759 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755 src: /10.ee.ee.ee:3340 dest: /10.xx.xx.xx:50010
2021-09-27 18:27:02,759 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755
2021-09-27 18:27:02,759 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_4475546259_3517286755, RBW
getNumBytes() = 6635939
getBytesOnDisk() = 6635939
getVisibleLength()= 6635939
getVolume() = /data/dfs/data/current
getBlockFile() = /data/dfs/data/current/BP-1382344001-10.204.25.17-1458873906864/current/rbw/blk_4475546259
bytesAcked=6635939
bytesOnDisk=6635939
2021-09-27 18:27:03,630 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475550789_3517286896 src: /10.ee.ee.ee:25889 dest: /10.xx.xx.xx:50010
2021-09-27 18:27:03,670 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475550790_3517286897 src: /10.ee.ee.ee:48873 dest: /10.xx.xx.xx:50010
2021-09-27 18:27:03,673 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.ee.ee.ee:48873, dest: /10.xx.xx.xx:50010, bytes: 19994, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1365176392_177806, offset: 0, srvID: ff8d66b8-7176-4c2e-a530-6b5038d64e52, blockid: BP-1382344001-10.204.25.17-1458873906864:blk_4475550790_3517286897, duration: 2383047
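The task/container analysis in step 3 above can be driven from the YARN CLI, and a quick frequency count over the aggregated logs helps surface which exception dominates. A minimal sketch; the application ID is hypothetical:

```shell
#!/bin/sh
# Count exception classes in aggregated YARN logs to see which failure
# mode dominates. The application ID below is hypothetical.
top_exceptions() {
  grep -oE '[a-z.]+\.[A-Za-z]+(Exception|Error)' \
    | sort | uniq -c | sort -rn | head
}
# Usage:
#   yarn logs -applicationId application_1632700000000_0001 | top_exceptions
```

Running this against the delayed application's logs would, in a case like the one above, show `java.io.IOException` and `java.io.EOFException` near the top, pointing at the DataNode write pipeline rather than at YARN itself.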
1) Suspected data skew: the data used by some job might be unevenly distributed and concentrated on this node, giving it far more to process and making it exceptionally slow. This was later ruled out.
2) Suspected a configuration problem on the machine. While checking the various configurations we found that the JDK versions were inconsistent: this machine ran OpenJDK 1.8 while most others ran Oracle JDK 1.8, and some also ran OpenJDK 1.8. To rule out the JDK, we switched this machine to Oracle JDK 1.8 and restarted the services, but the problem persisted. (Along the way we discovered several different minor JDK versions across the cluster. One has to wonder: when expanding the cluster, didn't the maintainers consider keeping new nodes consistent with the existing ones?)
We also briefly suspected inconsistent Linux kernel versions on CentOS, since one minor version is reportedly problematic, but a check showed no issue there.
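Auditing JDK and kernel versions across all nodes is easy to script. A sketch, where the inventory file `nodes.txt` and the hostnames are assumptions:

```shell
#!/bin/sh
# Audit JDK and kernel versions across the cluster. The inventory file
# nodes.txt (one hostname per line) is an assumed convention.

# Join the two report lines (java version, kernel) into one tab-separated row.
one_row() { paste -sd '\t' -; }

# Query one node over ssh and print: host<TAB>jdk-version<TAB>kernel-version
node_versions() {
  printf '%s\t' "$1"
  ssh "$1" 'java -version 2>&1 | head -1; uname -r' | one_row
}

# Usage: while read -r h; do node_versions "$h"; done < nodes.txt | sort -k2
# Nodes with a different JDK build or kernel stand out once sorted.
```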
3) Since the machine was very old, we suspected a failing disk. We ran disk checks and hardware fault diagnostics, and every hardware component came back normal, ruling out the disks.
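Disk health on an aging node can be checked with SMART status plus an optional read-only surface scan. A sketch; the device names are illustrative and `smartctl` needs root:

```shell
#!/bin/sh
# Extract the SMART overall-health verdict from smartctl output.
health_line() { grep -i 'overall-health'; }

# Report SMART health for one disk (device names are illustrative).
smart_health() { smartctl -H "$1" | health_line; }

# Usage:
#   for dev in /dev/sda /dev/sdb; do echo "== $dev =="; smart_health "$dev"; done
# A read-only surface scan is much slower but also non-destructive:
#   badblocks -sv /dev/sdb
```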
4) Rebooted the machine. The problem persisted afterwards, but we noticed something important: while this machine was rebooting, the cluster's jobs ran fast. That made it all but certain that this one machine was dragging down the whole cluster.
5) Suspected the network. Initial checks all looked normal, but a sustained ping eventually revealed about 1% packet loss.
6) That pointed to the network cable or the network port. After switching to a different optical port (from port A to port B), the problem was resolved.
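The sustained-ping check in 5) works because a large probe count makes low-rate loss visible in the summary line. A minimal sketch; the target IP is illustrative:

```shell
#!/bin/sh
# Extract the packet-loss percentage from a ping summary line.
loss_of() { grep -o '[0-9.]*% packet loss'; }

# Usage: ~500 probes make 1% loss statistically visible, where a quick
# 4-packet ping would usually show 0% loss.
#   ping -c 500 -i 0.2 10.204.114.146 | loss_of
```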
4. Conclusions
1. A sustained ping is a simple way to tell whether the network is truly healthy; a brief check can easily miss low-rate packet loss.
2. Network problems can cause all sorts of unexpected failures.
The YARN jobs misbehaved, the telltale exceptions were in the DataNode logs, and the root cause turned out to be an optical port. The conclusion of an investigation like this is often the last thing you expect.