WARN ReliableDeliverySupervisor: Association with remote system has failed, address is now gated for [5000] ms. Reason: [Disassociated]
Posted: 2015-10-15 18:09:34
Question: I am running the following statements on Spark on AWS:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Wiki(project: String, title: String, count: Int, byte_size: String)
val data = sc.textFile("s3n://+++/").map(_.split(" ")).filter(_.size == 4).map(p => Wiki(p(0), p(1), p(2).trim.toInt, p(3)))
val df = data.toDF()
df.printSchema()
val en_agg_df = df.filter("project = 'en'").select("title","count").groupBy("title").sum().collect()
After running for about 2 hours, the following errors appear:
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@172.31.14.190:42514] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/10/15 17:38:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.31.14.190:42514
15/10/15 17:38:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.31.14.190:42514
15/10/15 17:38:36 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@ip-172-31-14-190.ap-northeast-1.compute.internal:43340] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/10/15 17:38:36 ERROR YarnScheduler: Lost executor 1 on ip-172-31-14-190.ap-northeast-1.compute.internal: remote Rpc client disassociated
15/10/15 17:38:36 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
15/10/15 17:38:36 WARN TaskSetManager: Lost task 4736.0 in stage 0.0 (TID 4736, ip-172-31-14-190.ap-northeast-1.compute.internal): ExecutorLostFailure (executor 1 lost)
15/10/15 17:38:36 INFO DAGScheduler: Executor lost: 1 (epoch 0)
15/10/15 17:38:36 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
15/10/15 17:38:36 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, ip-172-31-14-190.ap-northeast-1.compute.internal, 58890)
15/10/15 17:38:36 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
15/10/15 17:38:36 ERROR YarnScheduler: Lost executor 2 on ip-172-31-14-190.ap-northeast-1.compute.internal: remote Rpc client disassociated
15/10/15 17:38:36 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@ip-172-31-14-190.ap-northeast-1.compute.internal:60961] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/10/15 17:38:36 INFO TaskSetManager: Re-queueing tasks for 2 from TaskSet 0.0
15/10/15 17:38:36 WARN TaskSetManager: Lost task 4735.0 in stage 0.0 (TID 4735, ip-172-31-14-190.ap-northeast-1.compute.internal): ExecutorLostFailure (executor 2 lost)
15/10/15 17:38:36 INFO DAGScheduler: Executor lost: 2 (epoch 0)
15/10/15 17:38:36 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
15/10/15 17:38:36 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, ip-172-31-14-190.ap-northeast-1.compute.internal, 58811)
15/10/15 17:38:36 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
What does this mean, and how can I fix it?
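Comments:
The executors are probably running out of memory. You will need to check the container logs for one of the lost executors, and possibly the YARN NodeManager log on the node where it ran.
@ChristopherB Thank you very much for your comment!
Did it succeed, or did you find any more error information?
@ChristopherB I think you are right. It does look like the executors ran out of memory, because the job runs smoothly once I add more machines to the cluster.
Answer 1:
The answer appears to have been given in the comments already:
It does look like the executors ran out of memory, because the job runs smoothly once I add more machines to the cluster.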
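If adding machines is not an option, another way to act on the same diagnosis is to give each executor more memory. The following is only a minimal sketch, assuming a standalone application on Spark 1.x running on YARN. The application name and the memory sizes are hypothetical and would need tuning to the instance types in the cluster; nothing in the question confirms which values are appropriate.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sizes: adjust to what the YARN containers on your nodes can hold.
val conf = new SparkConf()
  .setAppName("wiki-aggregation") // hypothetical application name
  // Heap size requested for each executor JVM.
  .set("spark.executor.memory", "4g")
  // Extra off-heap headroom per executor, in MB (Spark 1.x on YARN setting).
  .set("spark.yarn.executor.memoryOverhead", "1024")

val sc = new SparkContext(conf)
```

In spark-shell, where sc already exists, the same settings have to be passed at launch instead, for example with --executor-memory 4g and --conf spark.yarn.executor.memoryOverhead=1024.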