Out Of Memory Error while reading 400 thousand rows in Spark SQL [duplicate]
Posted: 2018-11-12 17:24:09

I have some data in Postgres and I am trying to read that data into a Spark DataFrame, but I am getting the error java.lang.OutOfMemoryError: GC overhead limit exceeded. I am using PySpark on a machine with 8 GB of memory.
Here is the code:
import findspark
findspark.init()
from pyspark import SparkContext, SQLContext
sc = SparkContext()
sql_context = SQLContext(sc)
temp_df = sql_context.read.format('jdbc').options(url="jdbc:postgresql://localhost:5432/database",
                                                  dbtable="table_name",
                                                  user="user",
                                                  password="password",
                                                  driver="org.postgresql.Driver").load()
I am very new to the Spark world. I tried the same thing with Python pandas and it worked without any problem, but with Spark I get the following error:
Exception in thread "refresh progress" java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.VectorBuilder.<init>(Vector.scala:713)
at scala.collection.immutable.Vector$.newBuilder(Vector.scala:22)
at scala.collection.immutable.IndexedSeq$.newBuilder(IndexedSeq.scala:46)
at scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:70)
at scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:104)
at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.ui.ConsoleProgressBar$$anonfun$3.apply(ConsoleProgressBar.scala:89)
at org.apache.spark.ui.ConsoleProgressBar$$anonfun$3.apply(ConsoleProgressBar.scala:82)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:82)
at org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:71)
at org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:56)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
Exception in thread "RemoteBlock-temp-file-clean-thread" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.storage.BlockManager$RemoteBlockDownloadFileManager.org$apache$spark$storage$BlockManager$RemoteBlockDownloadFileManager$$keepCleaning(BlockManager.scala:1648)
at org.apache.spark.storage.BlockManager$RemoteBlockDownloadFileManager$$anon$1.run(BlockManager.scala:1615)
2018-11-12 21:48:16 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
2018-11-12 21:48:16 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-11-12 21:48:16 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception in thread Thread[Executor task launch worker for task 0,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-11-12 21:48:16 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.OutOfMemoryError: GC overhead limit exceeded
2018-11-12 21:48:16 ERROR TaskSetManager:70 - Task 0 in stage 0.0 failed 1 times; aborting job
My end goal is to do some processing on large database tables using Spark. Any help would be great.
[Comments]:
Post your code, it will help us understand exactly what you are trying to do.
@AmarGajbhiye I have added the code, please take a look.

[Answer 1]:
I haven't seen your code, but just increase the executor memory, e.g. spark.python.worker.memory.
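A minimal sketch of how that option could be passed when building the context (the 6g values below are assumptions for an 8 GB machine, not something taken from the question):

from pyspark import SparkConf, SparkContext, SQLContext

# Assumed values: 6g is only an example for a machine with 8 GB of RAM.
# spark.python.worker.memory caps the Python worker processes during
# aggregation (they spill to disk above it), while spark.executor.memory
# sizes the JVM heap, which is where a GC overhead error comes from.
conf = SparkConf() \
    .set("spark.python.worker.memory", "6g") \
    .set("spark.executor.memory", "6g")

sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)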
[Discussion]:
I have added the code.. please take a look.
I am not very familiar with PySpark, but I think you should try setting spark.python.worker.memory=6g.

[Answer 2]:
Sorry, but it looks like you don't have enough RAM. Also, Spark is designed for distributed systems with large amounts of data (clusters), so it may not be the best fit for what you are doing.
Kind regards
EDIT: As @LiJianing suggested, you can increase the Spark executor memory.
from pyspark import SparkConf, SparkContext
conf = SparkConf().set("spark.executor.memory", "8g")
sc = SparkContext(conf=conf)
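One caveat, offered as a hedged sketch rather than a definitive fix: the traceback shows the task running on "executor driver", i.e. a local-mode run where the executor lives inside the driver JVM, so the driver heap is what actually has to grow, and it must be set before that JVM is launched. Assuming a plain Python script started with findspark, one way to do that is via PYSPARK_SUBMIT_ARGS (the 6g value is again an assumption for an 8 GB machine):

import os

# Assumption: the driver heap must be sized before the JVM starts, so it is
# passed through PYSPARK_SUBMIT_ARGS instead of SparkConf after startup.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 6g pyspark-shell"

import findspark
findspark.init()

from pyspark import SparkContext, SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)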
[Discussion]:
Can you guide me on how to move the data from Postgres into the Spark environment, any tutorial?
You can move the data from Postgres to Hive using Sqoop and then process it with Spark, right? Yes, you can: community.hortonworks.com/articles/14802/…
Assuming you are using Scala, I found something you might like: ***.com/questions/24916852/…
400K rows is nothing.
Setting the executor memory to 8g also gives the same error.