搭建Spark所遇过的坑
Posted 格格巫 MMQ!!
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了搭建Spark所遇过的坑相关的知识,希望对你有一定的参考价值。
一.经验
1.Spark Streaming包含三种计算模式:nonstate .stateful .window
2.kafka可通过配置文件使用自带的zookeeper集群
3.Spark一切操作归根结底是对RDD的操作
4.部署Spark任务,不用拷贝整个架包,只需拷贝被修改的文件,然后在目标服务器上编译打包。
5.kafka的log.dirs不要设置成/tmp下的目录,貌似tmp目录有文件数和磁盘容量限制
6.ES的分片类似kafka的partition
7spark Graph根据边集合构建图,顶点集合只是指定图中哪些顶点有效
8.presto集群没必要采用on yarn模式,因为hadoop依赖HDFS,如果部分机器磁盘很小,hadoop会很尴尬,而presto是纯内存计算,不依赖磁盘,独立安装可以跨越多个集群,可以说有内存的地方就可以有presto
9.presto进程一旦启动,JVM server会一直占用内存
10.如果maven下载很慢,很可能是被天朝的GFW墙了,可以在maven安装目录的setting.conf配置文件mirrors标签下加入国内镜像抵制**党的网络封锁,例如:
nexus-aliyun * Nexus aliyun http://maven.aliyun.com/nexus/content/groups/public 11.编译spark,hive on spark就不要加-Phive参数,若需sparkSQL支持hive语法则要加-Phive参数12.通过hive源文件pom.xml查看适配的spark版本,只要打版本保持一致就行,例如spark1.6.0和1.6.2都能匹配
13.打开Hive命令行客户端,观察输出日志是否有打印“SLF4J: Found binding in [jar:file:/work/poa/hive-2.1.0-bin/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]”来判断hive有没有绑定spark
14.kafka的comsumer groupID对于spark direct streaming无效
15.shuffle write就是在一个stage结束计算之后,为了下一个stage可以执行shuffle类的算子,而将每个task处理的数据按key进行分类,将相同key都写入同一个磁盘文件中,而每一个磁盘文件都只属于下游stage的一个task,在将数据写入磁盘之前,会先将数据写入内存缓存中,下一个stage的task有多少个,当前stage的每个task就要创建多少份磁盘文件。
16.单个spark任务的excutor核数不宜设置过高,否则会导致其他JOB延迟
17.数据倾斜只发生在shuffle过程,可能触发shuffle操作的算子有:distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, repartition等
18.运行时删除hadoop数据目录会导致依赖HDFS的JOB失效
19.sparkSQL UDAF中update函数的第二个参数 input: Row 对应的并非DataFrame的行,而是被inputSchema投影了的行
20.Spark的Driver只有在Action时才会收到结果
21.Spark需要全局聚合变量时应当使用累加器(Accumulator)
22.Kafka以topic与consumer group划分关系,一个topic的消息会被订阅它的消费者组全部消费,如果希望某个consumer使用topic的全部消息,可将该组只设一个消费者,每个组的消费者数目不能大于topic的partition总数,否则多出的consumer将无消可费
23.所有自定义类要实现serializable接口,否则在集群中无法生效
24.resources资源文件读取要在Spark Driver端进行,以局部变量方式传给闭包函数
25.DStream流转化只产生临时流对象,如果要继续使用,需要一个引用指向该临时流对象
26.提交到yarn cluster的作业不能直接print到控制台,要用log4j输出到日志文件中
27.HDFS文件路径写法为:hdfs://master:9000/文件路径,这里的master是namenode的hostname,9000是hdfs端口号。
28.不要随意格式化HDFS,这会带来数据版本不一致等诸多问题,格式化前要清空数据文件夹
29.搭建集群时要首先配置好主机名,并重启机器让配置的主机名生效
30.linux批量多机互信, 将pub秘钥配成一个
31小于128M的小文件都会占据一个128M的BLOCK,合并或者删除小文件节省磁盘空间
32.Non DFS Used指的是非HDFS的所有文件
33.spark两个分区方法coalesce和repartition,前者窄依赖,分区后数据不均匀,后者宽依赖,引发shuffle操作,分区后数据均匀
34.spark中数据写入ElasticSearch的操作必须在action中以RDD为单位执行
35.可以通过hive-site.xml修改spark.executor.instances, spark.executor.cores, spark.executor.memory等配置来优化hive on spark执行性能,不过最好配成动态资源分配。
二.基本功能
0.常见问题:
1如果运行程序出现错误:Exception in thread “main” java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory,这是因为项目缺少slf4j-api.jar和slf4j-log4j12.jar这两个jar包导致的错误。 2如果运行程序出现错误:java.lang.NoClassDefFoundError: org/apache/log4j/LogManager,这是因为项目缺少log4j.jar这个jar包 3错误:Exception in thread “main” java.lang.NoSuchMethodError: org.slf4j.MDC.getCopyOfContextMap()Ljava/util/Map,这是因为jar包版本冲突造成的。
1.配置spark-submit (CDH版本)
Exception in thread “main” java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments.handleUnknown(SparkSubmitArguments.scala:451)
at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:178)
at org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:97)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:113)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader
1.
r
u
n
(
U
R
L
C
l
a
s
s
L
o
a
d
e
r
.
j
a
v
a
:
355
)
a
t
j
a
v
a
.
s
e
c
u
r
i
t
y
.
A
c
c
e
s
s
C
o
n
t
r
o
l
l
e
r
.
d
o
P
r
i
v
i
l
e
g
e
d
(
N
a
t
i
v
e
M
e
t
h
o
d
)
a
t
j
a
v
a
.
n
e
t
.
U
R
L
C
l
a
s
s
L
o
a
d
e
r
.
f
i
n
d
C
l
a
s
s
(
U
R
L
C
l
a
s
s
L
o
a
d
e
r
.
j
a
v
a
:
354
)
a
t
j
a
v
a
.
l
a
n
g
.
C
l
a
s
s
L
o
a
d
e
r
.
l
o
a
d
C
l
a
s
s
(
C
l
a
s
s
L
o
a
d
e
r
.
j
a
v
a
:
425
)
a
t
s
u
n
.
m
i
s
c
.
L
a
u
n
c
h
e
r
1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher
1.run(URLClassLoader.java:355)atjava.security.AccessController.doPrivileged(NativeMethod)atjava.net.URLClassLoader.findClass(URLClassLoader.java:354)atjava.lang.ClassLoader.loadClass(ClassLoader.java:425)atsun.misc.LauncherAppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
… 5 more
解决方案:
在spark-env.sh文件中添加:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
2.启动spark-shell时,报错
INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@services07:34965/user/Executor#1736210263] with ID 1
INFO util.RackResolver: Resolved services07 to /default-rack
INFO storage.BlockManagerMasterActor: Registering block manager services07:51154 with 534.5 MB RAM
解决方案:
在spark的spark-env配置文件中配置下列配置项:
将export SPARK_WORKER_MEMORY, export SPARK_DRIVER_MEMORY, export SPARK_YARN_AM_MEMORY的值设置成小于534.5 MB
3.启动spark SQL时,报错:
Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver ") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
解决方案:
在$SPARK_HOME/conf/spark-env.sh文件中配置:
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.6-bin.jar
4.启动spark SQL时,报错:
java.sql.SQLException: Access denied for user 'services02 '@‘services02’ (using password: YES)
解决方案:
检查hive-site.xml的配置项, 有以下这个配置项
5.启动计算任务时报错:
报错信息为:
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
解决方案:
分配的core不够, 多分配几核的CPU
6.启动计算任务时报错:
不断重复出现
status.SparkJobMonitor: 2017-01-04 11:53:51,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:54,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:55,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:56,564 Stage-0_0: 0(+1)/1
解决方案:
资源不够, 分配大点内存, 默认值为512MB.
7.启动Spark作为计算引擎时报错:
报错信息为:
java.io.IOException: Failed on local exception: java.nio.channels.ClosedByInterruptException; Host Details : local host is: “m1/192.168.179.201”; destination host is: “m1”:9000;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
Caused by: java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:681)
17/01/06 11:01:43 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over m2/192.168.179.202:9000 after 9 fail over attempts. Trying to fail over immediately.
解决方案:
出现该问题的原因有多种, 我所遇到的是使用Hive On Spark时报了此错误,解决方案是: 在hive-site.xml文件下正确配置该项
8.启动spark集群时报错,启动命令为:start-mastersh
报错信息:
Exception in thread “main” java.lang.NoClassDefFoundError: org/slf4j/Logger
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
… 7 more
解决方案:
将/home/centos/soft/hadoop/share/hadoop/common/lib目录下的slf4j-api-1.7.5.jar文件,slf4j-log4j12-1.7.5.jar文件和commons-logging-1.1.3.jar文件拷贝到/home/centos/soft/spark/lib目录下
9.启动spark集群时报错,启动命令为:start-mastersh
报错信息:
Exception in thread “main” java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2570)
at java.lang.Class.getMethod0(Class.java:2813)
at java.lang.Class.getMethod(Class.java:1663)
at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader
1.
r
u
n
(
U
R
L
C
l
a
s
s
L
o
a
d
e
r
.
j
a
v
a
:
355
)
a
t
j
a
v
a
.
s
e
c
u
r
i
t
y
.
A
c
c
e
s
s
C
o
n
t
r
o
l
l
e
r
.
d
o
P
r
i
v
i
l
e
g
e
d
(
N
a
t
i
v
e
M
e
t
h
o
d
)
a
t
j
a
v
a
.
n
e
t
.
U
R
L
C
l
a
s
s
L
o
a
d
e
r
.
f
i
n
d
C
l
a
s
s
(
U
R
L
C
l
a
s
s
L
o
a
d
e
r
.
j
a
v
a
:
354
)
a
t
j
a
v
a
.
l
a
n
g
.
C
l
a
s
s
L
o
a
d
e
r
.
l
o
a
d
C
l
a
s
s
(
C
l
a
s
s
L
o
a
d
e
r
.
j
a
v
a
:
425
)
a
t
s
u
n
.
m
i
s
c
.
L
a
u
n
c
h
e
r
1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher
1.run(URLClassLoader.java:355)atjava.security.AccessController.doPrivileged(NativeMethod)atjava.net.URLClassLoader.findClass(URLClassLoader.java:354)atjava.lang.ClassLoader.loadClass(ClassLoader.java:425)atsun.misc.LauncherAppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
… 6 more
解决方案:
官网资料:
https://spark.apache.org/docs/latest/hadoop-provided.html#apache-hadoop
编辑/home/centos/soft/spark/conf/spark-env.sh文件,配置下列配置项:
export SPARK_DIST_CLASSPATH=$(/home/centos/soft/hadoop/bin/hadoop classpath)
10.启动HPL/SQL存储过程时报错:
报错信息:
2017-01-10T15:20:18,491 ERROR [HiveServer2-Background-Pool: Thread-97] exec.TaskRunner: Error in executeTask
java.lang.OutOfMemoryError: PermGen space
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
2017-01-10T15:20:18,491 ERROR [HiveServer2-Background-Pool: Thread-97] ql.Driver: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. PermGen space
2017-01-10T15:20:18,491 INFO [HiveServer2-Background-Pool: Thread-97] ql.Driver: Completed executing command(queryId=centos_20170110152016_240c1b5e-3153-4179-80af-9688fa7674dd); Time taken: 2.113 seconds
2017-01-10T15:20:18,500 ERROR [HiveServer2-Background-Pool: Thread-97] operation.Operation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. PermGen space
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:388)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:244)
at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
Caused by: java.lang.OutOfMemoryError: PermGen space
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
解决方案:
参考资料:
http://blog.csdn.net/xiao_jun_0820/article/details/45038205
出现该问题是因为Spark默认使用全部资源, 而此时主机的内存已用, 应在Spark配置文件中限制内存的大小. 在hive-site.xml文件下配置该项:
spark.driver.extraJavaOptions -XX:PermSize=128M -XX:MaxPermSize=256M
三.Spark常见问题汇总
1.报错信息:
Operation category READ is not supported in state standbyorg.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
Operation category READ is not supported in state standby
解决方案:
查看执行Spark计算的是否处于standby状态, 用浏览器访问该主机:http://m1:50070, 如果处于standby状态, 则不可在处于StandBy机器运行spark计算,应切执行Spark计算的主机从Standby状态切换到Active状态
2.问题出现情景:
Spakr集群的所有运行数据在Master重启是都会丢失
解决方案:
配置spark.deploy.recoveryMode选项为ZOOKEEPER
3.报错信息:
由于Spark在计算的时候会将中间结果存储到/tmp目录,而目前linux又都支持tmpfs,其实就是将/tmp目录挂载到内存当中, 那么这里就存在一个问题,中间结果过多导致/tmp目录写满而出现如下错误
No Space Left on the device(Shuffle临时文件过多)
解决办法:
修改配置文件spark-env.sh,把临时文件引入到一个自定义的目录中去, 即:
export SPARK_LOCAL_DIRS=/home/utoken/datadir/spark/tmp
4.报错信息:
java.lang.OutOfMemory, unable to create new native thread
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
解决方案:
上面这段错误提示的本质是Linux操作系统无法创建更多进程,导致出错,并不是系统的内存不足。因此要解决这个问题需要修改Linux允许创建更多的进程,就需要修改Linux最大进程数。 (1)修改Linux最大进程数
ulimit -a
(2)临时修改允许打开的最大进程数
ulimit -u 65535
(3)临时修改允许打开的文件句柄
ulimit -n 65535
(4)永久修改Linux最大进程数量
sudo vi /etc/security/limits.d/90-nproc.conf
-
soft nproc 60000
root soft nproc unlimited
永久修改用户打开文件的最大句柄数,该值默认1024,一般都会不够,常见错误就是not open file 解决办法:
sudo vi /etc/security/limits.conf
bdata soft nofile 65536
bdata hard nofile 65536
5.问题出现情景:
Worker节点中的work目录占用许多磁盘空间, 这些是Driver上传到worker的文件, 会占用许多磁盘空间.
解决方案:
需要定时做手工清理. 目录地址:/home/centos/soft/spark/work
6.问题出现情景:
spark-shell提交Spark Application如何解决依赖库
解决方案:
利用–driver-class-path选项来指定所依赖的jar文件,注意的是–driver-class-path后如果需要跟着多个jar文件的话,jar文件之间使用冒号:来分割。
7.Spark在发布应用的时候,出现连接不上master
报错信息如下:
INFO AppClient$ClientEndpoint: Connecting to master spark://s1:7077…
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@s1:7077] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
解决方案:
检查所有机器时间是否一致.hosts是否都配置了映射.客户端和服务器端的Scala版本是否一致.Scala版本是否和Spark兼容
8.开发spark应用程序(和Flume-NG结合时)发布应用时可能会报错
报错信息如下:
ERROR ReceiverSupervisorImpl: Stopped receiver with error: org.jboss.netty.channel.ChannelException: Failed to bind to: /192.168.10.156:18800
ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 70)
org.jboss.netty.channel.ChannelException: Failed to bind to: /192.168.10.156:18800
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
Caused by: java.net.BindException: Cannot assign requested address
解决方案:
参考资料:
http://www.tuicool.com/articles/Yfi2eyR
由于spark通过Master发布的时候,会自动选取发送到某一台的worker节点上,所以这里绑定端口的时候,需要选择相应的worker服务器,但是由于我们无法事先了解到,spark发布到哪一台服务器的,所以这里启动报错,是因为在192.168.10.156:18800的机器上面没有启动Driver程序,而是发布到了其他服务器去启动了,所以无法监听到该机器出现问题,所以我们需要设置spark分发包时,发布到所有worker节点机器,或者发布后,我们去寻找发布到了哪一台机器,重新修改绑定IP,重新发布,有一定几率发布成功。
9.使用Hive on Spark时报错:
ERROR XSDB6: Another instance of Derby may have already booted the database /home/bdata/data/metastore_db.
解决方案:
在使用Hive on Spark模式操作hive里面的数据时,报以上错误,原因是因为HIVE采用了derby这个内嵌数据库作为数据库,它不支持多用户同时访问,解决办法就是把derby数据库换成mysql数据库即可
10.找不到hdfs集群名字dfscluster
报错信息:
java.lang.IllegalArgumentException: java.net.UnknownHostException: dfscluster
解决办法:
将
H
A
D
O
O
P
H
O
M
E
/
e
t
c
/
h
a
d
o
o
p
/
h
d
f
s
−
s
i
t
e
.
x
m
l
文件拷贝到
S
p
a
r
k
集群的所有主机的
HADOOP_HOME/etc/hadoop/hdfs-site.xml文件拷贝到Spark集群的所有主机的
HADOOPHOME/etc/hadoop/hdfs−site.xml文件拷贝到Spark集群的所有主机的SPARK_HOME/conf目录下,然后重启Spark集群
cd /home/centos/soft/spark/conf/
for i in 201,202,203;
do scp hdfs-site.xml 192.168.179.$i:/home/centos/soft/spark/conf/;
done
11.在执行yarn集群或者客户端时,报错:
执行指令:
sh $SPARK_HOME/bin/spark-sql --master yarn-client
报如下错误:
Exception in thread “main” java.lang.Exception: When running with master ‘yarn-client’ either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
解决办法:
根据提示,配置HADOOP_CONF_DIR or YARN_CONF_DIR的环境变量即可, 在spark-env.sh文件中配置以下几项:
export HADOOP_HOME=/u01/hadoop-2.6.1
export HADOOP_CONF_DIR=
H
A
D
O
O
P
H
O
M
E
/
e
t
c
/
h
a
d
o
o
p
P
A
T
H
=
HADOOP_HOME/etc/hadoop PATH=
HADOOPHOME/etc/hadoopPATH=PATH:
H
I
V
E
H
O
M
E
/
b
i
n
:
HIVE_HOME/bin:
HIVEHOME/bin:HADOOP_HOME/bin
12.提交spark计算任务时,报错:
报错信息如下:
Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in
[org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 0 on 192.168.10.38: remote Rpc client disassociated
[org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 1 on 192.168.10.38: remote Rpc client disassociated
[org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 2 on 192.168.10.38: remote Rpc client disassociated
[org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 3 on 192.168.10.38: remote Rpc client disassociated
[org.apache.spark.scheduler.TaskSetManager]-[ERROR] Task 3 in stage 0.0 failed 4 times; aborting job
Exception in thread “main” org.apache.spark.SparkException : Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 14, 192.168.10.38): ExecutorLostFailure (executor 3 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
a
p
a
c
h
e
apache
apachespark
s
c
h
e
d
u
l
e
r
scheduler
schedulerDAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
解决方案:
这里遇到的问题主要是因为数据源数据量过大,而机器的内存无法满足需求,导致长时间执行超时断开的情况,数据无法有效进行交互计算,因此有必要增加内存
13.启动Spark计算任务:
长时间等待无反应,并且看到服务器上面的web界面有内存和核心数,但是没有分配,报错信息如下:
status.SparkJobMonitor: 2017-01-04 11:53:51,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:51,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:51,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:51,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:51,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:51,564 Stage-0_0: 0(+1)/1
日志信息显示:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
解决方案:
出现上面的问题主要原因是因为我们通过参数spark.executor.memory设置的内存过大,已经超过了实际机器拥有的内存,故无法执行,需要等待机器拥有足够的内存后,才能执行任务,可以减少任务执行内存,设置小一些即可
14.内存不足或数据倾斜导致Executor Lost(spark-submit提交)
报错信息如下:
TaskSetManager: Lost task 1.0 in stage 6.0 (TID 100, 192.168.10.37): java.lang.OutOfMemoryError: Java heap space 以上是关于搭建Spark所遇过的坑的主要内容,如果未能解决你的问题,请参考以下文章
INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.10.37:57139 (size: 42.0 KB, free: 24.2 MB)
INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.10.38:53816 (size: 42.0 KB, free: 24.2 MB)
INFO TaskSetManager: Starting task 3.0 in stage 6.0 (TID 102, 192.168.10.37, ANY, 2152 bytes)
WARN TaskSetManager: Lost task 1.0 in stage 6.0 (TID 100, 192.168.10.37): java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedOutputStream.(BufferedOutputStream.java:76)
at java.io.BufferedOutputStream.(BufferedOutputStream.java:59)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon
2.
<
i
n
i
t
>
(
U
n
s
a
f
e
R
o
w
S
e
r
i
a
l
i
z
e
r
.
s
c
a
l
a
:
55
)
E
R
R
O
R
T
a
s
k
S
c
h
e
d
u
l
e
r
I
m
p
l
:
L
o
s
t
e
x
e
c
u
t
o
r
6
o
n
192.168.10.37
:
r
e
m
o
t
e