The relationship between Spark programs and their parameters


What is spark.python.worker.memory?

Spark on YARN resource manager: Relation between YARN Containers and Spark Executors?

When running Spark on YARN, each Spark executor runs as a YARN container
Therefore, --executor-memory <= yarn.scheduler.maximum-allocation-mb (the maximum size of a single container).

yarn.scheduler.maximum-allocation-mb <= yarn.nodemanager.resource.memory-mb (the upper limit of memory YARN may use on each node).

Putting the two together:
--executor-memory <= yarn.scheduler.maximum-allocation-mb (max size of one container) <= yarn.nodemanager.resource.memory-mb (memory YARN may use per node)
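A minimal worked check of that chain, with assumed (hypothetical) cluster values for illustration only:

yarn_nodemanager_resource_memory_mb = 16384   # 16 GB of the node handed to YARN (assumed)
yarn_scheduler_maximum_allocation_mb = 8192   # a single container may be at most 8 GB (assumed)
executor_memory_mb = 6144                     # --executor-memory 6g (assumed)
# The request is valid because 6144 <= 8192 <= 16384.
assert executor_memory_mb <= yarn_scheduler_maximum_allocation_mb <= yarn_nodemanager_resource_memory_mb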

executorNum = spark.cores.max / spark.executor.cores

How many tasks can run concurrently on each executor:
taskNum = spark.executor.cores / spark.task.cpus
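A quick worked example of the two formulas above, using assumed values:

spark_cores_max = 24       # spark.cores.max (assumed)
spark_executor_cores = 4   # spark.executor.cores (assumed)
spark_task_cpus = 1        # spark.task.cpus (default)

executor_num = spark_cores_max // spark_executor_cores   # 24 / 4 = 6 executors
task_num = spark_executor_cores // spark_task_cpus       # 4 / 1 = 4 concurrent tasks per executor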

spark.python.worker.memory is a subset of the memory governed by spark.executor.memory; it is the memory used by each Python worker inside an executor.

spark.python.worker.memory <= spark.executor.memory (--executor-memory)

Because of the GIL, PySpark runs multiple Python worker processes inside each executor, one per task.
spark.python.worker.memory tells each Python worker when to spill its data to disk.

If the executor has enough memory, increasing spark.python.worker.memory lets the Python workers use more memory during a shuffle before spilling, which improves performance.
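A minimal PySpark sketch showing where these two settings are declared; the app name and memory values are assumptions for illustration, not recommendations:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("python-worker-memory-demo")        # hypothetical app name
         .config("spark.executor.memory", "4g")        # JVM executor heap (assumed value)
         .config("spark.python.worker.memory", "1g")   # per-Python-worker spill threshold (assumed value)
         .getOrCreate())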

In summary, at runtime PySpark starts several Python task processes inside each executor, and spark.python.worker.memory controls how much memory each of those tasks may use.
So which parameter controls how many Python tasks an executor runs? Is setting spark.python.worker.memory alone enough?

However many tasks run concurrently on an executor, that is how many corresponding pyspark.worker processes there will be.

Where does the memory for spark.yarn.executor.memoryOverhead come from? What is its relationship to spark.executor.memory and the container?
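A hedged sketch of the usual relationship on YARN: the container requested for an executor is the executor heap plus the overhead, where the overhead defaults to roughly max(384 MB, 10% of executor memory); the 6g figure below is assumed.

executor_memory_mb = 6144                                # --executor-memory 6g (assumed)
overhead_mb = max(384, int(0.10 * executor_memory_mb))   # default spark.yarn.executor.memoryOverhead
container_mb = executor_memory_mb + overhead_mb          # what the executor asks YARN for: 6758 MB
# This total, not --executor-memory alone, is what must fit under yarn.scheduler.maximum-allocation-mb.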
