Hive on Spark and Spark Configuration on CDH

Posted by 小基基o_O

Hive on Spark Configuration

The Hive version shipped with CDH 6.3.2 is 2.1.1+cdh6.3.2.

Hive's Default Execution Engine

hive.execution.engine
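
In a Beeline or Hive CLI session, the engine can be inspected and switched with `set`; a minimal sketch:

```sql
-- Show the current engine (mr, tez, or spark), then switch this session to Spark
set hive.execution.engine;
set hive.execution.engine=spark;
```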

Driver Configuration

spark.driver

| Configuration | Description | Recommendation |
| --- | --- | --- |
| spark.driver.memory | Memory for the Driver process | ~10% of the total memory YARN can allocate |
| spark.driver.memoryOverhead | Off-heap memory of each Driver process in cluster mode | Driver memory × 0.1 |
| spark.yarn.driver.memoryOverhead | Much the same as spark.driver.memoryOverhead, but specific to YARN | AM memory × 0.1 |
| spark.driver.cores | Number of cores for the Driver process in cluster mode | |
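
As a hedged illustration of the sizing rules above, the Driver settings can be overridden per Hive session; the numbers are hypothetical examples, not CDH defaults:

```sql
-- Hypothetical sizing: 10 GB Driver heap plus ~10% off-heap overhead
set spark.driver.memory=10g;
set spark.driver.memoryOverhead=1g;  -- Driver memory x 0.1
set spark.driver.cores=2;            -- only takes effect in cluster mode
```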

Executor Configuration

spark.executor

| Configuration | Description | Recommendation |
| --- | --- | --- |
| spark.executor.cores | Number of CPU cores per Executor | 4 |
| spark.executor.memory | Heap memory of the Executor process, used for computing and storing data | |
| spark.executor.memoryOverhead | Off-heap memory of the Executor process, covering JVM overhead, operating-system overhead, etc. | spark.executor.memoryOverhead = spark.executor.memory × 0.1 |
| spark.executor.instances | Number of statically allocated Executors | Do not use static allocation |
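
A sketch applying these rules to a node whose NodeManager offers 16 cores to Executors (the heap size is an assumed example):

```sql
set spark.executor.cores=4;            -- 16 cores / 4 = 4 Executors per node
set spark.executor.memory=16g;         -- assumed heap size for this example
set spark.executor.memoryOverhead=2g;  -- ~0.1 x heap (1.6g rounded up), minimum 384 MiB
```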

The Spark on YARN Memory Model
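
Per the Spark documentation excerpted in the appendix, the YARN container hosting an executor must be large enough for the sum of its heap, overhead, off-heap, and PySpark allocations:

$$\text{container memory} = \texttt{spark.executor.memory} + \texttt{spark.executor.memoryOverhead} + \texttt{spark.memory.offHeap.size} + \texttt{spark.executor.pyspark.memory}$$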

Dynamic Allocation of Executors

spark.dynamicAllocation

| Configuration | Description | Recommendation |
| --- | --- | --- |
| spark.dynamicAllocation.enabled | Whether to adjust the number of Executors dynamically | Enable |
| spark.dynamicAllocation.initialExecutors | Initial number of Executors | |
| spark.dynamicAllocation.minExecutors | Minimum number of Executors | 1 |
| spark.dynamicAllocation.maxExecutors | Maximum number of Executors | |
| spark.dynamicAllocation.executorIdleTimeout | An Executor idle for longer than this is removed | Default 60s |
| spark.dynamicAllocation.schedulerBacklogTimeout | If pending tasks stay backlogged for longer than this, new Executors are requested | Default 1s |
  • Suppose a NodeManager has 16 cores available for Executors:
    with spark.executor.cores set to 4, the node can run up to 4 Executors;
    with spark.executor.cores set to 5, the node can run at most 3 Executors, leaving 1 core unused.
  • Executor counts can be set in two ways: static allocation and dynamic allocation.
    • Dynamic allocation adjusts the Executor count to match a Spark application's workload:
      Executors are added when resources run short and removed when they sit idle.
      Enable it by setting spark.dynamicAllocation.enabled to true; see the sketch below.
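
A minimal sketch of the settings from the table above; the bounds are hypothetical and should be sized to your YARN queue:

```sql
set spark.dynamicAllocation.enabled=true;
set spark.dynamicAllocation.minExecutors=1;     -- recommendation from the table
set spark.dynamicAllocation.maxExecutors=30;    -- hypothetical upper bound
set spark.dynamicAllocation.initialExecutors=2; -- hypothetical starting point
```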

Spark Configuration

The Spark version shipped with CDH 6.3.2 is 2.4.0+cdh6.3.2.

Shuffle Service

  • With dynamic allocation of Executors enabled, the shuffle service lets an Executor be removed while keeping the shuffle files it wrote.
    The external shuffle service must be set up on every worker node.

spark.shuffle.service

| Property Name | Description | Since Version | Recommendation |
| --- | --- | --- | --- |
| spark.shuffle.service.enabled | Enables the external shuffle service, which preserves shuffle files written by Executors, so that finished Executors can be removed safely and shuffle files can still be fetched after an Executor failure | 1.2.0 | Enable |
| spark.shuffle.service.port | Port on which the external shuffle service runs | 1.2.0 | Keep the default |
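
The application-side switches look like this; note that the service itself runs on every NodeManager (in CDH it is enabled on the Spark service in Cloudera Manager, not per query), so this sketch only tells the application where to find it:

```sql
set spark.shuffle.service.enabled=true;
set spark.shuffle.service.port=7337;  -- Spark's default external shuffle port
```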

Configuration Recommendations

  • A big-data cluster is split into management nodes and worker nodes. Suggested sizing:
    management nodes: logical cores : memory (GB) = 1:2 or 1:4
    worker nodes: logical cores : memory (GB) = 1:4 or 1:8

  • Give the NodeManager roughly 80% of each server's resources. For example, if a server has 128 GB of memory and 32 cores (see the arithmetic just below):
    yarn.nodemanager.resource.memory-mb can be set to 100 GB
    yarn.nodemanager.resource.cpu-vcores can be set to 25
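
Applying the 80% guideline to that example server gives the per-node figures used in the table below (rounded to convenient values):

$$128\ \text{GB} \times 0.8 \approx 100\ \text{GB}, \qquad 32 \times 0.8 \approx 25\ \text{vcores}$$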

| Worker node | p101 | p102 | p103 | p104 | Total | Max |
| --- | --- | --- | --- | --- | --- | --- |
| Memory (GB) | 128 | 128 | 128 | 128 | 512 | 128 |
| Logical CPUs (vcores) | 32 | 32 | 32 | 32 | 128 | 32 |
| NM allocatable memory (GB), yarn.nodemanager.resource.memory-mb | 100 | 100 | 100 | 100 | 400 | 100 |
| NM allocatable vcores, yarn.nodemanager.resource.cpu-vcores | 25 | 25 | 25 | 25 | 100 | 25 |
  • MapReduce
    AM memory: 12 GB
    AM vcores: 3
    Map memory: 20 GB (divides evenly into each node's yarn.nodemanager.resource.memory-mb)
    Map vcores: 5 (divides evenly into each node's yarn.nodemanager.resource.cpu-vcores)
    Reduce memory: 20 GB
    Reduce vcores: 5
  • Spark
    spark.driver.memory: 10.8 GB
    spark.driver.memoryOverhead: 1.2 GB
    spark.executor.memory: 18 GB
    spark.executor.memoryOverhead: 2 GB
    spark.executor.cores: 5

In Spark on YARN cluster mode, the Driver is launched inside the ApplicationMaster, so the Driver's memory must be smaller than the AM's memory.

When yarn.nodemanager.resource.memory-mb > 50 GB, a 12 GB Driver budget is recommended, split here as 10.8 GB heap plus 1.2 GB overhead; see the sketch below.
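
Pulling the example numbers together, a hedged sketch of the resulting session settings (fractional gigabytes are written in MB, since Spark memory strings take integer values):

```sql
-- Driver budget: 12g AM container = 10.8g heap + 1.2g overhead
set spark.driver.memory=11059m;         -- ~10.8 GB, expressed in MB
set spark.driver.memoryOverhead=1229m;  -- ~1.2 GB, expressed in MB
-- Executor budget: 20g container = 18g heap + 2g overhead;
-- five 20g/5-core Executors exactly fill a 100g/25-vcore NodeManager
set spark.executor.memory=18g;
set spark.executor.memoryOverhead=2g;
set spark.executor.cores=5;
```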

Appendix

🔉 Vocabulary
  • idle /ˈaɪd(ə)l/ adj. having nothing to do; not in use; v. to do nothing; (of an engine or vehicle) to run without engaging
  • overhead /ˌoʊvərˈhed/ adv. above one's head; adj. positioned overhead; n. operating costs; administrative expenses; indirect costs
  • backlog /ˈbæklɔːɡ/ n. an accumulation of work waiting to be done
  • pending /ˈpendɪŋ/ adj. awaiting decision or settlement; about to happen; prep. until; v. awaiting a ruling or decision
  • pend /pend/ v. to await judgment; to hang

Source: spark.apache.org/docs/latest => Configuration

| Application Properties | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| spark.executor.memory | 1g | Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g). | 0.7.0 |
| spark.executor.memoryOverhead | executorMemory * 0.10, with minimum of 384 | Amount of additional memory to be allocated per executor process, in MiB unless otherwise specified. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%). This option is currently supported on YARN and Kubernetes. Note: Additional memory includes PySpark executor memory (when spark.executor.pyspark.memory is not configured) and memory used by other non-executor processes running in the same container. The maximum memory size of the container running the executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size and spark.executor.pyspark.memory. | 2.3.0 |
| Dynamic Allocation | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| spark.dynamicAllocation.enabled | false | Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload. This requires spark.shuffle.service.enabled or spark.dynamicAllocation.shuffleTracking.enabled to be set. The following configurations are also relevant: spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, spark.dynamicAllocation.initialExecutors and spark.dynamicAllocation.executorAllocationRatio. | 1.2.0 |
| spark.dynamicAllocation.executorIdleTimeout | 60s | If dynamic allocation is enabled and an executor has been idle for more than this duration, the executor will be removed. | 1.2.0 |
| spark.dynamicAllocation.cachedExecutorIdleTimeout | infinity | If dynamic allocation is enabled and an executor which has cached data blocks has been idle for more than this duration, the executor will be removed. | 1.4.0 |
| spark.dynamicAllocation.initialExecutors | spark.dynamicAllocation.minExecutors | Initial number of executors to run if dynamic allocation is enabled. If --num-executors (or spark.executor.instances) is set and larger than this value, it will be used as the initial number of executors. | 1.3.0 |
| spark.dynamicAllocation.maxExecutors | infinity | Upper bound for the number of executors if dynamic allocation is enabled. | 1.2.0 |
| spark.dynamicAllocation.minExecutors | 0 | Lower bound for the number of executors if dynamic allocation is enabled. | 1.2.0 |
| spark.dynamicAllocation.executorAllocationRatio | 1 | By default, dynamic allocation requests enough executors to maximize parallelism according to the number of tasks to process. While this minimizes job latency, with small tasks this setting can waste a lot of resources due to executor allocation overhead, as some executors might not even do any work. This setting lets you set a ratio that reduces the number of executors w.r.t. full parallelism. Defaults to 1.0 for maximum parallelism; 0.5 divides the target number of executors by 2. The target number of executors computed by dynamic allocation can still be overridden by the spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors settings. | 2.4.0 |
| spark.dynamicAllocation.schedulerBacklogTimeout | 1s | If dynamic allocation is enabled and there have been pending tasks backlogged for more than this duration, new executors will be requested. | 1.2.0 |
| spark.dynamicAllocation.sustainedSchedulerBacklogTimeout | schedulerBacklogTimeout | Same as spark.dynamicAllocation.schedulerBacklogTimeout, but used only for subsequent executor requests. | 1.2.0 |
| spark.dynamicAllocation.shuffleTracking.enabled | false | Experimental. Enables shuffle file tracking for executors, which allows dynamic allocation without the need for an external shuffle service. This option tries to keep alive executors that are storing shuffle data for active jobs. | 3.0.0 |
| spark.dynamicAllocation.shuffleTracking.timeout | infinity | When shuffle tracking is enabled, controls the timeout for executors that are holding shuffle data. The default value means Spark relies on the shuffles being garbage collected in order to release executors. If garbage collection is not cleaning up shuffles quickly enough, this option can be used to control when to time out executors even while they are storing shuffle data. | 3.0.0 |
