Spark cluster on Docker: tuning parameters to squeeze every drop out of the hardware

Posted by 程序员欣宸



Overview

  • This article is a follow-up to 《docker下,极速搭建spark集群(含hdfs集群)》. That article got the Spark cluster running and did a simple verification, but it left a few small problems:

    1. Spark had only one worker node, which is fine for small jobs but makes large jobs take much longer;
    2. the HDFS data directories sat on the same disk as the Docker installation, so uploading a large amount of data could easily fail for lack of disk space;
    3. the master's port 4040 and the workers' web UI ports were not exposed, so there was no way to watch jobs, stages and executors at runtime;
  • In this article we tune these settings and fix all three problems.

The original docker-compose.yml

  • Before optimization, docker-compose.yml looked like this:
version: "2.2"
services:
  namenode:
    image: bde2020/hadoop-namenode:1.1.0-hadoop2.7.1-java8
    container_name: namenode
    volumes:
      - hadoop_namenode:/hadoop/dfs/name
      - ./input_files:/input_files
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env
    ports:
      - 50070:50070

  resourcemanager:
    image: bde2020/hadoop-resourcemanager:1.1.0-hadoop2.7.1-java8
    container_name: resourcemanager
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env

  historyserver:
    image: bde2020/hadoop-historyserver:1.1.0-hadoop2.7.1-java8
    container_name: historyserver
    depends_on:
      - namenode
      - datanode1
      - datanode2
    volumes:
      - hadoop_historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env

  nodemanager1:
    image: bde2020/hadoop-nodemanager:1.1.0-hadoop2.7.1-java8
    container_name: nodemanager1
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env

  datanode1:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode1
    depends_on:
      - namenode
    volumes:
      - hadoop_datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode2
    depends_on:
      - namenode
    volumes:
      - hadoop_datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode3
    depends_on:
      - namenode
    volumes:
      - hadoop_datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  master:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: master
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    links:
      - namenode
    expose:
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
      - ./jars:/root/jars

  worker:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker:/conf
      - ./data:/tmp/data

volumes:
  hadoop_namenode:
  hadoop_datanode1:
  hadoop_datanode2:
  hadoop_datanode3:
  hadoop_historyserver:
  • Now let's start optimizing.

Environment

  • The machine used for this exercise is a Lenovo laptop:
    1. CPU: i5-6300HQ (4 cores, 4 threads)
    2. Memory: 16G
    3. Disk: 256G NVMe SSD plus a 500G mechanical drive
    4. OS: Deepin 15
    5. docker: 18.09.1
    6. docker-compose: 1.17.1
    7. spark: 2.3.0
    8. hdfs: 2.7.1

Adjusting the number of worker nodes

  • With 16G of memory to work with, I decided to raise the number of worker nodes from 1 to 6 (6 workers at 2g each is 12G, leaving headroom for the HDFS containers and the OS). After the change the worker containers are configured as follows:
  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data
  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data
  • As shown above, note the volumes settings: each worker maps its own subdirectory under the conf and data directories that sit alongside docker-compose.yml. Only worker1 and worker2 are listed here; worker3 through worker6 follow the same pattern (see the helper sketched below);
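  • To avoid copy-pasting, the worker3-worker6 blocks can be generated with a small shell helper (my own sketch, not part of the original setup); it only prints YAML to paste into docker-compose.yml, and the web UI port mappings are added in a later step:
for i in 3 4 5 6; do
cat <<EOF
  worker${i}:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker${i}
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker${i}
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 808${i}
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    volumes:
      - ./conf/worker${i}:/conf
      - ./data/worker${i}:/tmp/data
EOF
done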

Running out of disk space because of the HDFS data directories

  • First, look at how the HDFS data directory is configured:
volumes:
      - hadoop_datanode1:/hadoop/dfs/data
  • The hadoop_datanode1 volume used above is declared at the bottom of docker-compose.yml with default settings:
volumes:
  hadoop_namenode:
  hadoop_datanode1:
  hadoop_datanode2:
  hadoop_datanode3:
  hadoop_historyserver:
  • With the containers running, run docker inspect datanode1 to check the container details; the volume-related part looks like this:
"Mounts": [
            
                "Type": "volume",
                "Name": "temp_hadoop_datanode1",
                "Source": "/var/lib/docker/volumes/temp_hadoop_datanode1/_data",
                "Destination": "/hadoop/dfs/data",
                "Driver": "local",
                "Mode": "rw",
                "RW": true,
                "Propagation": ""
            
        ]
  • So the HDFS container's data directory actually lives under /var/lib/docker/volumes on the host;
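  • The same information can be pulled out directly with docker inspect's Go template support, for example:
docker inspect -f '{{ json .Mounts }}' datanode1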
  • Run df -m to check the disk situation. As shown below, /var/lib/docker/volumes sits on /dev/nvme0n1p3, which has less than 30G available (29561 1M-blocks); that is clearly not enough for storing a large amount of data, especially since HDFS keeps 3 replicas by default:
root@willzhao-deepin:/data/work/spark/temp# df -m
Filesystem     1M-blocks   Used Available Use% Mounted on
udev             7893      0   7893    0% /dev
tmpfs            1584      4   1581    1% /run
/dev/nvme0n1p3  43927  12107  29561   30% /
tmpfs            7918      0   7918    0% /dev/shm
tmpfs               5      1      5    1% /run/lock
tmpfs            7918      0   7918    0% /sys/fs/cgroup
/dev/nvme0n1p4  87854    181  83169    1% /home
/dev/nvme0n1p1    300      7    293    3% /boot/efi
/dev/sda1      468428 109152 335430   25% /data
tmpfs            1584      1   1584    1% /run/user/108
tmpfs            1584      0   1584    0% /run/user/0
  • The same listing shows that /dev/sda1 (mounted at /data) still has over 300G available, so pointing the HDFS data directories at it relieves the space pressure. The three HDFS datanode services in docker-compose.yml are changed as follows:
  datanode1:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode1
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode2
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode3
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  • Then delete the following block from the bottom of the file:
volumes:
  hadoop_namenode:
  hadoop_datanode1:
  hadoop_datanode2:
  hadoop_datanode3:
  hadoop_historyserver:
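  • After the containers are recreated with the new configuration, a quick sanity check (a suggestion of mine, not from the original article) confirms that DataNode data now lands in the bind-mounted ./hadoop directories on /dev/sda1, and that HDFS reports the larger capacity:
du -sh ./hadoop/datanode1 ./hadoop/datanode2 ./hadoop/datanode3
docker exec namenode hdfs dfsadmin -report | head -n 15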

Opening the master's 4040 port and the workers' web UI ports

  • Being able to watch a job in a UI while it runs makes it much easier to understand what is going on, so the configuration is changed to open the relevant ports;
  • As shown below, 4040 is added to the master's expose list to expose that port, and 4040:4040 is added to ports so the container's 4040 is mapped to port 4040 on the host:
  master:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: master
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    links:
      - namenode
    expose:
      - 4040
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
      - ./jars:/root/jars
  • The workers' web ports need to be opened as well: a worker's web page shows its status and, importantly, the task logs. Since there are several workers, each one has to be mapped to a different host port. In the configuration below, worker1 sets environment.SPARK_WORKER_WEBUI_PORT to 8081, exposes 8081 and maps the container's 8081 to the host's 8081; worker2 uses 8082 in the same three places:
  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8081
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data

  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8082
    ports:
      - 8082:8082
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data  
  • worker3 through worker6 are configured the same way, each with its own web UI port (8083-8086); a quick way to check that the UIs respond is shown below;
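  • Once the cluster is restarted with these mappings, a quick smoke test (my own suggestion) confirms the UIs answer on the host; note that http://localhost:4040 only responds while an application is actually running:
curl -sI http://localhost:8080 | head -n 1
curl -sI http://localhost:8081 | head -n 1
curl -sI http://localhost:8086 | head -n 1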

  • That completes the changes; the final docker-compose.yml looks like this:
version: "2.2"
services:
  namenode:
    image: bde2020/hadoop-namenode:1.1.0-hadoop2.7.1-java8
    container_name: namenode
    volumes:
      - ./hadoop/namenode:/hadoop/dfs/name
      - ./input_files:/input_files
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env
    ports:
      - 50070:50070

  resourcemanager:
    image: bde2020/hadoop-resourcemanager:1.1.0-hadoop2.7.1-java8
    container_name: resourcemanager
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env

  historyserver:
    image: bde2020/hadoop-historyserver:1.1.0-hadoop2.7.1-java8
    container_name: historyserver
    depends_on:
      - namenode
      - datanode1
      - datanode2
    volumes:
      - ./hadoop/historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env

  nodemanager1:
    image: bde2020/hadoop-nodemanager:1.1.0-hadoop2.7.1-java8
    container_name: nodemanager1
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env

  datanode1:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode1
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode2
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode3
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  master:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: master
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    links:
      - namenode
    expose:
      - 4040
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
      - ./jars:/root/jars

  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8081
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data

  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8082
    ports:
      - 8082:8082
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data     

  worker3:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker3
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker3
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8083
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8083
    ports:
      - 8083:8083
    volumes:
      - ./conf/worker3:/conf
      - ./data/worker3:/tmp/data

  worker4:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker4
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker4
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8084
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8084
    ports:
      - 8084:8084
    volumes:
      - ./conf/worker4:/conf
      - ./data/worker4:/tmp/data

  worker5:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker5
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker5
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8085
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8085
    ports:
      - 8085:8085
    volumes:
      - ./conf/worker5:/conf
      - ./data/worker5:/tmp/data

  worker6:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker6
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker6
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8086
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8086
    ports:
      - 8086:8086
    volumes:
      - ./conf/worker6:/conf
      - ./data/worker6:/tmp/data
  • Next, let's run an example to verify everything;

Verification

  • In the directory containing docker-compose.yml, create a hadoop.env file with the following content:
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*

HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false

YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource___tracker_address=resourcemanager:8031
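  • As an aside, these variables work because the bde2020 images convert CORE_CONF_*, HDFS_CONF_* and YARN_CONF_* entries into core-site.xml, hdfs-site.xml and yarn-site.xml at startup ("___" becomes "-", "_" becomes "."); if a setting does not seem to take effect, the generated files can be inspected inside a running container (the configuration directory is /etc/hadoop in these images, adjust if yours differs):
docker exec namenode cat /etc/hadoop/core-site.xml
docker exec namenode cat /etc/hadoop/hdfs-site.xml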
  • With docker-compose.yml updated, start the containers:
docker-compose up -d
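  • Before submitting anything, it is worth checking that every container is up and that all 6 workers have registered with the master (the /json endpoint below is the standalone master's status page in JSON form; treat the exact path as something to verify on your Spark version):
docker-compose ps
curl -s http://localhost:8080/json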
  • The Spark application used for this verification analyzes Wikipedia page-view statistics and finds the most visited pages. A pre-built jar is used, so no coding is involved; for the application's source code and development details see 《spark实战之:分析维基百科网站统计数据(java版)》

  • Download the pre-built application jar and the sample data file from GitHub:
wget https://raw.githubusercontent.com/zq2599/blog_demos/master/files/sparkdemo-1.0-SNAPSHOT.jar
wget https://raw.githubusercontent.com/zq2599/blog_demos/master/files/pagecounts-20160801-000000
  • Put the downloaded sparkdemo-1.0-SNAPSHOT.jar into the jars directory next to docker-compose.yml;
  • In the input_files directory next to docker-compose.yml, create an input directory and put the downloaded pagecounts-20160801-000000 file into it;
  • Run the following command to put the whole input directory into HDFS:
docker exec namenode hdfs dfs -put /input_files/input /
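  • Optionally confirm the upload before running the job:
docker exec namenode hdfs dfs -ls /input
docker exec namenode hdfs dfs -du -h /input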
  • Submit the job with the following command, requesting 12 executor cores in total (2 on each of the 6 workers) and 1g of memory per executor:
docker exec -it master spark-submit \
--class com.bolingcavalry.sparkdemo.app.WikiRank \
--executor-memory 1g \
--total-executor-cores 12 \
/root/jars/sparkdemo-1.0-SNAPSHOT.jar \
namenode \
8020
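  • While the job is running, progress can be followed through the ports opened earlier: http://localhost:4040 shows the application UI (jobs, stages, executors), http://localhost:8080 shows the master's view of running and completed applications, and each worker's UI (8081-8086) links to the executor logs. The application's REST endpoint below should also respond while the job runs (it exists in Spark 2.3, but treat the exact path as an assumption to verify):
curl -s http://localhost:4040/api/v1/applications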

Feel free to follow my 51CTO blog: 程序员欣宸
