Spark on K8S Deployment Details

Posted by lanrish


Date: 2020-01-03


This post is based on an Alibaba Cloud ACK managed K8S cluster
and is divided into the following parts:

  • Installing spark-operator on ACK
  • Running a spark wordcount job that reads and writes OSS
  • Installing spark history server on ACK

Installing the Spark operator

Prepare the kubectl and Helm clients

  • Configure the kubectl client on a local or intranet machine.
  • Install helm.

When operating through the CloudShell provided by Aliyun, files are not persisted by default and the connection tends to time out, which causes the spark operator installation to fail; reinstalling then requires manually deleting all of the spark operator's resources.

Installing Helm:

mkdir -pv helm && cd helm
wget https://storage.googleapis.com/kubernetes-helm/helm-v2.9.1-linux-amd64.tar.gz
tar xf helm-v2.9.1-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin
rm -rf linux-amd64

# Check the version; the server version is not shown because the server side has not been installed yet
helm version
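
Helm v2 is client/server: Tiller must be deployed into the cluster before any chart can be installed. A minimal sketch of that step (assuming a test cluster where granting Tiller cluster-admin is acceptable; tighten the RBAC for anything production-facing):

# Create a service account for Tiller and grant it cluster-admin
kubectl -n kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller --clusterrole=cluster-admin --serviceaccount=kube-system:tiller

# Deploy Tiller; afterwards helm version prints both client and server versions
# (if the incubator repo is not configured yet, also add the chart repository
#  that hosts incubator/sparkoperator before installing)
helm init --service-account tiller --wait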

Install the spark operator

helm install incubator/sparkoperator --namespace spark-operator --set sparkJobNamespace=default --set operatorImageName=registry-vpc.us-east-1.aliyuncs.com/eci_open/spark-operator --set operatorVersion=v1beta2-1.0.1-2.4.4 --set enableWebhook=true --set ingressUrlFormat="{{$appName}}.ACK测试域名" --set enableBatchScheduler=true	

Note:

  • operatorImageName: the region here must be changed to the region where the k8s cluster lives; the default Google image cannot be pulled, so we use the image provided by Aliyun. The registry-vpc prefix means the image is pulled from the registry over the internal network.
  • ingressUrlFormat: Aliyun's K8S cluster comes with a test domain, which you can replace with your own.
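
Before creating the service account below, it is worth checking that the operator pod actually came up (a quick sanity check; the release name is whatever helm install printed):

kubectl -n spark-operator get pods
helm status <release-name>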
Next, we need to manually create a serviceaccount so that Spark jobs submitted later have permission to create the pods, configMaps and other resources needed by the driver and executors.

The following creates the default:spark serviceaccount and binds the relevant permissions.
Create spark-rbac.yaml and run kubectl apply -f spark-rbac.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
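
To confirm the account and binding took effect, you can ask the API server directly (standard kubectl; the second command checks whether the spark account may create pods):

kubectl -n default get serviceaccount spark
kubectl auth can-i create pods -n default --as=system:serviceaccount:default:spark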

Spark wordcount reading and writing OSS

This involves the following steps:

  • Prepare the jar packages that OSS depends on
  • Prepare a core-site.xml that supports the OSS file system
  • Build a spark container image that can read and write OSS
  • Prepare the wordcount job

Prepare the jar packages that OSS depends on

Reference: https://help.aliyun.com/document_detail/146237.html?spm=a2c4g.11186623.2.16.4dce2e14IGuHEv
You can run the following directly to download the jar packages that OSS depends on (note the quotes: the URL contains an &):

wget "http://gosspublic.alicdn.com/hadoop-spark/hadoop-oss-hdp-2.6.1.0-129.tar.gz?spm=a2c4g.11186623.2.11.54b56c18VGGAzb&file=hadoop-oss-hdp-2.6.1.0-129.tar.gz" -O hadoop-oss-hdp-2.6.1.0-129.tar.gz

tar -xzvf hadoop-oss-hdp-2.6.1.0-129.tar.gz

hadoop-oss-hdp-2.6.1.0-129/
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-ram-3.0.0.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-core-3.4.0.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-ecs-4.2.0.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-sts-3.0.0.jar
hadoop-oss-hdp-2.6.1.0-129/jdom-1.1.jar
hadoop-oss-hdp-2.6.1.0-129/aliyun-sdk-oss-3.4.1.jar
hadoop-oss-hdp-2.6.1.0-129/hadoop-aliyun-2.7.3.2.6.1.0-129.jar

Prepare core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <!-- OSS configuration -->
    <property>
        <name>fs.oss.impl</name>
        <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
    </property>
    <property>
        <name>fs.oss.endpoint</name>
        <value>oss-cn-hangzhou-internal.aliyuncs.com</value>
    </property>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>{TEMP_AK_ID}</value>
    </property>
    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>{TEMP_AK_SECRET}</value>
    </property>
    <property>
        <name>fs.oss.buffer.dir</name>
        <value>/tmp/oss</value>
    </property>
    <property>
        <name>fs.oss.connection.secure.enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>fs.oss.connection.maximum</name>
        <value>2048</value>
    </property>
</configuration>
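
Before baking this file into an image, it can save a build cycle to verify it locally first. A sketch, assuming a local Hadoop client with the OSS jars from the previous step on its classpath, and with mob-emr-test standing in for your own bucket:

# Point the client at the directory holding the core-site.xml above
export HADOOP_CONF_DIR=/path/to/conf
# Listing the bucket succeeds only if the endpoint and AK settings are correct
hadoop fs -ls oss://mob-emr-test/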

Build an image that supports reading and writing OSS

Download and extract the Spark distribution

wget http://apache.communilink.net/spark/spark-3.0.0-preview/spark-3.0.0-preview-bin-hadoop2.7.tgz
tar -xzvf spark-3.0.0-preview-bin-hadoop2.7.tgz

Build and publish the image

Before building, you need a docker registry; this can be Docker Hub or the remote image service provided by Aliyun.
Here we use Aliyun's Container Registry service.

  1. Log in to the registry with docker
docker login --username=lanrish@1416336129779449 registry.us-east-1.aliyuncs.com

Note:

  • It is recommended to log in with docker set up for sudo-less use; if you log in via sudo docker login, the current user will not be able to build images afterwards.
  • registry.us-east-1.aliyuncs.com depends on the region you chose and is reached over the public network by default. If you create the k8s cluster and the registry in the same region (i.e. on the same VPC), you can append -vpc to registry, i.e. registry-vpc.us-east-1.aliyuncs.com, so that k8s pulls container images quickly over the internal network.
  2. Build the spark image
    Enter the extracted spark directory: cd spark-3.0.0-preview-bin-hadoop2.7
  3. Copy the OSS-dependent jars into the jars directory.
  4. Put the OSS-enabled core-site.xml into the conf directory.
  5. Modify kubernetes/dockerfiles/spark/Dockerfile
    The modified file is shown below; the key changes are lines 19, 34 and 37, which let spark load core-site.xml automatically through the HADOOP_CONF_DIR environment variable. The reason for going to this trouble instead of using a ConfigMap is a bug in spark 3.0; see: https://www.jianshu.com/p/d051aa95b241
  1. FROM openjdk:8-jdk-slim
  2.
  3. ARG spark_uid=185
  4.
  5. # Before building the docker image, first build and make a Spark distribution following
  6. # the instructions in http://spark.apache.org/docs/latest/building-spark.html.
  7. # If this docker file is being used in the context of building your images from a Spark
  8. # distribution, the docker build command should be invoked from the top level directory
  9. # of the Spark distribution. E.g.:
  10. # docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .
  11.
  12. RUN set -ex && \
  13.     apt-get update && \
  14.     ln -s /lib /lib64 && \
  15.     apt install -y bash tini libc6 libpam-modules krb5-user libnss3 && \
  16.     mkdir -p /opt/spark && \
  17.     mkdir -p /opt/spark/examples && \
  18.     mkdir -p /opt/spark/work-dir && \
  19.     mkdir -p /opt/hadoop/conf && \
  20.     touch /opt/spark/RELEASE && \
  21.     rm /bin/sh && \
  22.     ln -sv /bin/bash /bin/sh && \
  23.     echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
  24.     chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
  25.     rm -rf /var/cache/apt/*
  26.
  27. COPY jars /opt/spark/jars
  28. COPY bin /opt/spark/bin
  29. COPY sbin /opt/spark/sbin
  30. COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
  31. COPY examples /opt/spark/examples
  32. COPY kubernetes/tests /opt/spark/tests
  33. COPY data /opt/spark/data
  34. COPY conf/core-site.xml /opt/hadoop/conf
  35. ENV SPARK_HOME /opt/spark
  36. ENV HADOOP_HOME /opt/hadoop
  37. ENV HADOOP_CONF_DIR /opt/hadoop/conf
  38. WORKDIR /opt/spark/work-dir
  39. RUN chmod g+w /opt/spark/work-dir
  40.
  41. ENTRYPOINT [ "/opt/entrypoint.sh" ]
  42.
  43. # Specify the User that the actual main process will run as
  44. USER ${spark_uid}
  6. Build and publish the image

# Build the image
./bin/docker-image-tool.sh -r registry.us-east-1.aliyuncs.com/engineplus -t 3.0.0 build
# Publish the image
docker push registry.us-east-1.aliyuncs.com/engineplus/spark:3.0.0
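
As an optional sanity check that core-site.xml actually landed in the image (plain docker; the image name matches the build above):

docker run --rm --entrypoint ls registry.us-east-1.aliyuncs.com/engineplus/spark:3.0.0 /opt/hadoop/conf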

If additional dependencies need to be baked into the image, use the following approach instead:
from the extracted spark directory spark-3.0.0-preview-bin-hadoop2.7, build a custom image directly with the Dockerfile (note the trailing . for the build context):

docker build -t registry.us-east-1.aliyuncs.com/engineplus/spark:3.0.0 -f kubernetes/dockerfiles/spark/Dockerfile .

The custom dependencies themselves are declared in kubernetes/dockerfiles/spark/Dockerfile.

Prepare the wordcount job

The wordcount job can be cloned from: https://github.com/i-mine/spark_k8s_wordcount
After cloning, simply run mvn clean package
to get the wordcount jar: target/spark_k8s_wordcount-1.0-SNAPSHOT.jar
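
The spark operator submission path below expects the jar to already live in OSS. One way to upload it is Aliyun's ossutil CLI (a sketch, assuming ossutil is installed and configured with your credentials; the target path matches the examples in this post):

ossutil cp target/spark_k8s_wordcount-1.0-SNAPSHOT.jar oss://mob-emr-test/lei.du/lib/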

1. Submitting with spark-submit

Note: with this method a local jar can be uploaded, but the local submission environment must already have the hadoop OSS configuration in place.

bin/spark-submit --master k8s://https://192.168.17.175:6443 --deploy-mode cluster --name com.mobvista.dataplatform.WordCount --class com.mobvista.dataplatform.WordCount --conf spark.kubernetes.file.upload.path=oss://mob-emr-test/lei.du/tmp --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=registry.us-east-1.aliyuncs.com/engineplus/spark:3.0.0-oss /home/hadoop/dulei/spark-3.0.0-preview2-bin-hadoop2.7/spark_k8s_wordcount-1.0-SNAPSHOT.jar
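
After submission the driver and executor pods appear in the default namespace, so plain kubectl is enough to follow the run (the driver pod name varies with the app name):

kubectl get pods -w
# once the driver pod appears:
kubectl logs -f <driver-pod-name>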

2. spark operator 提交

注: 这种提交方式中,spark依赖的jar只可以是镜像中已经存在的或者是通过远程访问,无法自动将本地的jar上传给spark作业,需要自己手动上传到oss或者s3,且spark镜像中已经存在oss或者s3的访问配置和依赖的jar.
编写spark operator word-count.yaml,这种方式需要提前将jar包打包到镜像中,或者上传到云上.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: wordcount
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "registry.us-east-1.aliyuncs.com/engineplus/spark:3.0.0-oss"
  imagePullPolicy: IfNotPresent
  mainClass: com.mobvista.dataplatform.WordCount
  mainApplicationFile: "oss://mob-emr-test/lei.du/lib/spark_k8s_wordcount-1.0-SNAPSHOT.jar"
  sparkVersion: "3.0.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 2
    onFailureRetryInterval: 5
    onSubmissionFailureRetries: 2
    onSubmissionFailureRetryInterval: 10
  timeToLiveSeconds: 3600
  sparkConf:
    "spark.kubernetes.allocation.batch.size": "10"
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "oss://mob-emr-test/lei.du/tmp/logs"
  hadoopConfigMap: oss-hadoop-dir
  driver:
    cores: 1
    memory: "1024m"
    labels:
      version: 3.0.0
      spark-app: spark-wordcount
      role: driver
    annotations:
      k8s.aliyun.com/eci-image-cache: "true"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "1024m"
    labels:
      version: 3.0.0
      role: executor
    annotations:
      k8s.aliyun.com/eci-image-cache: "true"
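
Apply the manifest and the operator handles submission from there (standard kubectl; the full describe output is shown below):

kubectl apply -f word-count.yaml
kubectl get sparkapplication wordcount -n default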

While the job is running, we can fetch the ingress-url to access the Web UI and watch the job's status, though it can no longer be viewed once the job finishes:

$ kubectl describe sparkapplication
Name:         wordcount
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sparkoperator.k8s.io/v1beta2","kind":"SparkApplication","metadata":{"annotations":{},"name":"wordcount","namespace":"defaul...
API Version:  sparkoperator.k8s.io/v1beta2
Kind:         SparkApplication
Metadata:
  Creation Timestamp:  2020-01-03T08:18:58Z
  Generation:          2
  Resource Version:    53192098
  Self Link:           /apis/sparkoperator.k8s.io/v1beta2/namespaces/default/sparkapplications/wordcount
  UID:                 b0b1ff99-2e01-11ea-bf95-7e8505108e63
Spec:
  Driver:
    Annotations:
      k8s.aliyun.com/eci-image-cache:  true
    Cores:  1
    Labels:
      Role:         driver
      Spark - App:  spark-wordcount
      Version:      3.0.0
    Memory:           1024m
    Service Account:  spark
  Executor:
    Annotations:
      k8s.aliyun.com/eci-image-cache:  true
    Cores:      1
    Instances:  1
    Labels:
      Role:     executor
      Version:  3.0.0
    Memory:     1024m
  Image:                  registry.us-east-1.aliyuncs.com/engineplus/spark:3.0.0-oss-wordcount
  Image Pull Policy:      IfNotPresent
  Main Application File:  /opt/spark/jars/spark_k8s_wordcount-1.0-SNAPSHOT.jar
  Main Class:             WordCount
  Mode:                   cluster
  Restart Policy:
    On Failure Retries:                    2
    On Failure Retry Interval:             5
    On Submission Failure Retries:         2
    On Submission Failure Retry Interval:  10
    Type:                                  OnFailure
  Spark Conf:
    spark.kubernetes.allocation.batch.size:  10
  Spark Version:         3.0.0
  Time To Live Seconds:  3600
  Type:                  Scala
Status:
  Application State:
    Error Message:  driver pod failed with ExitCode: 1, Reason: Error
    State:          FAILED
  Driver Info:
    Pod Name:                wordcount-driver
    Web UI Address:          172.21.14.219:4040
    Web UI Ingress Address:  wordcount.cac1e2ca4865f4164b9ce6dd46c769d59.us-east-1.alicontainer.com
    Web UI Ingress Name:     wordcount-ui-ingress
    Web UI Port:             4040
    Web UI Service Name:     wordcount-ui-svc
  Execution Attempts:            3
  Last Submission Attempt Time:  2020-01-03T08:21:51Z
  Spark Application Id:          spark-4c66cd4e3e094571844bbc355a1b6a16
  Submission Attempts:           1
  Submission ID:                 e4ce0cb8-7719-4c6f-ade1-4c13e137de77
  Termination Time:              2020-01-03T08:22:01Z
Events:
  Type     Reason                               Age                    From            Message
  ----     ------                               ----                   ----            -------
  Normal   SparkApplicationAdded                7m20s                  spark-operator  SparkApplication wordcount was added, enqueuing it for submission
  Warning  SparkApplicationFailed               6m20s                  spark-operator  SparkApplication wordcount failed: driver pod failed with ExitCode: 101, Reason: Error
  Normal   SparkApplicationSpecUpdateProcessed  5m43s                  spark-operator  Successfully processed spec update for SparkApplication wordcount
  Warning  SparkDriverFailed                    4m47s (x5 over 7m10s)  spark-operator  Driver wordcount-driver failed
  Warning  SparkApplicationPendingRerun         4m32s (x5 over 7m2s)   spark-operator  SparkApplication wordcount is pending rerun
  Normal   SparkApplicationSubmitted            4m27s (x6 over 7m16s)  spark-operator  SparkApplication wordcount was submitted successfully
  Normal   SparkDriverRunning                   4m24s (x6 over 7m14s)  spark-operator  Driver wordcount-driver is running

Installing Spark History Server On K8S

Here we use the Spark History Server provided as a Helm chart.
GitHub: https://github.com/SnappyDataInc/spark-on-k8s/tree/master/charts/spark-hs?spm=5176.2020520152.0.0.2d5916ddP2xqfh
For convenience, we install it directly from Aliyun's application marketplace:
Application page: https://cs.console.aliyun.com/#/k8s/catalog/detail/incubator_ack-spark-history-server

Before creating it, fill in the oss-related configuration, then create it.

Once the installation finishes, you can look up the k8s service to get the spark history server's access address.

After it is created, add two configuration entries when submitting jobs:

 "spark.eventLog.enabled": "true"
 "spark.eventLog.dir": "oss://mob-emr-test/lei.du/tmp/logs"

The logs of jobs submitted this way are then stored in OSS.

