Deploying Spark + Hadoop + Java + Scala + a Recommendation Service in a Container
Posted ai360
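Deploying Spark + Hadoop + Java + Scala + your own service in a container
Steps: pull an image from the public registry -> create a container -> install the applications -> package the container -> create an image -> use the image offline
To keep things simple, only the single-node installation of the software above is recorded here. A cluster installation requires changes to the relevant configuration files; consult each project's documentation for that.
Pulling the image from the public registry
Image source: docker.io/centos
docker pull docker.io/centos # download the image
List the images
docker image ls
With the image downloaded locally, a container can now be created from it.
Creating the container
Custom container name: recSys
There are two ways to create the container. The first is typically used while experimenting; the second is recommended for production. The difference is whether the port mappings between container and host are specified up front.
Method 1: no ports specified; adding port mappings later requires editing the container's configuration file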
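docker run -it --name=recSys docker.io/centos /bin/bash # create the container
exit # the container stops when you exit; it must be restarted before you can enter it again
docker start recSys # start the container
docker exec -it recSys /bin/bash # enter the container
Method 2: plan the ports or IP in advance
For this deployment, 8099 is the Hadoop management and monitoring port (the YARN ResourceManager web UI, as set in yarn-site.xml) and 50070 is the HDFS management port (NameNode web UI). Host port 8900 is mapped to container port 8099, host port 50071 to container port 50070, and host port 8002 to container port 8002, which is the port of our own application service.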
[root@AIOps-1 /]# docker run -itd -p 8900:8099 -p 50071:50070 -p 8002:8002 --name=recsys rec_sys_ready /bin/bash # -d runs the container in the background
1eaf173131b47858510ca3e9f4e2f5264fb3ec5de938ae3a123a85b670784357
[root@AIOps-1 /]# docker ps # docker ps -a also lists stopped containers
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1eaf173131b4 rec_sys_ready "/bin/bash" 6 seconds ago Up 5 seconds 0.0.0.0:8002->8002/tcp, 0.0.0.0:8900->8099/tcp, 0.0.0.0:50071->50070/tcp recsys
# A container created this way runs detached, so use exec to enter it. Container names are case-sensitive: this one is recsys.
docker exec -it recsys /bin/bash # enter the container
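The port plan for the second method can be written down as data and turned into the -p flags mechanically. The following is a minimal sketch, not the author's tooling: build_run_cmd is a hypothetical helper name, and it only prints the docker run command for review instead of executing it.

```shell
# Build the -p flags for `docker run` from host:container port pairs.
# Nothing is executed here; the command is only printed.
build_run_cmd() (
    name=$1; image=$2; shift 2
    flags=""
    for pair in "$@"; do
        flags="$flags -p $pair"
    done
    echo "docker run -itd$flags --name=$name $image /bin/bash"
)

# Port plan from the text above, written as host:container.
build_run_cmd recsys rec_sys_ready 8900:8099 50071:50070 8002:8002
```

Keeping the mapping in one place makes it easier to see that the left side of each pair is the host port used in browser URLs and the right side is the port the service listens on inside the container.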
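Installing the applications
Software versions: the JDK, Spark, Hadoop, and Scala are required. Pay attention to version compatibility between them; if you do not know which versions go together, you can download the latest release of each.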
[root@d9243e9eccf1 opt]# ls
derby.log jdk1.8.0_131 recsys spark-1.5.0-bin-hadoop2.6
hadoop-2.7.3 metastore_db scala-2.11.8
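Uploading the software: this means copying from the host into the container. If a package is not on the host yet, upload it to the host first and then copy it into the container.
Taking Spark as an example, the command works like Linux cp: docker cp <file> <container name>:<directory>
# install Spark
[root@AIOps-1 ~]# docker cp spark-1.5.0-bin-hadoop2.6.tgz recSys:/opt/ # copy the package from the host into the container; recSys:/opt is the container name and target directory
# all packages uploaded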
[root@d9243e9eccf1 opt]# ls
jdk-8u131-linux-x64.tar.gz scala-2.11.8.tgz
recsys spark-1.5.0-bin-hadoop2.6.tgz
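Preparation before installing
The downloaded image contains neither an ssh server nor an ssh client. Both must be installed by hand, otherwise starting Hadoop fails.
On CentOS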
yum -y install openssh-server openssh-clients
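On Ubuntu (the client package there is named openssh-client)
apt-get update
apt-get install -y openssh-server openssh-client
Start the ssh service. Create /var/run/sshd by hand first, otherwise sshd reports an error on startup.
mkdir -p /var/run/sshd
Run sshd in the background. Note that after the container is restarted, sshd has to be started again by hand.
/usr/sbin/sshd -D &
Install netstat and check whether sshd is listening on port 22
apt-get install net-tools # or: yum install net-tools
netstat -apn | grep ssh
If port 22 shows up in LISTEN state, sshd started successfully.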
[root@805b87c95cae var]# mkdir -p /var/run/sshd
[root@805b87c95cae var]# /usr/sbin/sshd -D &
[1] 946
[root@805b87c95cae var]# WARNING: 'UsePAM no' is not supported in Fedora and may cause several problems.
[root@805b87c95cae var]# netstat -apn | grep ssh
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 946/sshd
tcp6 0 0 :::22 :::* LISTEN 946/sshd
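The manual netstat check above can be turned into a yes/no test. This is a sketch under assumptions: port_is_listening is a hypothetical helper that reads netstat-style text from standard input, and the sample text is copied from the output above so the snippet can be tried without a running sshd.

```shell
# Succeed if the netstat-style text on stdin has a LISTEN socket on the given port.
port_is_listening() {
    # $1 = port number; require ":<port>" followed by whitespace on a LISTEN line
    grep 'LISTEN' | grep -Eq ":$1[[:space:]]"
}

# Sample lines copied from the netstat output above.
sample='tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 946/sshd
tcp6 0 0 :::22 :::* LISTEN 946/sshd'

if printf '%s\n' "$sample" | port_is_listening 22; then
    echo "sshd is listening on port 22"
else
    echo "sshd is NOT listening on port 22"
fi
```

Inside the container the same helper would be fed live data, for example: netstat -apn | port_is_listening 22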
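Changing the root password inside the container
rpm -e cracklib-dicts --nodeps # remove the existing cracklib-dicts package, ignoring dependencies
yum install cracklib-dicts # reinstall it; passwd needs it to change the password
passwd root # change the password
Installing the software
Installation steps
Extract the archives
Configure environment variables
Software-specific configuration
1 Extract the archives
# extract each package inside the container; the other tools are extracted the same way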
tar -zxvf scala-2.11.8.tgz
tar -zxvf spark-1.5.0-bin-hadoop2.6.tgz
# after extraction
[root@d9243e9eccf1 opt]# ls
jdk1.8.0_131 recsys spark-1.5.0-bin-hadoop2.6
hadoop-2.7.3 metastore_db scala-2.11.8
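Since every package is extracted the same way, the step can be looped. The sketch below assumes all archives sit in one directory and end in .tgz or .tar.gz; extract_all is a hypothetical helper, and the demonstration uses a scratch directory with a tiny stand-in archive rather than the real packages.

```shell
# Extract every .tgz / .tar.gz archive found in a directory, in place.
extract_all() {
    # $1 = directory holding the archives (the article uses /opt)
    for archive in "$1"/*.tgz "$1"/*.tar.gz; do
        [ -e "$archive" ] || continue   # skip a pattern that matched nothing
        tar -zxf "$archive" -C "$1" || return 1
    done
}

# Demonstration on a scratch directory with a small stand-in archive.
scratch=$(mktemp -d)
mkdir "$scratch/demo-1.0" && echo hello > "$scratch/demo-1.0/README"
tar -zcf "$scratch/demo-1.0.tgz" -C "$scratch" demo-1.0
rm -r "$scratch/demo-1.0"

extract_all "$scratch"
ls "$scratch"
```

In the container the call would be extract_all /opt.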
2 Configure environment variables for each tool
Java
[root@d9243e9eccf1 opt]# cd jdk1.8.0_131/
[root@d9243e9eccf1 jdk1.8.0_131]# ls
COPYRIGHT THIRDPARTYLICENSEREADME-JAVAFX.txt db jre release
LICENSE THIRDPARTYLICENSEREADME.txt include lib src.zip
README.html bin javafx-src.zip man
[root@d9243e9eccf1 jdk1.8.0_131]# pwd # JDK path
/opt/jdk1.8.0_131
[root@d9243e9eccf1 jdk1.8.0_131]# export PATH=$PATH:/opt/jdk1.8.0_131/bin:/opt/jdk1.8.0_131/jre/bin # add to PATH
[root@d9243e9eccf1 jdk1.8.0_131]# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/jdk1.8.0_131/bin:/opt/jdk1.8.0_131/jre/bin
[root@d9243e9eccf1 jdk1.8.0_131]# java -version # verify after configuring
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
[root@d9243e9eccf1 jdk1.8.0_131]# javac -version # JDK verified
javac 1.8.0_131
Scala environment variable configuration
[root@d9243e9eccf1 bin]# export PATH=$PATH:/opt/scala-2.11.8/bin/
[root@d9243e9eccf1 bin]# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/jdk1.8.0_131/bin:/opt/jdk1.8.0_131/jre/bin:/opt/scala-2.11.8/bin/
[root@d9243e9eccf1 bin]# cd ..
[root@d9243e9eccf1 scala-2.11.8]# cd ..
[root@d9243e9eccf1 opt]# scala -version # verify the configuration
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
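Spark configuration and verification
Add the Spark bin directory /opt/spark-1.5.0-bin-hadoop2.6/bin/ to PATH, then create spark-env.sh from the template in the conf directory and add the settings shown further down:
/opt/spark-1.5.0-bin-hadoop2.6/conf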
[root@d9243e9eccf1 conf]# ls
docker.properties.template metrics.properties.template spark-env.sh.template
fairscheduler.xml.template slaves.template
log4j.properties.template spark-defaults.conf.template
cp spark-env.sh.template spark-env.sh
Edit spark-env.sh with vim and add the following settings:
export SCALA_HOME=/opt/scala-2.11.8
export JAVA_HOME=/opt/jdk1.8.0_131
export SPARK_MASTER_IP=172.17.0.3
export SPARK_MASTER_PORT=7077
export SPARK_LOCAL_IP=172.17.0.3
export SPARK_DIST_CLASSPATH=$(/opt/hadoop-2.7.3/bin/hadoop classpath)
Start spark-shell
21/02/05 06:03:11 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
21/02/05 06:03:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/02/05 06:03:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
21/02/05 06:03:15 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
21/02/05 06:03:23 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
21/02/05 06:03:23 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.
scala>
Note
Environment variables set with export directly in a terminal are temporary and are lost after a restart. The safer approach is to write them into the .bashrc file:
[root@AIOps-1 ~]# docker exec -it recsys /bin/bash -c "cat ~/.bashrc"
# .bashrc
...... # omitted
JAVA_HOME=/opt/jdk1.8.0_131/jre/bin/
export PATH=$PATH:$JAVA_HOME
export PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/jdk1.8.0_131/jre/bin/:/opt/hadoop-2.7.3/bin/:/opt/scala-2.11.8/bin/:/opt/spark-1.5.0-bin-hadoop2.6/bin/ # full configuration
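When the export lines are written into .bashrc by a script, it helps to make the write idempotent so that repeated runs do not pile up duplicate PATH entries. The sketch below is one way to do that and rests on assumptions: append_once and RC_FILE are hypothetical names, RC_FILE points at a temporary file so the snippet is safe to try, and JAVA_HOME is set to the JDK root (as in the Hadoop section) rather than the jre/bin value shown in the listing above.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    # $1 = line to add, $2 = target file
    touch "$2"
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Stand-in target so the sketch can be tried safely; in the container this would be ~/.bashrc.
RC_FILE=${RC_FILE:-$(mktemp)}

append_once 'export JAVA_HOME=/opt/jdk1.8.0_131' "$RC_FILE"
append_once 'export PATH=$PATH:$JAVA_HOME/bin:/opt/hadoop-2.7.3/bin:/opt/scala-2.11.8/bin:/opt/spark-1.5.0-bin-hadoop2.6/bin' "$RC_FILE"
# A second call with the same line changes nothing.
append_once 'export JAVA_HOME=/opt/jdk1.8.0_131' "$RC_FILE"

cat "$RC_FILE"
```

The single quotes keep $PATH and $JAVA_HOME unexpanded, so they are resolved when the file is sourced by a new shell, not when the line is written.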
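3 Software-specific configuration
Hadoop configuration
1. Set the JDK path in hadoop-env.sh and yarn-env.sh
Open /opt/hadoop-2.7.3/etc/hadoop/hadoop-env.sh and add the JDK path. Note that this is the JDK root directory, not its bin or jre subdirectory.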
export JAVA_HOME=/opt/jdk1.8.0_131
Open /opt/hadoop-2.7.3/etc/hadoop/yarn-env.sh and add the JDK path. Again, this is the JDK root directory, not its bin or jre subdirectory.
export JAVA_HOME=/opt/jdk1.8.0_131
2. Verify the configuration
[root@d9243e9eccf1 hadoop]# hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /opt/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar
3. Configure four files: yarn-site.xml, core-site.xml, hdfs-site.xml, and mapred-site.xml
yarn-site.xml, located in /opt/hadoop-2.7.3/etc/hadoop
First get the Hadoop classpath:
[root@e8d02f7baae1 hadoop]# hadoop classpath
/opt/hadoop-2.7.3/etc/hadoop:/opt/hadoop-2.7.3/share/hadoop/common/lib/*:/opt/hadoop-2.7.3/share/hadoop/common/*:/opt/hadoop-2.7.3/share/hadoop/hdfs:/opt/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/opt/hadoop-2.7.3/share/hadoop/hdfs/*:/opt/hadoop-2.7.3/share/hadoop/yarn/lib/*:/opt/hadoop-2.7.3/share/hadoop/yarn/*:/opt/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.7.3/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
Copy this output into the value field of the yarn.application.classpath property in yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8099</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/opt/hadoop-2.7.3/etc/hadoop:/opt/hadoop-2.7.3/share/hadoop/common/lib/*:/opt/hadoop-2.7.3/share/hadoop/common/*:/opt/hadoop-2.7.3/share/hadoop/hdfs:/opt/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/opt/hadoop-2.7.3/share/hadoop/hdfs/*:/opt/hadoop-2.7.3/share/hadoop/yarn/lib/*:/opt/hadoop-2.7.3/share/hadoop/yarn/*:/opt/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.7.3/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar</value>
</property>
</configuration>
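Copying the long classpath by hand is error-prone, so the property element can be generated instead. This is only a sketch: classpath_property is a hypothetical helper, and a shortened stand-in value is used so the snippet runs without Hadoop installed.

```shell
# Wrap a classpath string in a yarn-site.xml <property> element.
classpath_property() {
    # $1 = classpath string, e.g. the output of `hadoop classpath`
    printf '<property>\n<name>yarn.application.classpath</name>\n<value>%s</value>\n</property>\n' "$1"
}

# Inside the container one would pass "$(hadoop classpath)"; a shortened stand-in is used here.
classpath_property '/opt/hadoop-2.7.3/etc/hadoop:/opt/hadoop-2.7.3/share/hadoop/common/*'
```

The printed element still has to be pasted inside the configuration element of yarn-site.xml; the helper does not edit the file.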
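Before editing core-site.xml, hdfs-site.xml, and mapred-site.xml, create the local filesystem directory /data by hand; the settings in these three files all store their data under /data/hadoop.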
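The local directories named in the three configuration files (/data/hadoop/tmp, /data/hadoop/dfs/name, /data/hadoop/dfs/data, /data/hadoop/var) can be created in one pass. A minimal sketch, assuming nothing beyond mkdir: make_hadoop_dirs is a hypothetical helper, and the demonstration targets a scratch directory so it can be tried without root access.

```shell
# Create the local directories referenced by core-site.xml, hdfs-site.xml and mapred-site.xml.
make_hadoop_dirs() {
    # $1 = root directory (the article uses /data)
    for dir in hadoop/tmp hadoop/dfs/name hadoop/dfs/data hadoop/var; do
        mkdir -p "$1/$dir" || return 1
    done
}

# Demonstrated against a scratch directory; inside the container call: make_hadoop_dirs /data
scratch=$(mktemp -d)
make_hadoop_dirs "$scratch"
find "$scratch" -type d | sort
```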
core-site.xml, located in /opt/hadoop-2.7.3/etc/hadoop
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://172.17.0.3:9000</value> <!-- 172.17.0.3 is the container IP address; a hostname can be used instead -->
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop/tmp</value> <!-- local path -->
</property>
</configuration>
hdfs-site.xml, located in /opt/hadoop-2.7.3/etc/hadoop
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/data/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml, located in /opt/hadoop-2.7.3/etc/hadoop
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>172.17.0.3:9001</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/data/hadoop/var</value>
</property>
</configuration>
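Starting Hadoop after configuration
Startup procedure
1. Format the filesystem first: bin/hdfs namenode -format
2. Start the NameNode: sbin/hadoop-daemon.sh start namenode
3. Start the DataNode: sbin/hadoop-daemon.sh start datanode
Note: steps 2 and 3 can be combined into sbin/start-all.sh, which also starts the YARN daemons
Verification after startup: jps should normally show five Hadoop processes (plus Jps itself)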
1056 NameNode
1712 ResourceManager
28532 Jps
2059 NodeManager
1501 SecondaryNameNode
1262 DataNode
Web UI verification
http://10.12.33.200:50071/dfshealth.html#tab-overview
http://10.12.33.200:8900/cluster # note that 8900 is the host port; the container port is 8099
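The jps check described above (five daemons expected) can be scripted so that a missing process is named explicitly. This is a sketch, not part of Hadoop: missing_daemons is a hypothetical helper that reads jps-style output from standard input, and the sample is the jps listing shown above.

```shell
# Print the expected Hadoop daemons that are absent from jps-style output on stdin.
missing_daemons() (
    procs=$(cat)
    missing=""
    for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Compare the second column exactly, so SecondaryNameNode never counts as NameNode.
        printf '%s\n' "$procs" |
            awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' ||
            missing="$missing $d"
    done
    printf '%s\n' "${missing# }"
)

# Sample copied from the jps listing above.
sample='1056 NameNode
1712 ResourceManager
28532 Jps
2059 NodeManager
1501 SecondaryNameNode
1262 DataNode'

result=$(printf '%s\n' "$sample" | missing_daemons)
if [ -z "$result" ]; then
    echo "all five daemons are running"
else
    echo "missing: $result"
fi
```

Inside the container the live form would be: jps | missing_daemons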
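Filesystem test
[root@e8d02f7baae1 ~]# hdfs dfs -mkdir /log # create a log directory; the HDFS root exists by default and every file lives under it; these HDFS paths are unrelated to the local paths set in hdfs-site.xml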
[root@e8d02f7baae1 ~]# hdfs dfs -ls /log
[root@e8d02f7baae1 ~]# hdfs dfs -ls /
Found 3 items
drwxr-xr-x - root supergroup 0 2021-02-18 07:26 /data
drwxr-xr-x - root supergroup 0 2021-02-20 08:42 /log
drwx------ - root supergroup 0 2021-02-07 09:01 /tmp
[root@e8d02f7baae1 ~]# ls
anaconda-ks.cfg anaconda-post.log original-ks.cfg
[root@e8d02f7baae1 ~]# hdfs dfs -put anaconda-post.log /log
[root@e8d02f7baae1 ~]# hdfs dfs -ls /log
Found 1 items
-rw-r--r-- 1 root supergroup 435 2021-02-20 08:43 /log/anaconda-post.log
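This completes the software installation. Next, copy the application into the container and, once it tests successfully, package the container for production use.
Packaging the container
Create a new image; recsys is the name of the new image
[root@AIOps-1 ~]# docker commit -a "chen" -m "recommendation system" recSys recsys:1.0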
sha256:250b235dcdfa5fd20e1923524eaa6b82af4aa7f21dfae88a147d5397597d5bf3
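Export the new image to an archive (docker save needs the image name as an argument)
docker save -o recsys.tar recsys:1.0
recsys.tar is the packaged image. Copy it to the customer environment or to another machine and use the load command to turn it back into a usable image.
docker load --input recsys.tar
Summary
The steps above deploy a big data environment in a container: a Java + Spark + Hadoop + Scala stack with our own application installed on top, packaged so that the container can be moved to production. Because the project is private, the part that deploys our own application service is omitted.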
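FAQ: problems encountered
Problem 1: hadoop version cannot find the Java path. The fix is to configure the JDK environment (JAVA_HOME).
[root@d9243e9eccf1 sbin]# hadoop version
Error: JAVA_HOME is not set and could not be found.
Problem 2: starting the Hadoop filesystem fails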
[root@d9243e9eccf1 hadoop-2.7.3]# sbin/start-dfs.sh
Starting namenodes on [d9243e9eccf1]
d9243e9eccf1: /opt/hadoop-2.7.3/sbin/slaves.sh: line 60: ssh: command not found
localhost: /opt/hadoop-2.7.3/sbin/slaves.sh: line 60: ssh: command not found
Starting secondary namenodes [0.0.0.0]
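0.0.0.0: /opt/hadoop-2.7.3/sbin/slaves.sh: line 60: ssh: command not found
Problem 3: the container cannot ssh to other servers. The cause is that the ssh client package openssh-clients is not installed, so the ssh command is unavailable.
[root@d9243e9eccf1 ssh]# ssh # ssh client not installed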
bash: ssh: command not found
[root@d9243e9eccf1 ssh]# ps -ef | grep sshd # the sshd process is running normally
root 2153 0 0 10:41 ? 00:00:00 /usr/sbin/sshd
root 2179 18 0 10:47 ? 00:00:00 grep --color=auto sshd
[root@AIOps-1 opt]# ssh root@172.17.0.3 # once the ssh client is installed, ssh works
root@172.17.0.3's password:
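The three problems above come down to two missing prerequisites, JAVA_HOME and the ssh client, so they can be checked before Hadoop is started. The following is a sketch only: preflight is a hypothetical helper, and it tests just these two conditions.

```shell
# Check the prerequisites behind the FAQ entries before starting Hadoop.
preflight() {
    rc=0
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        rc=1
    elif [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no java binary under $JAVA_HOME/bin" >&2
        rc=1
    fi
    if ! command -v ssh >/dev/null 2>&1; then
        echo "ssh client not found (install openssh-clients)" >&2
        rc=1
    fi
    return "$rc"
}

if preflight; then
    echo "preflight ok"
else
    echo "preflight failed"
fi
```

Running it in the container before sbin/start-dfs.sh surfaces the cause directly instead of the less obvious errors from slaves.sh.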