Spark Cluster Setup and Testing (DT大数据梦工厂)
The two popular file storage options for Spark: 1. Hadoop's HDFS; 2. S3 cloud storage.
YARN + HDFS is the trend for the next 3 to 5 years.
(Note: since you are using bash, Ubuntu's bash may not source /etc/profile automatically, so try putting the export commands in ~/.bashrc instead.)
If the compute cluster and the data-storage cluster are not the same cluster, the performance hit is unacceptable; YARN (written in Java) solves this by scheduling computation onto the nodes that hold the data.
Environment: Ubuntu.
To enable root login, see http://jingyan.baidu.com/article/148a1921a06bcb4d71c3b1af.html
1. Install SSH:
sudo apt-get install ssh
sudo apt-get install rsync   # rsync will be used later by the cluster
2. Install Java
sudo apt-get install rpm   # install rpm support first, so the JDK rpm can be installed
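A minimal sketch of installing the Oracle JDK from its rpm; the rpm file name is an assumption, and it is assumed to unpack into the /usr/java/jdk1.8.0_71 path used later in this post:
rpm -ivh jdk-8u71-linux-x64.rpm        # assumed rpm name; installs under /usr/java/jdk1.8.0_71
export JAVA_HOME=/usr/java/jdk1.8.0_71
export PATH=$JAVA_HOME/bin:$PATH
java -version                          # should report 1.8.0_71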
3. Generate an SSH key
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa   # generate the key pair into ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys   # append the public key to authorized_keys
Then verify passwordless login:
[email protected]:~# ssh localhost
Welcome to Ubuntu 14.04.2 LTS (GNU/Linux 3.16.0-30-generic x86_64)
* Documentation: https://help.ubuntu.com/
429 packages can be updated.
220 updates are security updates.
Last login: Mon Feb 1 23:52:43 2016 from localhost
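For a multi-node cluster the Master's public key also has to reach every worker. A sketch, where the worker hostnames Worker1 and Worker2 are assumptions:
ssh-copy-id -i ~/.ssh/id_dsa.pub root@Worker1   # assumed worker hostname
ssh-copy-id -i ~/.ssh/id_dsa.pub root@Worker2
ssh Worker1   # should now log in without a password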
4. vim /etc/hostname   # set this machine's hostname (e.g. Master)
vim /etc/hosts   # map the cluster hostnames to IP addresses
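A minimal /etc/hosts sketch for one Master and two workers; the IP addresses and the Worker1/Worker2 names are assumptions, replace them with your own:
127.0.0.1       localhost
192.168.1.100   Master
192.168.1.101   Worker1
192.168.1.102   Worker2
The same file should be copied to every node so that all machines resolve the names identically.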
5. Install Hadoop
Edit the configuration files under hadoop/etc/hadoop (cd into that directory):
File: core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value> <!-- the default file system -->
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/hadoop-2.6.0/tmp</value> <!-- temporary data; matters especially when developing with Eclipse -->
  </property>
  <property>
    <name>hadoop.native.lib</name> <!-- beginners need not worry about this one -->
    <value>true</value>
    <description>Should native hadoop libraries, if present, be used</description>
  </property>
</configuration>
File: hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value> <!-- number of block replicas to store -->
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Master:50090</value> <!-- Master/Slave layout; the secondary namenode checkpoints the namespace image -->
    <description>The secondary namenode http server address and port</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/hadoop-2.6.0/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/hadoop-2.6.0/dfs/data</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///usr/local/hadoop/hadoop-2.6.0/dfs/namesecondary</value>
    <description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.</description>
  </property>
</configuration>
The compute model:
File: mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value> <!-- run MapReduce on YARN, a fundamental difference from Hadoop 1 -->
  </property>
</configuration>
File: yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>Master</value> <!-- the ResourceManager address; still a single point of failure, which is a risk -->
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
File: hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/usr/java/jdk1.8.0_71
6. Configure the Hadoop environment variables (add to /etc/profile, then source it):
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
. /etc/profile
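Per the note near the top about Ubuntu's bash not sourcing /etc/profile automatically, a sketch of putting the same exports into ~/.bashrc instead (values as above):
cat >> ~/.bashrc <<'EOF'
export JAVA_HOME=/usr/java/jdk1.8.0_71
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
EOF
source ~/.bashrc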
7. Check the slaves file
[email protected]:/usr/local/hadoop-2.6.0/etc/hadoop# vi slaves
[email protected]:/usr/local/hadoop-2.6.0/etc/hadoop# pwd
/usr/local/hadoop-2.6.0/etc/hadoop
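For reference, the slaves file simply lists one DataNode/NodeManager hostname per line; the names below are assumptions matching the /etc/hosts sketch earlier:
Worker1
Worker2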
The instructor's machine:
8. Format the file system (first sync the masters and slaves files to the workers, then format):
[email protected]:/usr/local/hadoop-2.6.0/etc/hadoop# scp masters [email protected]:/usr/local/hadoop-2.6.0/etc/hadoop
masters 100% 7 0.0KB/s 00:00
[email protected]:/usr/local/hadoop-2.6.0/etc/hadoop# scp masters [email protected]:/usr/local/hadoop-2.6.0/etc/hadoop
masters 100% 7 0.0KB/s 00:00
[email protected]:/usr/local/hadoop-2.6.0/etc/hadoop# scp slaves [email protected]:/usr/local/hadoop-2.6.0/etc/hadoop
slaves 100% 23 0.0KB/s 00:00
[email protected]:/usr/local/hadoop-2.6.0/etc/hadoop# scp slaves [email protected]:/usr/local/hadoop-2.6.0/etc/hadoop
slaves 100% 23 0.0KB/s 00:00
hdfs namenode -format
9. Start HDFS
[email protected]:/usr/local/hadoop-2.6.0/sbin# pwd
/usr/local/hadoop-2.6.0/sbin
[email protected]:/usr/local/hadoop-2.6.0/sbin# ./start-dfs.sh
If startup fails after changing /etc/hosts, remember to copy the updated hosts file to the other nodes as well (scp, as above).
[email protected]:~# jps   # jps lists the running Java processes
8290 NameNode
8723 Jps
8403 DataNode
8590 SecondaryNameNode
http://Master:50070/dfshealth.html shows the HDFS status.
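A quick smoke test that HDFS is writable; the test path below is arbitrary:
hdfs dfs -mkdir -p /tmp/smoketest
hdfs dfs -put /etc/hostname /tmp/smoketest/
hdfs dfs -ls /tmp/smoketest
hdfs dfs -rm -r /tmp/smoketest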
10. Start the resource management framework (YARN)
cd /usr/local/hadoop-2.6.0/sbin
./start-yarn.sh
[email protected]:/usr/local/hadoop-2.6.0/sbin# jps
8960 NodeManager
8290 NameNode
8403 DataNode
9256 Jps
8843 ResourceManager
8590 SecondaryNameNode
http://master:8088/cluster shows the YARN cluster UI.
////////////// From Spark's point of view, this is about as much Hadoop setup as we need ///////////////
11. Install spark-1.6.0-bin-hadoop2.6.tgz
tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz
Extract it into /usr/local.
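A consolidated sketch of the extraction plus the three config-template copies referred to below (the .template file names are as shipped with Spark 1.6):
tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz -C /usr/local
cd /usr/local/spark-1.6.0-bin-hadoop2.6/conf
cp spark-env.sh.template spark-env.sh
cp slaves.template slaves
cp spark-defaults.conf.template spark-defaults.conf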
Copy spark-env.sh from its template and add:
export JAVA_HOME=/usr/java/jdk1.8.0_71
export SCALA_HOME=/usr/local/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_CONF_DIR=/usr/local/hadoop-2.6.0/etc/hadoop   # required when running on YARN
export SPARK_MASTER_IP=Master   # IP/hostname the Spark master runs on
export SPARK_WORKER_MEMORY=2g   # memory a worker can hand out on each machine
export SPARK_EXECUTOR_MEMORY=2g   # memory per executor for computation
export SPARK_DRIVER_MEMORY=2g
export SPARK_WORKER_CORES=8   # number of concurrent task threads per worker
Copy the slaves file from its template and list the worker hostnames in it, one per line.
Configure the system environment variables:
export SPARK_HOME=/usr/local/spark-1.6.0-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:${SPARK_HOME}/sbin:$PATH
scp ~/.bashrc [email protected]:~/.bashrc
After editing, sync it to the other machines in the same way (scp, as above).
Copy spark-defaults.conf from its template:
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true   # records information about each application run
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.eventLog.enabled true
spark.eventLog.dir hdfs://Master:9000/historyserverforSpark
spark.yarn.historyServer.address Master:18080
spark.history.fs.logDirectory hdfs://Master:9000/historyserverforSpark
#spark.default.parallelism 100
Then copy Spark to all the other machines; the path must be identical on every machine.
scp -r ./spark-1.6.0-bin-hadoop2.6/ [email protected]:/usr/local/spark
Go to Spark's sbin directory.
Before the first run, create the event-log directory on HDFS:
hadoop dfs -rmr /historyserverforSpark
hadoop dfs -mkdir /historyserverforSpark
Now the directory is there:
Start Spark:
./start-all.sh
Check the web console:
http://master:8080/
Then start the history server:
./start-history-server.sh
cd ../bin
http://spark.apache.org/docs/1.6.0/submitting-applications.html
Example from the official docs:
# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000   # number of parallel tasks (slices)
Adapted to run on our cluster:
./spark-submit --class org.apache.spark.examples.SparkPi --master spark://Master:7077 ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 100
At this point the job also shows up in the web console.
The instructor's machine is powerful, so the job finishes in a moment:
My machine is much weaker:
It seems the problem is not that the cluster cannot run it.
In the evening I added the cluster nodes:
Let's try local mode instead:
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
./spark-submit --class org.apache.spark.examples.SparkPi --master local[8] ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 100
It really did run; looks like I need to sort out the cluster setup properly.
Allocating resources when the application starts is called coarse-grained (Coarse Grained) scheduling.
Allocating resources while the job is computing and reclaiming them as soon as they are used is called fine-grained scheduling.
spark-shell demo:
./spark-shell --master spark://Master:7077
./spark-shell --master local[8]
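The wordcount below reads /library/wordcount/input/Data from HDFS, so that path has to exist first; a minimal sketch, where using Spark's README.md as the sample input is just an assumption for illustration:
hdfs dfs -mkdir -p /library/wordcount/input
hdfs dfs -put /usr/local/spark-1.6.0-bin-hadoop2.6/README.md /library/wordcount/input/Data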
sc.textFile("/library/wordcount/input/Data").flatMap(_.split(" ")).map(word=>(word,1)).reduceByKey(_+_).map(pair=>(pair._2,pair._1)).sortByKey(false).map(pair=>(pair._1,pair._2)).saveAsTextFile("/library/wordcount/output/dt_spark_clicked1")
// The trick above: swap key and value, sort by key (the count) in descending order via sortByKey(false), then swap back, so the words end up sorted by their counts.
sc.textFile("/library/wordcount/input/Data").flatMap(_.split(" ")).map(word=>(word,1)).reduceByKey(_+_).map(pair=>(pair._2,pair._1)).sortByKey(false,1).map(pair=>(pair._1,pair._2)).saveAsTextFile("/library/wordcount/output/dt_spark_clicked2") // sortByKey(false,1) uses a single partition, so the output is written as a single file
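To inspect the result, a sketch that reads the single-partition output written by the second command:
hdfs dfs -ls /library/wordcount/output/dt_spark_clicked2
hdfs dfs -cat /library/wordcount/output/dt_spark_clicked2/part-00000 | head -20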
Finally, Hadoop's built-in wordcount:
hadoop jar hadoop-examples-1.2.1.jar wordcount input output
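On Hadoop 2.6 the bundled examples jar lives under share/hadoop/mapreduce rather than the 1.2.1 jar named above; a sketch, assuming the default layout and reusing the wordcount input from the Spark demo:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /library/wordcount/input /library/wordcount/output_mr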
Homework:
Set up your own Spark cluster (or use local mode) and take screenshots of it running.
This post is from the "一枝花傲寒" blog; please do not repost.