编写Spark的WordCount程序并提交到集群运行[含scala和java两个版本]

Posted 十光年

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了编写Spark的WordCount程序并提交到集群运行[含scala和java两个版本]相关的知识,希望对你有一定的参考价值。

编写Spark的WordCount程序并提交到集群运行[含scala和java两个版本]

1. 开发环境

Jdk 1.7.0_72
Maven 3.2.1
Scala 2.10.6
Spark 1.6.2
Hadoop 2.6.4
IntelliJ IDEA 2016.1.1

 

2. 创建项目
1) 新建Maven项目

 

 

2) 在pom文件中导入依赖
pom.xml文件内容如下:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.xuebusi</groupId>
    <artifactId>spark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <encoding>UTF-8</encoding>

        <!-- 这里对jar包版本做集中管理 -->
        <scala.version>2.10.6</scala.version>
        <spark.version>1.6.2</spark.version>
        <hadoop.version>2.6.4</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <!-- scala语言核心包 -->
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <!-- spark核心包 -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <!-- hadoop的客户端,用于访问HDFS -->
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-make:transitive</arg>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <!-- 由于我们的程序可能有很多,所以这里可以不用指定main方法所在的类名,我们可以在提交spark程序的时候手动指定要调用那个main方法 -->
                            <!--
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.xuebusi.spark.WordCount</mainClass>
                                </transformer>
                            </transformers>
                            -->

                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

 


虽然我们的pom文件中的jar包依赖准备好了,但是在Project的External Libraries缺少Maven依赖:

 

需要点击右侧的Maven Project侧边栏中的刷新按钮,才会导入Maven依赖,前提是保证电脑能够联网,Maven可能会到中央仓库下载一些依赖:

 

在左侧的Project侧边栏中的External Libraries下就可以看到新导入的Maven依赖了:

 

但是在pom.xml文件中还有错误提示,因为src/main/和src/test/这两个目录下面没有scala目录:

 

分别在main和test目录之上点击鼠标右键选择new->Directory创建scala目录:

 

由于新创建的scala文件夹前面的图标颜色和java文件夹不一样,我们需要再次点击右侧Maven Project侧边栏中的刷新按钮,其颜色就会发生变化:

 

在scala目录下面创建WordCount(类型为Object):

 

 

3. 编写WordCount程序
下面是使用scala语言编写的Spark的一个简单的单词计数程序:

package com.xuebusi.spark

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by SYJ on 2017/1/23.
  */
object WordCount {
  def main(args: Array[String]) {
    //创建SparkConf
    val conf: SparkConf = new SparkConf()
    //创建SparkContext
    val sc: SparkContext = new SparkContext(conf)
    //从文件读取数据
    val lines: RDD[String] = sc.textFile(args(0))
    //按空格切分单词
    val words: RDD[String] = lines.flatMap(_.split(" "))
    //单词计数,每个单词每出现一次就计数为1
    val wordAndOne: RDD[(String, Int)] = words.map((_, 1))
    //聚合,统计每个单词总共出现的次数
    val result: RDD[(String, Int)] = wordAndOne.reduceByKey(_+_)
    //排序,根据单词出现的次数排序
    val fianlResult: RDD[(String, Int)] = result.sortBy(_._2, false)
    //将统计结果保存到文件
    fianlResult.saveAsTextFile(args(1))
    //释放资源
    sc.stop()
  }
}

 


4. 打包
将编写好的WordCount程序使用Maven插件打成jar包,打包的时候也要保证电脑能够联网,因为Maven可能会到中央仓库中下载一些依赖:

 

 

 

在jar包名称上面点击鼠标右键选择“Copy Path”,得到jar包在Windows磁盘上的绝对路径:D:\\bigdatacode\\xbs-spark\\target\\spark-1.0-SNAPSHOT.jar,在下面上传jar包时会用到此路径。

5. 上传jar包
使用SecureCRT工具连接Spark集群服务器,将spark-1.0-SNAPSHOT.jar上传到服务器:

 

6. 同步时间

date -s "2017-01-23 19:19:30"

 

7. 启动Zookeeper

/root/apps/zookeeper/bin/zkServer.sh start

 

8. 启动hdfs

/root/apps/hadoop/sbin/start-dfs.sh

 

HDFS的活跃的NameNode节点:

 

HDFS的备选NameNode节点:

 

9. 启动Spark集群

/root/apps/spark/sbin/start-all.sh

 

启动单个Master进程使用如下命令:

/root/apps/spark/sbin/start-master.sh

 

Spark活跃的Master节点:

 

Spark的备选Master节点:

 

10. 准备输入数据

 

11. 提交Spark程序

提交Spark的WordCount程序需要两个参数,一个输入目录,一个输出目录,首先确定输出目录不存在,如果存在则删除:

hdfs dfs -rm -r /wordcount/output

 

使用spark-submit脚本提交spark程序:

/root/apps/spark/bin/spark-submit \\
--master spark://hadoop01:7077,hadoop02:7077 \\
--executor-memory 512m \\
--total-executor-cores 7 \\
--class com.xuebusi.spark.WordCount /root/spark-1.0-SNAPSHOT.jar hdfs://hadoop01:9000/wordcount/input hdfs://hadoop01:9000/wordcount/output

 

通过Spark的UI界面来观察程序执行过程:

 

 

 

12. 查看结果

 


附1:程序打包日志

 

  1 D:\\java\\jdk1.7.0_72\\bin\\java -Dmaven.home=D:\\apache-maven-3.2.1 -Dclassworlds.conf=D:\\apache-maven-3.2.1\\bin\\m2.conf -Didea.launcher.port=7533 "-Didea.launcher.bin.path=D:\\java\\IntelliJ_IDEA\\IntelliJ IDEA Community Edition 2016.1.1\\bin" -Dfile.encoding=UTF-8 -classpath "D:\\apache-maven-3.2.1\\boot\\plexus-classworlds-2.5.1.jar;D:\\java\\IntelliJ_IDEA\\IntelliJ IDEA Community Edition 2016.1.1\\lib\\idea_rt.jar" com.intellij.rt.execution.application.AppMain org.codehaus.classworlds.Launcher -Didea.version=2016.1.1 package
  2 [INFO] Scanning for projects...
  3 [INFO] 
  4 [INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1
  5 [INFO]                                                                         
  6 [INFO] ------------------------------------------------------------------------
  7 [INFO] Building spark 1.0-SNAPSHOT
  8 [INFO] ------------------------------------------------------------------------
  9 [INFO] 
 10 [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ spark ---
 11 [INFO] Using \'UTF-8\' encoding to copy filtered resources.
 12 [INFO] Copying 0 resource
 13 [INFO] 
 14 [INFO] --- maven-compiler-plugin:2.5.1:compile (default-compile) @ spark ---
 15 [INFO] Nothing to compile - all classes are up to date
 16 [INFO] 
 17 [INFO] --- scala-maven-plugin:3.2.2:compile (default) @ spark ---
 18 [WARNING]  Expected all dependencies to require Scala version: 2.10.6
 19 [WARNING]  com.xuebusi:spark:1.0-SNAPSHOT requires scala version: 2.10.6
 20 [WARNING]  com.twitter:chill_2.10:0.5.0 requires scala version: 2.10.4
 21 [WARNING] Multiple versions of scala libraries detected!
 22 [INFO] Nothing to compile - all classes are up to date
 23 [INFO] 
 24 [INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ spark ---
 25 [INFO] Using \'UTF-8\' encoding to copy filtered resources.
 26 [INFO] skip non existing resourceDirectory D:\\bigdatacode\\spark-wordcount\\src\\test\\resources
 27 [INFO] 
 28 [INFO] --- maven-compiler-plugin:2.5.1:testCompile (default-testCompile) @ spark ---
 29 [INFO] Nothing to compile - all classes are up to date
 30 [INFO] 
 31 [INFO] --- scala-maven-plugin:3.2.2:testCompile (default) @ spark ---
 32 [WARNING]  Expected all dependencies to require Scala version: 2.10.6
 33 [WARNING]  com.xuebusi:spark:1.0-SNAPSHOT requires scala version: 2.10.6
 34 [WARNING]  com.twitter:chill_2.10:0.5.0 requires scala version: 2.10.4
 35 [WARNING] Multiple versions of scala libraries detected!
 36 [INFO] No sources to compile
 37 [INFO] 
 38 [INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ spark ---
 39 Downloading: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/surefire-booter/2.12.4/surefire-booter-2.12.4.pom
 40 Downloaded: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/surefire-booter/2.12.4/surefire-booter-2.12.4.pom (3 KB at 1.7 KB/sec)
 41 Downloading: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/surefire-api/2.12.4/surefire-api-2.12.4.pom
 42 Downloaded: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/surefire-api/2.12.4/surefire-api-2.12.4.pom (3 KB at 2.4 KB/sec)
 43 Downloading: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/maven-surefire-common/2.12.4/maven-surefire-common-2.12.4.pom
 44 Downloaded: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/maven-surefire-common/2.12.4/maven-surefire-common-2.12.4.pom (6 KB at 3.2 KB/sec)
 45 Downloading: http://repo.maven.apache.org/maven2/org/apache/maven/plugin-tools/maven-plugin-annotations/3.1/maven-plugin-annotations-3.1.pom
 46 Downloaded: http://repo.maven.apache.org/maven2/org/apache/maven/plugin-tools/maven-plugin-annotations/3.1/maven-plugin-annotations-3.1.pom (2 KB at 1.7 KB/sec)
 47 Downloading: http://repo.maven.apache.org/maven2/org/apache/maven/plugin-tools/maven-plugin-tools/3.1/maven-plugin-tools-3.1.pom
 48 Downloaded: http://repo.maven.apache.org/maven2/org/apache/maven/plugin-tools/maven-plugin-tools/3.1/maven-plugin-tools-3.1.pom (16 KB at 12.0 KB/sec)
 49 Downloading: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/maven-surefire-common/2.12.4/maven-surefire-common-2.12.4.jar
 50 Downloading: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/surefire-api/2.12.4/surefire-api-2.12.4.jar
 51 Downloading: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/surefire-booter/2.12.4/surefire-booter-2.12.4.jar
 52 Downloading: http://repo.maven.apache.org/maven2/org/apache/maven/plugin-tools/maven-plugin-annotations/3.1/maven-plugin-annotations-3.1.jar
 53 Downloaded: http://repo.maven.apache.org/maven2/org/apache/maven/plugin-tools/maven-plugin-annotations/3.1/maven-plugin-annotations-3.1.jar (14 KB at 10.6 KB/sec)
 54 Downloaded: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/surefire-booter/2.12.4/surefire-booter-2.12.4.jar (34 KB at 21.5 KB/sec)
 55 Downloaded: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/maven-surefire-common/2.12.4/maven-surefire-common-2.12.4.jar (257 KB at 161.0 KB/sec)
 56 Downloaded: http://repo.maven.apache.org/maven2/org/apache/maven/surefire/surefire-api/2.12.4/surefire-api-2.12.4.jar (115 KB at 55.1 KB/sec)
 57 [INFO] No tests to run.
 58 [INFO] 
 59 [INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ spark ---
 60 [INFO] Building jar: D:\\bigdatacode\\spark-wordcount\\target\\spark-1.0-SNAPSHOT.jar
 61 [INFO] 
 62 [INFO] --- maven-shade-plugin:2.4.3:shade (default) @ spark ---
 63 [INFO] Including org.scala-lang:scala-library:jar:2.10.6 in the shaded jar.
 64 [INFO] Including org.apache.spark:spark-core_2.10:jar:1.6.2 in the shaded jar.
 65 [INFO] Including org.apache.avro:avro-mapred:jar:hadoop2:1.7.7 in the shaded jar.
 66 [INFO] Including org.apache.avro:avro-ipc:jar:1.7.7 in the shaded jar.
 67 [INFO] Including org.apache.avro:avro-ipc:jar:tests:1.7.7 in the shaded jar.
 68 [INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.13 in the shaded jar.
 69 [INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13 in the shaded jar.
 70 [INFO] Including com.twitter:chill_2.10:jar:0.5.0 in the shaded jar.
 71 [INFO] Including com.esotericsoftware.kryo:kryo:jar:2.21 in the shaded jar.
 72 [INFO] Including com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07 in the shaded jar.
 73 [INFO] Including com.esotericsoftware.minlog:minlog:jar:1.2 in the shaded jar.
 74 [INFO] Including org.objenesis:objenesis:jar:1.2 in the shaded jar.
 75 [INFO] Including com.twitter:chill-java:jar:0.5.0 in the shaded jar.
 76 [INFO] Including org.apache.xbean:xbean-asm5-shaded:jar:4.4 in the shaded jar.
 77 [INFO] Including org.apache.spark:spark-launcher_2.10:jar:1.6.2 in the shaded jar.
 78 [INFO] Including org.apache.spark:spark-network-common_2.10:jar:1.6.2 in the shaded jar.
 79 [INFO] Including org.apache.spark:spark-network-shuffle_2.10:jar:1.6.2 in the shaded jar.
 80 [INFO] Including org.fusesource.leveldbjni:leveldbjni-all:jar:1.8 in the shaded jar.
 81 [INFO] Including com.fasterxml.jackson.core:jackson-annotations:jar:2.4.4 in the shaded jar.
 82 [INFO] Including org.apache.spark:spark-unsafe_2.10:jar:1.6.2 in the shaded jar.
 83 [INFO] Including net.java.dev.jets3t:jets3t:jar:0.7.1 in the shaded jar.
 84 [INFO] Including commons-codec:commons-codec:jar:1.3 in the shaded jar.
 85 [INFO] Including commons-httpclient:commons-httpclient:jar:3.1 in the shaded jar.
 86 [INFO] Including org.apache.curator:curator-recipes:jar:2.4.0 in the shaded jar.
 87 [INFO] Including org.apache.curator:curator-framework:jar:2.4.0 in the shaded jar.
 88 [INFO] Including org.apache.zookeeper:zookeeper:jar:3.4.5 in the shaded jar.
 89 [INFO] Including jline:jline:jar:0.9.94 in the shaded jar.
 90 [INFO] Including com.google.guava:guava:jar:14.0.1 in the shaded jar.
 91 [INFO] Including org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016 in the shaded jar.
 92 [INFO] Including org.apache.commons:commons-lang3:jar:3.3.2 in the shaded jar.
 93 [INFO] Including org.apache.commons:commons-math3:jar:3.4.1 in the shaded jar.
 94 [INFO] Including com.google.code.findbugs:jsr305:jar:1.3.9 in the shaded jar.
 95 [INFO] Including org.slf4j:slf4j-api:jar:1.7.10 in the shaded jar.
 96 [INFO] Including org.slf4j:jul-to-slf4j:jar:1.7.10 in the shaded jar.
 97 [INFO] Including org.slf4j:jcl-over-slf4j:jar:1.7.10 in the shaded jar.
 98 [INFO] Including log4j:log4j:jar:1.2.17 in the shaded jar.
 99 [INFO] Including org.slf4j:slf4j-log4j12:jar:1.7.10 in the shaded jar.
100 [INFO] Including com.ning:compress-lzf:jar:1.0.3 in the shaded jar.
101 [INFO] Including org.xerial.snappy:snappy-java:jar:1.1.2.1 in the shaded jar.
102 [INFO] Including net.jpountz.lz4:lz4:jar:1.3.0 in the shaded jar.
103 [INFO] Including org.roaringbitmap:RoaringBitmap:jar:0.5.11 in the shaded jar.
104 [INFO] Including commons-net:commons-net:jar:2.2 in the shaded jar.
105 [INFO] Including com.typesafe.akka:akka-remote_2.10:jar:2.3.11 in the shaded jar.
106 [INFO] Including com.typesafe.akka:akka-actor_2.10:jar:2.3.11 in the shaded jar.
107 [INFO] Including com.typesafe:config:jar:1.2.1 in the shaded jar.
108 [INFO] Including io.netty:netty:jar:3.8.0.Final in the shaded jar.
109 [INFO] Including com.google.protobuf:protobuf-java:jar:2.5.0 in the shaded jar.
110 [INFO] Including org.uncommons.maths:uncommons-maths:jar:1.2.2a in the shaded jar.
111 [INFO] Including com.typesafe.akka:akka-slf4j_2.10:jar:2.3.11 in the shaded jar.
112 [INFO] Including org.json4s:json4s-jackson_2.10:jar:3.2.10 in the shaded jar.
113 [INFO] Including org.json4s:json4s-core_2.10:jar:3.2.10 in the shaded jar.
114 [INFO] Including org.json4s:json4s-ast_2.10:jar:3.2.10 in the shaded jar.
115 [INFO] Including org.scala-lang:scalap:jar:2.10.0 in the shaded jar.
116 [INFO] Including org.scala-lang:scala-compiler:jar:2.10.0 in the shaded jar.
117 [INFO] Including com.sun.jersey:jersey-server:jar:1.9 in the shaded jar.
118 [INFO] Including asm:asm:jar:3.1 in the shaded jar.
119 [INFO] Including com.sun.jersey:jersey-core:jar:1.9 in the shaded jar.
120 [INFO] Including org.apache.mesos:mesos:jar:shaded-protobuf:0.21.1 in the shaded jar.
121 [INFO] Including io.netty:netty-all:jar:4.0.29.Final in the shaded jar.
122 [INFO] Including com.clearspring.analytics:stream:jar:2.7.0 in the shaded jar.
123 [INFO] Including io.dropwizard.metrics:metrics-core:jar:3.1.2 in the shaded jar.
124 [INFO] Including io.dropwizard.metrics:metrics-jvm:jar:3.1.Spark在Yarn上运行Wordcount程序

Flink 本地运行 Web UI

Flink 本地运行 Web UI

Git 创建新分支并提交到远程仓

大数据Spark入门以及集群搭建

git 创建分支并提交到远程