Basic usage of Spark + Hive on CDH 5.13.1

Posted by 徐大嘴


  • Using Spark

    Available terminals: cmdn01/cmdn02/cmdn03

    Prerequisites:

    User: bigdata; home directory: /home/bigdata/

    Spark data directory: /home/bigdata/spark

    Hive data directory: /home/bigdata/hive

    Permissions: other users have execute permission on both the bigdata and spark directories

Example: word count (wordCount)

Step 1: Prepare the data

[bigdata@cmdn01 spark]$ more wordcount.txt

company header world

world beijing

computer tailer converse

[bigdata@cmdn01 spark]$

Step 2: Start spark-shell

[bigdata@cmdn01 spark]$ spark-shell

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel).

Welcome to

       ____              __

      / __/__  ___ _____/ /__

     _\ \/ _ \/ _ `/ __/  '_/

    /___/ .__/\_,_/_/ /_/\_\    version 1.6.0

       /_/

 

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)

Type in expressions to have them evaluated.

Type :help for more information.

Spark context available as sc (master = yarn-client, app id = application_1524465049561_0007).

SQL context available as sqlContext.

 

scala>

 

Step 3: Run the word count

Load the file:

scala> val wordcount = sc.textFile("file:///home/bigdata/spark/wordcount.txt")

Count the words:

scala> val counts = wordcount.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

Write the result to HDFS:

scala> counts.saveAsTextFile("/user/bigdata/wordcount")
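To sanity-check the result before (or after) writing it out, the pairs can be printed directly in the shell. A minimal sketch in the same session; the sortBy is only there for readable output and is not part of the original steps:

scala> // print every (word, count) pair, most frequent first
scala> counts.sortBy(_._2, ascending = false).collect().foreach(println)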

 

Step 4: Inspect the result files

[bigdata@cmdn01 spark]$ hdfs dfs -ls /user/bigdata/wordcount

Found 3 items

-rw-r--r--   3 bigdata bigdata          0 2018-04-23 17:02  /user/bigdata/wordcount/_SUCCESS

-rw-r--r--   3 bigdata bigdata         19 2018-04-23 17:02  /user/bigdata/wordcount/part-00000

-rw-r--r--   3 bigdata bigdata          0 2018-04-23 17:02  /user/bigdata/wordcount/part-00001

[bigdata@cmdn01 spark]$ hdfs dfs -cat /user/bigdata/wordcount/part-00000

(word,2)

(world,1)
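The part files can also be read back into the shell as a single RDD; a small sketch, assuming the same session (saveAsTextFile stores each pair as its string form, so the lines print as "(word,count)"):

scala> // textFile on the output directory picks up all part-* files
scala> val saved = sc.textFile("/user/bigdata/wordcount")
scala> saved.collect().foreach(println)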

 

Step 5: View in Hue

URL: http://cmhu01:8888

Log in as bigdata/bigdata.

  • Using Hive

   

Step 1: Start Hive

[bigdata@cmdn01 spark]$ hive

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0

 

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/jars/hive-common-1.1.0-cdh5.13.1.jar!/hive-log4j.properties

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.

hive>

 

Step 2: Create a table

CREATE TABLE test(id int, name string, age string, tel string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/bigdata/hive/test';

Check that the table exists:

hive> show tables;

OK

test

Time taken: 0.482 seconds, Fetched: 1 row(s)

hive> desc test;

OK

id                      int                                        

name                    string                                     

age                     string                                     

tel                     string                                     

Time taken: 0.455 seconds, Fetched: 4 row(s)

hive>

 

Step 3: Prepare the data

[bigdata@cmdn01 hive]$ more test.data

1        world   20      13867981256

2        computer        30      13761294673

3        spxu    23      13761294674

[bigdata@cmdn01 hive]$ hdfs dfs -put test.data /user/bigdata/hive/test/

[bigdata@cmdn01 hive]$ hdfs dfs -ls /user/bigdata/hive/test/

Found 1 items

-rw-r--r--   3 bigdata bigdata         71 2018-04-23 17:44  /user/bigdata/hive/test/test.data

[bigdata@cmdn01 hive]$

 

Step 4: Query the data

hive> select * from test;

OK

1        world   20      13867981256

2        computer        30      13761294673

3        spxu    23      13761294674

Time taken: 0.912 seconds, Fetched: 3 row(s)

hive> select name from test;

OK

world

computer

spxu

Time taken: 1.032 seconds, Fetched: 3 row(s)

hive>
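The same table is also visible from spark-shell through the SQL context shown in the startup banner. A minimal sketch, assuming sqlContext on this CDH install is Hive-aware and points at the same metastore (which is the default once the client configuration is deployed):

scala> // run HiveQL against the test table and print the result as a small table
scala> sqlContext.sql("SELECT id, name FROM test").show()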

 

Step 5: View the data in Hue

URL: http://cmhu01:8888

Log in as bigdata/bigdata.


Problem: spark-shell fails to start

Symptom:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

Resolution:

Comparing the launch scripts against the Spark service installed on Alibaba Cloud shows that the HADOOP_CONF_DIR environment variable differs between the two environments:

Alibaba Cloud environment:

/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/conf/yarn-conf:/etc/hive/conf

Environment under test: /etc/hadoop/conf

The mismatch causes a different classpath to be assembled, so the Hadoop classes are never loaded.

Fix: in Cloudera Manager, deploy a Spark Gateway role on every server that needs to run spark-shell, then deploy the client configuration.

The effective Spark configuration then lives under /etc/spark/:

conf   conf.cloudera.spark_on_yarn

Inside the conf directory are the following files:

classpath.txt   

__cloudera_generation__ 

__cloudera_metadata__ 

log4j.properties   

navigator.lineage.client.properties 

spark-defaults.conf 

spark-env.sh   

yarn-conf
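Once the gateway configuration is deployed, you can confirm from inside spark-shell which Hadoop configuration was actually picked up. A quick check, assuming an open session; fs.defaultFS is a standard Hadoop property and should name the cluster's NameNode rather than a local filesystem:

scala> // shows the filesystem URI taken from the deployed client configuration
scala> sc.hadoopConfiguration.get("fs.defaultFS")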

 

   

