Basic usage of Spark + Hive on CDH 5.13.1

Posted by 徐大嘴


  • Using Spark

    Available terminals: cmdn01/cmdn02/cmdn03

    Prerequisites:

    User: bigdata; home directory: /home/bigdata/

    Spark data directory: /home/bigdata/spark

    Hive data directory: /home/bigdata/hive

    Permissions: other users have execute permission on both the bigdata and spark directories

Example: word count (wordCount)

Step 1: Prepare the data

[bigdata@cmdn01 spark]$ more wordcount.txt

company header world

world beijing

computer tailer converse

[bigdata@cmdn01 spark]$

Step 2: Start spark-shell

[bigdata@cmdn01 spark]$ spark-shell

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel).

Welcome to

       ____              __

      / __/__  ___ _____/ /__

     _\ \/ _ \/ _ `/ __/  '_/

    /___/ .__/\_,_/_/ /_/\_\    version 1.6.0

       /_/

 

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)

Type in expressions to have them evaluated.

Type :help for more information.

Spark context available as sc (master = yarn-client, app id = application_1524465049561_0007).

SQL context available as sqlContext.

 

scala>

 

Step 3: Run the word count

Load the file:

scala> val wordcount = sc.textFile("file:///home/bigdata/spark/wordcount.txt")

Count the words:

scala> val counts = wordcount.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

Write the result to HDFS:

scala> counts.saveAsTextFile("/user/bigdata/wordcount")
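To sanity-check the result before (or after) writing it out, the pairs can be printed directly in the shell. A minimal sketch in the same session; the sortBy is only there for readable output and is not part of the original steps:

scala> // print every (word, count) pair, most frequent first
scala> counts.sortBy(_._2, ascending = false).collect().foreach(println)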

 

Step 4: Inspect the result files

[bigdata@cmdn01 spark]$ hdfs dfs -ls /user/bigdata/wordcount

Found 3 items

-rw-r--r--   3 bigdata bigdata          0 2018-04-23 17:02  /user/bigdata/wordcount/_SUCCESS

-rw-r--r--   3 bigdata bigdata         19 2018-04-23 17:02  /user/bigdata/wordcount/part-00000

-rw-r--r--   3 bigdata bigdata          0 2018-04-23 17:02  /user/bigdata/wordcount/part-00001

[bigdata@cmdn01 spark]$ hdfs dfs -cat /user/bigdata/wordcount/part-00000

(word,2)

(world,1)
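The part files can also be read back into the shell as a single RDD; a small sketch, assuming the same session (saveAsTextFile stores each pair as its string form, so the lines print as "(word,count)"):

scala> // textFile on the output directory picks up all part-* files
scala> val saved = sc.textFile("/user/bigdata/wordcount")
scala> saved.collect().foreach(println)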

 

Step 5: View in Hue

URL: http://cmhu01:8888

Log in as bigdata/bigdata.

  • Using Hive

   

Step 1: Start Hive

[bigdata@cmdn01 spark]$ hive

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0

 

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/jars/hive-common-1.1.0-cdh5.13.1.jar!/hive-log4j.properties

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.

hive>

 

Step 2: Create a table

CREATE TABLE test(id int, name string, age string, tel string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/bigdata/hive/test';

Check that the table exists:

hive> show tables;

OK

test

Time taken: 0.482 seconds, Fetched: 1 row(s)

hive> desc test;

OK

id                      int                                        

name                    string                                     

age                     string                                     

tel                     string                                     

Time taken: 0.455 seconds, Fetched: 4 row(s)

hive>

 

Step 3: Prepare the data

[bigdata@cmdn01 hive]$ more test.data

1        world   20      13867981256

2        computer        30      13761294673

3        spxu    23      13761294674

[bigdata@cmdn01 hive]$ hdfs dfs -put test.data /user/bigdata/hive/test/

[bigdata@cmdn01 hive]$ hdfs dfs -ls /user/bigdata/hive/test/

Found 1 items

-rw-r--r--   3 bigdata bigdata         71 2018-04-23 17:44  /user/bigdata/hive/test/test.data

[bigdata@cmdn01 hive]$

 

Step 4: Query the data

hive> select * from test;

OK

1        world   20      13867981256

2        computer        30      13761294673

3        spxu    23      13761294674

Time taken: 0.912 seconds, Fetched: 3 row(s)

hive> select name from test;

OK

world

computer

spxu

Time taken: 1.032 seconds, Fetched: 3 row(s)

hive>
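The same table is also visible from spark-shell through the SQL context shown in the startup banner. A minimal sketch, assuming sqlContext on this CDH install is Hive-aware and points at the same metastore (which is the default once the client configuration is deployed):

scala> // run HiveQL against the test table and print the result as a small table
scala> sqlContext.sql("SELECT id, name FROM test").show()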

 

Step 5: View the data in Hue

URL: http://cmhu01:8888

Log in as bigdata/bigdata.


Problem: spark-shell fails to start

Symptom:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

Resolution:

Comparing the launch scripts against the Spark service installed on Alibaba Cloud shows that the HADOOP_CONF_DIR environment variable differs between the two environments:

Alibaba Cloud environment:

/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/conf/yarn-conf:/etc/hive/conf

Environment under test: /etc/hadoop/conf

The mismatch causes a different classpath to be assembled, so the Hadoop classes are never loaded.

Fix: in Cloudera Manager, deploy a Spark Gateway role on every server that needs to run spark-shell, then deploy the client configuration.

The effective Spark configuration then lives under /etc/spark/:

conf   conf.cloudera.spark_on_yarn

Inside the conf directory are the following files:

classpath.txt   

__cloudera_generation__ 

__cloudera_metadata__ 

log4j.properties   

navigator.lineage.client.properties 

spark-defaults.conf 

spark-env.sh   

yarn-conf
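Once the gateway configuration is deployed, you can confirm from inside spark-shell which Hadoop configuration was actually picked up. A quick check, assuming an open session; fs.defaultFS is a standard Hadoop property and should name the cluster's NameNode rather than a local filesystem:

scala> // shows the filesystem URI taken from the deployed client configuration
scala> sc.hadoopConfiguration.get("fs.defaultFS")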

 

   

