Basic usage of Spark + Hive in a CDH 5.13.1 environment
Posted by 徐大嘴
Using Spark
Operable terminals: cmdn01 / cmdn02 / cmdn03
Prerequisites:
Operating user: bigdata; home directory: /home/bigdata/
Spark data directory: /home/bigdata/spark
Hive data directory: /home/bigdata/hive
Permissions: other users have execute permission on the bigdata and spark directories
Example: word count (wordCount)
1: Prepare the data
[bigdata@cmdn01 spark]$ more wordcount.txt
company header world world beijing computer tailer converse
[bigdata@cmdn01 spark]$
2: Start spark-shell
[bigdata@cmdn01 spark]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc (master = yarn-client, app id = application_1524465049561_0007).
SQL context available as sqlContext.

scala>
3: Run the following operations
Define the input file:
scala> val wordcount = sc.textFile("file:///home/bigdata/spark/wordcount.txt")
Count the occurrences of each word:
scala> val counts = wordcount.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
Write the result to HDFS:
scala> counts.saveAsTextFile("/user/bigdata/wordcount")
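The counts can also be previewed directly in the shell before (or after) saving; a minimal optional sketch, not part of the original session:
scala> // Preview the word counts sorted by frequency, highest first.
scala> // collect() brings the result to the driver, fine for toy data like this.
scala> counts.sortBy(_._2, ascending = false).collect().foreach(println)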
4: View the result files
[bigdata@cmdn01 spark]$ hdfs dfs -ls /user/bigdata/wordcount
Found 3 items
-rw-r--r--   3 bigdata bigdata          0 2018-04-23 17:02 /user/bigdata/wordcount/_SUCCESS
-rw-r--r--   3 bigdata bigdata         19 2018-04-23 17:02 /user/bigdata/wordcount/part-00000
-rw-r--r--   3 bigdata bigdata          0 2018-04-23 17:02 /user/bigdata/wordcount/part-00001
[bigdata@cmdn01 spark]$ hdfs dfs -cat /user/bigdata/wordcount/part-00000
(word,2)
(world,1)
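The saved part files can also be read back and parsed from the shell; a minimal sketch, assuming the default "(word,count)" tuple formatting shown above:
scala> // Read the saved output back and parse each "(word,count)" line.
scala> val saved = sc.textFile("/user/bigdata/wordcount")
scala> val parsed = saved.map(_.stripPrefix("(").stripSuffix(")").split(",")).map(a => (a(0), a(1).toInt))
scala> parsed.collect().foreach(println)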
5: View the result in Hue
URL: http://cmhu01:8888
Log in as bigdata/bigdata
Using Hive
1: Start Hive
[bigdata@cmdn01 spark]$ hive
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/jars/hive-common-1.1.0-cdh5.13.1.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
2: Create a table
Note that the '\t' field delimiter must match the tab-separated test.data file loaded in step 3.
create table test(id int, name string, age string, tel string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/bigdata/hive/test';
Check that the table exists:
hive> show tables;
OK
test
Time taken: 0.482 seconds, Fetched: 1 row(s)
hive> desc test;
OK
id                      int
name                    string
age                     string
tel                     string
Time taken: 0.455 seconds, Fetched: 4 row(s)
hive>
3: Prepare the data
[bigdata@cmdn01 hive]$ more test.data
1       world     20    13867981256
2       computer  30    13761294673
3       spxu      23    13761294674
[bigdata@cmdn01 hive]$ hdfs dfs -put test.data /user/bigdata/hive/test/
[bigdata@cmdn01 hive]$ hdfs dfs -ls /user/bigdata/hive/test/
Found 1 items
-rw-r--r--   3 bigdata bigdata         71 2018-04-23 17:44 /user/bigdata/hive/test/test.data
[bigdata@cmdn01 hive]$
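Because the table is just tab-delimited text at a fixed HDFS location, the same file can also be parsed directly in spark-shell without going through Hive; a minimal sketch (the Person case class is introduced here for illustration only):
scala> // Parse the raw tab-delimited file from the table's HDFS location.
scala> case class Person(id: Int, name: String, age: String, tel: String)
scala> val people = sc.textFile("/user/bigdata/hive/test").map(_.split("\t")).map(f => Person(f(0).toInt, f(1), f(2), f(3)))
scala> people.collect().foreach(println)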
4: Query the data
hive> select * from test;
OK
1       world     20    13867981256
2       computer  30    13761294673
3       spxu      23    13761294674
Time taken: 0.912 seconds, Fetched: 3 row(s)
hive> select name from test;
OK
world
computer
spxu
Time taken: 1.032 seconds, Fetched: 3 row(s)
hive>
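Since the spark-shell banner above reports "SQL context available as sqlContext" (a HiveContext on CDH builds of Spark 1.6), the same table can also be queried from Spark; a minimal sketch:
scala> // sqlContext reads tables registered in the Hive metastore.
scala> val df = sqlContext.sql("SELECT id, name FROM test")
scala> df.show()
scala> df.rdd.map(_.getString(1)).collect()   // equivalent to "select name from test"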
5: View the data in Hue
URL: http://cmhu01:8888
Log in as bigdata/bigdata
Issue: spark-shell fails to start
Symptom:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
Solution:
Reading the source code and comparing against a working Spark installation on Alibaba Cloud showed that the value of the HADOOP_CONF_DIR environment variable differed:
Alibaba Cloud environment: /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/conf/yarn-conf:/etc/hive/conf
Environment under test: /etc/hadoop/conf
The different values cause different classpaths to be loaded.
Fix: in the Cloudera Manager console, deploy the Spark Gateway role to every server that needs to run spark-shell, then deploy the client configuration. The actual Spark configuration is under /etc/spark/, which contains conf and conf.cloudera.spark_on_yarn. Inside the conf directory are the following files:
classpath.txt
__cloudera_generation__
__cloudera_metadata__
log4j.properties
navigator.lineage.client.properties
spark-defaults.conf
spark-env.sh
yarn-conf
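After redeploying the client configuration, a quick sanity check from a fresh spark-shell confirms that the Hadoop configuration is being picked up (the exact values printed will be cluster-specific):
scala> // Should print the cluster's HDFS URI, not the local filesystem.
scala> sc.hadoopConfiguration.get("fs.defaultFS")
scala> // Should point at the Cloudera-managed Spark client configuration.
scala> sys.env.getOrElse("HADOOP_CONF_DIR", "not set")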