hive

Posted 2020-11-24 whywy

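I. Hive workflow: how Hive interacts with HDFS, YARN, and MR

First: Hive pulls data from HDFS (.txt files).

Second: Hive takes the SQL and looks up the schema (template) information.

※: The schema information consists of three mappings:

A. table to file

B. columns to file contents

C. the delimiter (delimited)

Third: Hive converts the SQL statement into MR jobs.

Fourth: YARN allocates resources to the MR jobs, which run and produce the result data.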
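
The three mappings can be illustrated with a small Python sketch. It is not Hive code: the schema, the tab delimiter, and the sample lines are assumptions made up for the illustration. It only shows how a table definition lets plain text lines be read as typed rows.

```python
# Hypothetical schema: column names and types are illustrative only.
SCHEMA = [("id", int), ("name", str), ("age", int)]
DELIMITER = "\t"  # mapping C: the field delimiter (assumed to be a tab here)


def parse_line(line, schema=SCHEMA, delimiter=DELIMITER):
    """Mapping B: split one line of the file into the table's columns."""
    fields = line.rstrip("\n").split(delimiter)
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


def read_table(lines):
    """Mapping A: the table is simply every line of the backing file."""
    return [parse_line(line) for line in lines]


rows = read_table(["1\ttom\t20\n", "2\tamy\t31\n"])
print(rows)
```

The file itself never changes; only the schema decides how each line is interpreted when it is read.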

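II. Hive basic architecture

1. Hive includes a Web UI, a Console UI, the Thrift Server, the Driver, the Metastore, and so on.

2. Metadata is stored in the metastore, which is backed by a database (embedded Derby by default).

3. Compiler: compiles the SQL into job tasks.

4. Optimizer: optimizes the job tasks, i.e. the computation over a file such as aa.txt.

Distributed cache -> pulls in the data an MR job needs before the job starts.

This reduces I/O cost and saves time.

5. Executor: dispatches the MR job tasks to the Hadoop cluster.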

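※ Compiler

1. Definition: the compiler converts a Hive Query Language statement (SQL) into operators.

2. Operators are Hive's smallest processing unit.

3. Each operator represents either one HDFS operation or one MapReduce job.

4. Hive defines each processing step as an Operator.

5. There are 9 operators, listed below. Hive runs the MapReduce tasks through ExecMapper and ExecReducer.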

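Operator	Description
TableScanOperator	Scans all rows of a table
ReduceSinkOperator (sink: write out)	Sends <K,V> pairs to the reduce side
JoinOperator	Joins two data sets
SelectOperator	Selects the output columns
FileSinkOperator	Writes the result data to the output file
GroupByOperator	Performs grouping
MapJoinOperator	Joins on the map side
LimitOperator	Handles the LIMIT clause
UnionOperator	Handles the UNION clause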

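III. Hive data types

1. Primitive types (9)

TinyInt: one byte; e.g. true/false (0, 1) or gender (0, 1)

SmallInt: two bytes; e.g. age (10, 20, 30, ...)

Int: four bytes

BigInt: eight bytes, like a Java long

Float: single-precision floating point

Double: double-precision floating point

Boolean: true or false

String: character string

Timestamp: a point in time (for example, record one when map goes from 0% to 100% and one when reduce goes from 0% to 100%, and you can work out how long a task ran)

2. Complex types (3)

Struct

Map

Array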

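IV. Installing MySQL for Hive on Hadoop

1. derby.log and metastore_db record the metadata for the current directory. If you start Hive from a different directory, that metadata is no longer visible.

You can store the metadata in MySQL instead. MySQL is shared by every directory, so the metadata becomes global.

※: Metadata: the database and table structure.

※: MySQL installation: put it on a different machine from Hive.

※: Setting up the Hive environment:

A >>> tar -zxvf apache-hive-1.2.0-bin.tar.gz

B >>> vim /etc/profile

export HIVE_HOME=/root/Downloads/apache-hive-1.2.0-bin

export PATH=$PATH:$HIVE_HOME/bin

C >>> vim /root/Downloads/hadoop-2.6.5/etc/hadoop/hadoop-env.sh

export HADOOP_USER_CLASSPATH_FIRST=true

D >>> run hive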

2. Commands for installing MySQL:

Check whether it is installed: rpm -qa | grep -i mysql

Remove: rpm -e --nodeps mysql-libs-5.1.71-1.el6.x86_64

Install the server: rpm -ivh MySQL-server-5.5.47-1.linux2.6.x86_64.rpm

Install the client: rpm -ivh MySQL-client-5.5.47-1.linux2.6.x86_64.rpm

Start MySQL: service mysql start

Set the password: /usr/bin/mysql_secure_installation (answer n once, then Y four times)

Log in to MySQL: mysql -uroot -p123456

3. Setting up remote access to MySQL

1) The MySQL server and Hive should preferably not run on the same machine. Every machine needs the client installed.

2) On the server, log in to MySQL and run:

show databases;

use mysql;

select * from user;

delete from user where host!='localhost';

update user set host='%' where host='localhost';

grant all privileges on *.* to root@'%' identified by '123456';

flush privileges;

3) Install the MySQL client on the other machines, then connect remotely: mysql -uroot -p123456 -h<server IP>

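4. Creating databases, tables, and HDFS files. When you create a database or table in Hive, the matching directory is created automatically in HDFS under /user/hive/warehouse/.

1) In the directory /root/Downloads/apache-hive-1.2.0-bin/conf:

vim hive-site.xml

Replace MYSQL_SERVER_IP with the MySQL server's IP address; hive is the database name. The user name and password are the MySQL account set up in step 3.

<configuration>

<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:mysql://MYSQL_SERVER_IP:3306/hive?characterEncoding=UTF-8</value>

</property>

<property>

<name>javax.jdo.option.ConnectionDriverName</name>

<value>com.mysql.jdbc.Driver</value>

</property>

<property>

<name>javax.jdo.option.ConnectionUserName</name>

<value>root</value>

</property>

<property>

<name>javax.jdo.option.ConnectionPassword</name>

<value>123456</value>

</property>

</configuration>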

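2) Add the JDBC driver jar

Put mysql-connector-java-5.1.32.jar in the directory /root/Downloads/apache-hive-1.2.0-bin/lib.

※: In MySQL, only run read-only commands to inspect the metadata. Run insert, delete, and update commands in Hive.

3) Create the database hive in MySQL first, then start Hive. (Hive has to know where the metadata lives, so MySQL must be running first.)

4) In MySQL: use hive; show tables; select * from <table>; to view the metadata (information about databases and tables)

a) DBS: information about every database (its HDFS path, name, creation time, and so on)

b) TBLS: information about every table

c) SDS: the HDFS path for each table. Changing the path makes the table map to, and load, the data found there.

d) PARTITIONS: partitions

e) BUCKETING_COLS: bucketing columns

5) desc tablename / show create table tablename: view the table structure

A) ROW FORMAT SERDE: the class that splits a row into fields, LazySimpleSerDe

B) STORED AS INPUTFORMAT: the map input class, TextInputFormat

C) OUTPUTFORMAT: the class Hive uses to write output, HiveIgnoreKeyTextOutputFormat

D) LOCATION: where the table's data is stored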

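6) Changing the data in a table:

Hive keeps a long-lived connection to HDFS, so these operations respond quickly.

insert: adds data

load data (local) inpath 'file path' into table table_name; loads a file (a local one when the local keyword is given)

update: modifies data (only on transactional tables)

delete: deletes data (only on transactional tables)

※: Custom delimiter at table creation: create table name (column type, ...) row format delimited fields terminated by ' ';

※: Loading data with load:

## load data local inpath ... : loads a file from the local file system

## load data inpath ... : loads a file that has already been uploaded to the cluster

## ... 'file path' into table ... : appends to the existing table data

## ... 'file path' overwrite into table ... : clears the existing table data first, then loads, i.e. overwrites

※ Loading a file that is already on the cluster is a move: the file leaves its original location. A local file is copied instead.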
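
The load variants can be mimicked on ordinary directories. The sketch below is an analogy, not Hive's implementation: `load_data`, the temporary directories, and the file names are all hypothetical, and name collisions are ignored.

```python
import shutil
import tempfile
from pathlib import Path


def load_data(src, table_dir, local=True, overwrite=False):
    """Mimic LOAD DATA [LOCAL] INPATH ... [OVERWRITE] INTO TABLE on plain directories."""
    src, table_dir = Path(src), Path(table_dir)
    table_dir.mkdir(parents=True, exist_ok=True)
    if overwrite:
        # OVERWRITE: remove the table's existing data files first.
        for old in table_dir.iterdir():
            old.unlink()
    target = table_dir / src.name
    if local:
        shutil.copy(src, target)            # LOCAL: the source file stays where it was
    else:
        shutil.move(str(src), str(target))  # no LOCAL: the source file is moved
    return target


root = Path(tempfile.mkdtemp())
local_file = root / "a.txt"
local_file.write_text("1 tom\n")
cluster_file = root / "b.txt"   # stands in for a file already on the cluster
cluster_file.write_text("2 amy\n")
table = root / "warehouse" / "t"

load_data(local_file, table, local=True)
load_data(cluster_file, table, local=False)
print(sorted(p.name for p in table.iterdir()))
```

After the two calls, `a.txt` still exists at its source path while `b.txt` does not, which is the copy-versus-move difference noted above.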

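7) Tables are either managed (internal) or external

Managed table: the data is managed by Hive itself (MANAGED_TABLE). Dropping the table also deletes its files in HDFS. This is the default.

External table: the data stays under HDFS (EXTERNAL_TABLE). Dropping the table leaves its files in HDFS. create external table name ( );

The SQLyog tool can connect to the virtual machine's MySQL by IP.

use hive; select * from SDS; ... Changing the path recorded for a table maps the data in that directory into the table.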

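8) DDL: data definition language, which operates on the table structure

## Rename a table: alter table name rename to new_name;

## Change a column: alter table name change column old_column new_column data_type;

## Add a column: alter table name add columns (column_name data_type);

## Drop a column: alter table name replace columns (the full list of columns to keep);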

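9) DQL/DML: data query / manipulation language, which operates on the data in a table

## group by: grouping -> can cause data skew: groups differ in size, so their reducers finish at different times

A) Scatter skewed keys: set hive.groupby.skewindata=true;

B) Aggregate on the map side: set hive.map.aggr=true;

C) Number of rows after which the map side checks whether aggregation is paying off: set hive.groupby.mapaggr.checkinterval=100000;

Hive applies this optimization automatically; 100000 rows on their own are not expected to cause skew.

D) Threshold on (rows left after map-side aggregation / rows read), here 0.5; above it map-side aggregation is switched off: set hive.map.aggr.hash.min.reduction=0.5;

E) After grouping, filter with having.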
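
The idea behind scattering skewed keys can be sketched as a two-stage count. This is a simplified model of what hive.groupby.skewindata aims for, not Hive's code; the function name, reducer count, and sample keys are assumptions for the illustration.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers=3, seed=0):
    """Count keys in two stages so that one hot key is not handled by a single reducer."""
    rng = random.Random(seed)
    # Stage 1: rows are spread at random, so every reducer gets a share of the hot key
    # and produces a partial count.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    # Stage 2: the (much smaller) partial counts are merged by key.
    final = Counter()
    for partial in partials:
        final.update(partial)
    return partials, final


keys = ["hot"] * 9000 + ["a", "b", "c"] * 10
partials, final = two_stage_count(keys)
print(final["hot"], [p["hot"] for p in partials])
```

Grouping directly by key would send all 9000 "hot" rows to one reducer; here each stage-1 reducer handles roughly a third of them.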

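## order by (sorting) and limit (paging):

A) The secondary sort in MR is not a global sort.

B) order by is a global sort.

C) A global sort runs in a single reducer, so keep the data small (these notes suggest no more than 128 MB) and combine order by with limit.

Strict mode, which rejects order by without limit: set hive.mapred.mode=strict;

## sort by: local sort, within each reducer

distribute by: sends rows to different reducers according to the value of the given column

Set the number of reducers: set mapreduce.job.reduces=3;

## cluster by: distributes by one column and sorts by that same column only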
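
The difference between a global and a local sort can be seen in a short Python model. It is a sketch under simplifying assumptions: integer keys, routing by `key % num_reducers`, and made-up sample rows.

```python
def distribute_and_sort(rows, key, num_reducers=3):
    """Model of DISTRIBUTE BY key SORT BY key: each reducer sorts only its own rows."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Integer keys are assumed so that the routing is deterministic here.
        reducers[row[key] % num_reducers].append(row)
    return [sorted(part, key=lambda r: r[key]) for part in reducers]


def order_by(rows, key):
    """Model of ORDER BY key: one reducer sees every row, so the order is global."""
    return sorted(rows, key=lambda r: r[key])


rows = [{"id": i} for i in (7, 2, 9, 4, 1, 6)]
parts = distribute_and_sort(rows, "id")
total = order_by(rows, "id")
print(parts)
print(total)
```

Each list in `parts` is sorted, but concatenating them does not give a sorted result, which is why sort by alone is not a global sort.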

## union: combines two tables and removes duplicates

union all: combines two tables without removing duplicates

## distinct: removes duplicates

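## Joins:

A) Inner join: inner join / where / join

B) Outer join: left (outer) join / right (outer) join

C) Full join: full join

D) left semi join: uses one table as the base table; it avoids a Cartesian product and runs more efficiently (an internal optimization)

E) map join: its hallmark is loading the small table into memory (an internal optimization)

1) select /*+ MAPJOIN(small_table) */ * from table1 a join table2 b on a.no=b.no;

2) Automatic conversion: set hive.auto.convert.join=true;

Small-table size threshold (under 25 MB): set hive.mapjoin.smalltable.filesize=25000000;

select * from table1 a join table2 b on a.no=b.no;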
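
What "loading the small table into memory" buys can be shown with a hash join in plain Python. This is a sketch of the general technique, not Hive's MapJoinOperator; the tables and column names are hypothetical.

```python
def map_join(big_rows, small_rows, key):
    """Inner join with the small table held in an in-memory hash table."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    joined = []
    for row in big_rows:  # the big table is streamed once; no shuffle or reduce step
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined


emp = [{"no": 1, "name": "tom"}, {"no": 2, "name": "amy"}, {"no": 3, "name": "bob"}]
dept = [{"no": 1, "dept": "dev"}, {"no": 2, "dept": "ops"}]
result = map_join(emp, dept, "no")
print(result)
```

The approach only works while the small side fits in memory, which is what the file-size threshold above guards.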

## Partitions

A) Partitioning divides a table into several regions, which makes queries faster and more convenient.

Each partition value creates a directory in HDFS.

... partitioned by (a int, b int, c int) ... : directory a contains directory b, which contains directory c

/user/hive/warehouse/<database>/<table>/a/b/c/<data>, for example /user/hive/warehouse/hello.db/part/year=2018/month=3/day=2/..

Non-strict mode: set hive.mapred.mode=nonstrict;

B) Partitioning on one column

create table part(id int,temp int,hour int) partitioned by (day int) row format delimited fields terminated by ' ';

load data local inpath '/hadoop/part.txt' into table part partition(day=10);

C) Partitioning on several columns

create table part(id int,temp int,hour int) partitioned by (day int,month int) row format delimited fields terminated by ' ';

load data local inpath '/hadoop/part.txt' into table part partition(day=10,month=10);

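D) Dynamic partitioning

Enable dynamic partitioning: set hive.exec.dynamic.partition=true;

When every partition column is dynamic, also set: set hive.exec.dynamic.partition.mode=nonstrict;

You need a base table that holds all the data and declares the row format. The table that is partitioned dynamically does not need a row format.

create table basic(id int,temp int,hour int,day int,month int) row format delimited fields terminated by ' ';

load data local inpath '/hadoop/basic.txt' into table basic;

create table part (id int,temp int,hour int) partitioned by(day int,month int);

insert into table part partition (day,month) select id,temp,hour,day,month from basic;

E) Mixed partitioning

Two partition columns, one fixed (static) and one dynamic.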
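
How partition values turn into directories, and how a dynamic insert routes rows, can be modelled in a few lines of Python. This is only an illustration: the helper names are hypothetical and the rows are made-up sample data shaped like the basic table.

```python
from collections import defaultdict

WAREHOUSE = "/user/hive/warehouse"  # warehouse root as used in these notes


def partition_path(db, table, spec):
    """Build the directory for one partition; spec keeps the declared column order."""
    parts = "/".join(f"{col}={val}" for col, val in spec)
    return f"{WAREHOUSE}/{db}.db/{table}/{parts}"


def dynamic_insert(rows, partition_cols, db="hello", table="part"):
    """Route each row to a directory chosen from its own partition column values."""
    out = defaultdict(list)
    for row in rows:
        spec = [(col, row[col]) for col in partition_cols]
        # Partition columns live in the path, not in the data file.
        data = {k: v for k, v in row.items() if k not in partition_cols}
        out[partition_path(db, table, spec)].append(data)
    return dict(out)


rows = [
    {"id": 1, "temp": 20, "hour": 8, "day": 10, "month": 10},
    {"id": 2, "temp": 22, "hour": 9, "day": 11, "month": 10},
]
layout = dynamic_insert(rows, ["day", "month"])
for path, data in layout.items():
    print(path, data)
```

A static partition fixes the value in the statement itself, while a dynamic one reads it from each row, as `dynamic_insert` does here.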

## Bucketing: avoids a Cartesian product and improves efficiency

A) Partition first, then bucket inside each partition by the hash of id, e.g. id % 3.

Each partition directory contains several bucket files.

The number of buckets matches the number of reducers.

Non-strict mode: set hive.exec.dynamic.partition.mode=nonstrict;

Enable dynamic partitioning: set hive.exec.dynamic.partition=true;

Enforce bucketing: set hive.enforce.bucketing=true;

create table basic (id int,name String,sex int) row format delimited fields terminated by ' ';

load data local inpath '/hadoop/basic.txt' into table basic;

create table buck (id int,name String) partitioned by (sex int) clustered by (id) into 3 buckets; (3 is the number of buckets)

insert into table buck partition(sex) select id,name,sex from basic;

B) Bucketing without partitioning

Enforce bucketing: set hive.enforce.bucketing=true;

create table basic (id int,name String,sex int) row format delimited fields terminated by ' ';

load data local inpath '/hadoop/basic.txt' into table basic;

create table buck (id int,name String,sex int) clustered by (id) into 3 buckets;

insert into table buck select id,name,sex from basic;
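
The id % 3 rule can be tried out in Python. The sketch assumes non-negative integer ids whose hash is the id itself, and borrows the 000000_0 style of file name that appears in these notes; the helper names and sample rows are hypothetical.

```python
def bucket_for(value, num_buckets=3):
    """Pick a bucket for a non-negative int id (assumption: the hash is the id itself)."""
    return value % num_buckets


def write_buckets(rows, num_buckets=3):
    """Group rows into one list per bucket file, named like 000000_0, 000001_0, ..."""
    buckets = {f"{b:06d}_0": [] for b in range(num_buckets)}
    for row in rows:
        name = f"{bucket_for(row['id'], num_buckets):06d}_0"
        buckets[name].append(row)
    return buckets


rows = [{"id": i, "name": f"n{i}"} for i in range(1, 8)]
files = write_buckets(rows)
for name, content in files.items():
    print(name, [r["id"] for r in content])
```

Because equal ids always land in the same bucket, two tables bucketed the same way on the join column can be joined bucket by bucket rather than row against every row.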

※: Viewing dfs files from inside Hive

hive> dfs -text path;

hive> dfs -cat path/000000_0;

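## Indexes
A) Purpose: find data quickly / reduce pressure on the namenode

B) The HDFS paths are extracted and saved on the Hive side, in a Hive table. Hive packages this metadata, and the namenode then records it.

(The namenode still records the metadata; Hive's packaging just reduces the number of files and so the pressure on it.)

The index file packages the paths and records the location information; there can be several index files.

Index files are sorted.

C) Steps:

create index index_name on table table_name(id) (names the index, the table, and the indexed column)

as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' (a handler class that generates a file/table recording the positions and can sort it)

with deferred rebuild; (plans the index without building it yet)

show index on table_name;

Index name, table name, indexed column, name of the table that stores the location information, type:

one basic id hello__basic_one__ compact

alter index index_name on table_name rebuild; (actually builds the index)

This produces a table: Loading data to table hello.hello__basic_one__

select * from hello.hello__basic_one__; (shows the location information stored in the index table)

id (indexed column), file location, offsets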

1 hdfs://Linux01:9000/user/hive/warehouse/hello.db/basic/basic.txt [0]

2 hdfs://Linux01:9000/user/hive/warehouse/hello.db/basic/basic.txt [16]

3 hdfs://Linux01:9000/user/hive/warehouse/hello.db/basic/basic.txt [32]

4 hdfs://Linux01:9000/user/hive/warehouse/hello.db/basic/basic.txt [48]

5 hdfs://Linux01:9000/user/hive/warehouse/hello.db/basic/basic.txt [64]
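
The index table above pairs each id with a file and a byte offset. A minimal Python sketch of the same idea follows; it is not how CompactIndexHandler is implemented, and the temporary file, its contents, and the function names are assumptions for the illustration.

```python
import os
import tempfile
from collections import defaultdict


def build_index(path, column=0, delimiter=" "):
    """Map each value of the indexed column to the byte offsets of its rows."""
    index = defaultdict(list)
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            value = line.decode().rstrip("\n").split(delimiter)[column]
            index[value].append(offset)
            offset += len(line)
    return dict(index)


def lookup(path, index, value):
    """Seek straight to the recorded offsets instead of scanning the whole file."""
    rows = []
    with open(path, "rb") as f:
        for offset in index.get(value, []):
            f.seek(offset)
            rows.append(f.readline().decode().rstrip("\n"))
    return rows


fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", newline="\n") as f:
    f.write("1 tom\n2 amy\n3 bob\n")

index = build_index(path)
print(index)
print(lookup(path, index, "2"))
```

Each row here is 6 bytes long, so the offsets step by 6 in the same way the listing above steps by 16.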

## Complex data types

A) Struct: key-value pairs with fixed keys

Specify the delimiter between fields and the delimiter between the items inside the struct.

create table str(id int, name String, hobby struct<a:String,b:String,c:String>)

row format delimited fields terminated by ' '

collection items terminated by ',';

B) Map: specify the delimiter between fields, the delimiter between map entries, and the delimiter between a key and its value.

create table map_table (id int, name String, hobby map<String,String>)

row format delimited fields terminated by ' '

collection items terminated by ','

map keys terminated by ':';

C) Array: specify the delimiter between fields and the delimiter between array elements.

create table arr_table (id int, name String, star Array<String>)

row format delimited fields terminated by ' '

collection items terminated by ',';
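
How the three delimiters cooperate can be checked with a small parser. It is a sketch, not a SerDe: the row layout (id, name, an array column, a map column) and the sample values are assumptions chosen to match the delimiters used above.

```python
def parse_row(line, field_sep=" ", item_sep=",", key_sep=":"):
    """Split a row laid out as: id, name, array column, map column."""
    id_, name, stars, hobbies = line.split(field_sep)
    return {
        "id": int(id_),
        "name": name,
        # Array<String>: items separated by the collection delimiter.
        "star": stars.split(item_sep),
        # Map<String,String>: entries by the collection delimiter, key/value by the map-key delimiter.
        "hobby": dict(kv.split(key_sep, 1) for kv in hobbies.split(item_sep)),
    }


def parse_struct(text, names, item_sep=","):
    """A struct has fixed field names, so the file stores only the values."""
    return dict(zip(names, text.split(item_sep)))


row = parse_row("1 tom a,b,c sport:run,food:rice")
hobby = parse_struct("run,swim,read", ["a", "b", "c"])
print(row)
print(hobby)
```

The struct's keys come from the table definition, while a map's keys come from the data, which is why only the map needs a key delimiter.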

## Subqueries and joins in Hive

A) A query result has columns and rows, so it can be used as a table: read from it and filter further.

B) select * from (select col1 from tablename group by col1) tableAlias where col1 = value;

C) Each department's name and its average salary

select t1.deptnum, t2.avg_salary from dept t1

join (select avg(salary) avg_salary, deptno from emp group by deptno) t2

on t1.deptno=t2.deptno;

D) Who earns more than tom

select t1.* from emp t1 join (select salary from emp where ename='tom') t2 where t1.salary > t2.salary;

E) Which jobs a given department has

select distinct t1.job, t2.deptnum from emp t1 join (select deptno, deptnum from dept where deptnum='yanfabu') t2 on t1.deptno=t2.deptno;
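
The subquery pattern can be tried without a cluster by using Python's built-in sqlite3 module. SQLite is not Hive, so this only checks the query logic; the table contents, the second department name, and the avg_salary alias are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (deptno INTEGER, deptnum TEXT);
    CREATE TABLE emp (ename TEXT, job TEXT, salary REAL, deptno INTEGER);
    INSERT INTO dept VALUES (10, 'yanfabu'), (20, 'xiaoshoubu');
    INSERT INTO emp VALUES
        ('tom', 'dev', 3000, 10), ('amy', 'dev', 5000, 10),
        ('bob', 'test', 4000, 10), ('eve', 'sales', 2000, 20);
""")

# Average salary per department: a derived table joined back to dept.
avg_by_dept = conn.execute("""
    SELECT t1.deptnum, t2.avg_salary FROM dept t1
    JOIN (SELECT deptno, AVG(salary) AS avg_salary FROM emp GROUP BY deptno) t2
      ON t1.deptno = t2.deptno
    ORDER BY t1.deptnum
""").fetchall()

# Everyone who earns more than tom: the subquery yields one row to compare against.
richer = conn.execute("""
    SELECT t1.ename FROM emp t1
    JOIN (SELECT salary FROM emp WHERE ename = 'tom') t2
    WHERE t1.salary > t2.salary
    ORDER BY t1.ename
""").fetchall()

print(avg_by_dept)
print(richer)
```

In both queries the inner select is evaluated first and then treated as an ordinary table named t2.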

## row_number() over(order by column)

A) Shows a row number ordered by the given column, i.e. numbers the rows in sorted order.
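
The effect of row_number() over an ordering can be reproduced in plain Python. The function below is an illustrative stand-in, and the emp rows are made-up sample data.

```python
def row_number(rows, order_key):
    """Attach a 1-based row number after sorting by order_key."""
    ordered = sorted(rows, key=lambda r: r[order_key])
    return [{**row, "rn": n} for n, row in enumerate(ordered, start=1)]


emp = [
    {"ename": "tom", "salary": 3000},
    {"ename": "amy", "salary": 5000},
    {"ename": "eve", "salary": 2000},
]
numbered = row_number(emp, "salary")
for row in numbered:
    print(row["rn"], row["ename"], row["salary"])
```

The number depends only on the position in the sorted result, so rows with equal values still get distinct numbers.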
