HIVE 必知必会

Posted wqbin

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了HIVE 必知必会相关的知识,希望对你有一定的参考价值。

hive:
基于hadoop,数据仓库软件,用作OLAP

OLAP:online analyze process 在线分析处理
OLTP:online transaction process 在线事务处理

事务:
ACID

A:atomic 原子性
C: consistent 一致性
I:isolation 隔离性
D: durability 持久性

 

1读未提交
  脏读 //事务一写入数据,事务二进行读取,事务一进行回滚
2读已提交
  不可重复读 //事务一写入数据并提交,事务二读取,事务一进行update操作,事务二读取的数据不一致
4可重复读
  幻读 //在mysql中见不到
8串行化

默认事务隔离级别 => mysql 4
=> oracle 2 , oracle没有4


RDBMS:关系型数据库管理系统

范式:
第一范式:无重复的列,一列只能包含一个字段
第二范式:主键约束,一行只能被唯一标识
第三范式:非主键字段要严格依赖于主键字段


--------------------------------------------------
id info name age classroom aircondition fans chair
--------------------------------------------------
1 tom,20,married
--------------------------------------------------
1
--------------------------------------------------


Hive数据仓库:反范式设计,允许甚至推荐冗余

 


安装centos版本的mysql:
===================================
1、将hive相关软件下的文件或文件夹均拷贝到/home/centos下

  技术图片

 

2、先安装centos版的mysql
  1)安装mysql的仓库镜像
  sudo rpm -ivh mysql-community-release-el7-5.noarch.rpm

  2)安装mysql
  cd mysql
  sudo yum -y localinstall *

  3)启动并修改配置mysql密码
  centos> systemctl start mysqld
  centos> systemctl enable mysqld
  centos> mysql -uroot //进入mysql中

  mysql> update mysql.user set password=password(‘root‘); //设置mysql密码为root

  mysql> flush privileges; //刷新权限列表

  4)退出mysql并重新进入
  mysql> exit
  centos> mysql -uroot -proot

 

3、安装hive:
  1)解压
  tar -xzvf apache-hive-2.1.1-bin.tar.gz -C /soft/

  2)符号链接
  cd /soft/
  ln -s apache-hive-2.1.1-bin/ hive

  3)环境变量
  # hive环境变量
  export HIVE_HOME=/soft/hive
  export PATH=$PATH:$HIVE_HOME/bin

  4)生效环境变量
  source /etc/profile

  5)修改hive配置文件:

    1.重命名所有的template文件后缀去掉
    rename ‘.template‘ ‘‘ *.template

    2.修改hive-env.sh文件,添加
    HADOOP_HOME=/soft/hadoop

    3.重命名hive-default.xml文件为hive-site.xml
    mv hive-default.xml hive-site.xml

    4.修改hive-site.xml //替换自己的hive-site.xml
      4.5、修改hive-site.xml中的$system:user.name和$system:java.io.tmpdir //可选
      sed -i ‘[email protected]$system:user.name@[email protected]‘ hive-site.xml
      sed -i ‘[email protected]$system:java.io.tmpdir@/home/centos/[email protected]‘ hive-site.xml

    5.拷贝mysql驱动到hive下
    cp ~/mysql-connector-java-5.1.44.jar /soft/hive/lib/

    6.在mysql中创建数据库hive
    create database hive;

    7.初始化元数据库
    schematool -initSchema -dbType mysql

4、体验hive
    输入 hive 进入命令行

 

 

==========================================================hive使用==============================================================

启动hive的顺序
  1、启动zk
  2、启动hadoop(hdfs+MR)
  3、启动hive


hive数据类型:
  int
  string
  bigint

  array => [‘jenny‘,‘marry‘] //array<String>

  colmun, fruit
  struct => 1, ‘apple‘ //事先定义好字段名称 struct<column int, fruit string>

  int: string
  map => 1:‘apple‘ //事先定义好k-v类型 map<int,string>


  unoin => ‘tom‘:[‘jenny‘,‘marry‘] //联合类型,此例子是map和array的嵌套

 

hive建表指定分隔符:
==========================================================================
create table users2 (id int, name string, age int) row format delimited fields terminated by ‘\\t‘;

 

row format delimited+
分隔符类型:
fields terminated by ‘\\t‘     //字段分隔符
collection items terminated by ‘,‘   //array类型成员分隔符
map keys terminated by ‘:‘
lines terminated by ‘\\n‘   //行分隔符,必须放在最后

 

 

hive插入数据不是使用insert,而是load:
=======================================================================
批量插入数据

1、先定义好表结构
  create table employee(name string, work_place array<string>, sex_age struct<sex:string, age:int> , score map<string, int>, depart_title map<string, string>) row format delimited fields terminated by ‘|‘ collection items terminated by ‘,‘ map keys terminated by ‘:‘ lines terminated by ‘\\n‘ stored as textfile;

2、准备数据。数据格式要和表结构对应 employee.txt
  Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer
  Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
  Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
  Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead

3、空表中使用load命令加载数据
  load data local inpath ‘/home/centos/employee.txt‘ into table employee;


4、取出所有的成员
  select name ,work_place[0] from employee; array获取
  select name ,sex_age.sex from employee; 结构体获取
  select name, score[‘python‘] from employee; map成员获取

 

 

插入复杂类型数据insert
===============================

1.设置显示表头
  set hive.cli.print.header=true; //设置显示表头,临时修改,想要永久修改需要修改配置文件

2.插入复杂类型使用转储:insert xxx select xxx

3.通过select语句构造某类型

  select array(‘tom‘,‘tomas‘,‘tomson‘) ; //通过select语句构造出array类型
  insert into employee(name,work_place) select ‘tom‘, array(‘beijing‘,‘shanghai‘,‘guangzhou‘) ;


  select map(‘bigdata‘,100); //通过select语句构造出map类型
  insert into employee(name,score) select ‘tomas‘, map(‘bigdata‘,100);

  select struct(‘male‘,10);
  select named_struct(‘sex‘,‘male‘,‘age‘,10); //通过select语句构造出struct类型
  insert into employee(name,sex_age) select ‘tomson‘,named_struct(‘sex‘,‘male‘,‘age‘,10);

 

load命令详解:
=================================
1、load本地数据
  load data local inpath ‘/home/centos/employee.txt‘ into table employee; //相当于上传或者复制,源文件不变

2、load hdfs数据
  load data inpath ‘/duowan_user.txt‘ into table duowan; //相当于移动

3、load本地数据 + 覆盖原始数据
  load data local inpath ‘/home/centos/employee.txt‘ overwrite into table employee;

4、load hdfs数据 + 覆盖原始数据
  load data inpath ‘/duowan_user.txt‘ overwrite into table duowan;

 

 

 

duowan数据处理:
====================================
//创建表
  create table duowan(id int, name string, pass string, mail string, nickname string) row format delimited fields terminated by ‘\\t‘ lines terminated by ‘\\n‘ stored as textfile;


//加载数据
  load data inpath ‘/duowan_user.txt‘ into table duowan;


//开始执行
  select pass , count(*) as count from duowan group by pass order by count desc limit 10;

//设置reduce个数
  set mapreduce.job.reduces=2;

 

 

hiveserver2

//hive的jdbc接口,用户可以连接此端口来连接hive服务器
=======================================

hive jdbc //端口10000
web页面 //10002

通过jdbc连接hiveserver2

org.apache.hive.jdbc.HiveDriver //jdbc驱动类

问题:java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException:User: centos is not allowed to impersonate anonymous

解决:hive.server2.enable.doAs=false //hive-site.xml配置文件禁用连接验证

 

//不再使用hive直接使用hiveserver2//再在s201新界面输入   beeline -u jdbc:hive2://localhost:10000

如何使用新hive端口在ide上
1、启动hiveserver2 服务端
  hiveserver2

2、编程
//查询数据使用ide

 1 public static void main(String[] args) throws Exception 
 2 Class.forName("org.apache.hive.jdbc.HiveDriver");
 3 String url = "jdbc:hive2://192.168.23.101:10000/big12";
 4 Connection conn = DriverManager.getConnection(url);
 5 Statement st = conn.createStatement();
 6 ResultSet rs = st.executeQuery("select * from duowan limit 20");
 7 while(rs.next())
 8 int id = rs.getInt(1);
 9 String name = rs.getString(2);
10 String pass = rs.getString(3);
11 String mail = rs.getString(4);
12 String nickname = rs.getString(5);
13 System.out.println(id + "/" + name + "/" + pass + "/" + mail + "/" + nickname);
14 
15 rs.close();
16 st.close();
17 conn.close();
18 
19 
20 //插入数据
21 @Test
22 public void testInsert() throws Exception
23 Class.forName("org.apache.hive.jdbc.HiveDriver");
24 String url = "jdbc:hive2://192.168.23.101:10000/big12";
25 Connection conn = DriverManager.getConnection(url);
26 Statement st = conn.createStatement();
27 boolean b = st.execute("insert into users values(2,‘tomas‘,30)");
28 if(b)
29 System.out.println("ok");
30 
31 

 

3、使用新一代客户端beeline连接hiveserer2
  beeline -u jdbc:hive2://localhost:10000

  

4.问题解决:
  问题:host 192.168.23.1 is not allowed to connect to mysql server

  解决:在mysql中输入
  grant all privileges on *.* to ‘root‘@‘192.168.23.1‘ identified by ‘root‘;
  flush privileges;

 

使用beeline进行操作:
====================================
1、join
1)customer
  create table customers(id int, name string, age int) row format delimited fields terminated by ‘\\t‘;
2)orders
  create table orders(oid int, oname string, oprice float, uid int) row format delimited fields terminated by ‘\\t‘;

3)创建并加载数据
  load data local inpath ‘‘ into table xxx;

4)使用join:
//内连接
  select a.id, a.name, b.oname, b.oprice from customers a inner join orders b on a.id=b.uid;
//左外连接
  select a.id, a.name, b.oname, b.oprice from customers a left outer join orders b on a.id=b.uid;
//右外连接
  select a.id, a.name, b.oname, b.oprice from customers a right outer join orders b on a.id=b.uid;
//全外连接
  select a.id, a.name, b.oname, b.oprice from customers a full outer join orders b on a.id=b.uid;

 

 

2、wordcount

1)array => collection items terminated by ‘ ‘ => explode()
  explode => array

  1.创建表wc
  create table wc(line array<string>) row format delimited collection items terminated by ‘ ‘;

  2.加载数据
  load xxx

  3.编写sql语句
  select word, count(*) as count from (select explode(line) word from wc ) a group by word order by count desc;


2)line string => split()
  split => select split(‘hello world‘, ‘ ‘) => ["hello","world"]

  1.创建表wc2
  create table wc2(line string) row format delimited;

  2.加载数据
  load xxx

  3.编写sql语句
  select word , count(*) as count from (select explode(split(line,‘ ‘)) as word from wc2) a group by word order by count desc;

 

3、最高气温统计

  substr(line,16,4) => 1901
  substr(line,88,5) => +0010

  cast(temp as int);

1)创建temp表
  create table temp(line string) ;

2)加载temp数据到表中
  load data xxx

3)select year, max(temp) from (select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp) a where temp != 9999 group by year;

 

 

本地模式:
================================================================
  让hive自动使用hadoop的本地模式运行作业,提升处理性能

  set hive.exec.mode.local.auto=true;

 

  hive:
  !sh ls -l /home/centos => 在hive命令行执行shell语句

  dfs -ls / ;

 

alter 修改

================================================================

  alter table wc rename to wc3; //修改表名

  alter table user_par add partition(province=‘beijing‘,city=‘beijing‘);//添加分区

  alter table customers drop column wife; //删除列,不能用!!!!!
  alter table customers replace columns(wife1 string, wife2 string, wife3 string); //替换所有列
  alter table customers change column id no string; //将id列改名为no,类型为string

 

 

hive分区:
=========================================
partition,是一个字段,是文件夹

前提:在hive进行where子句查询的时候,会将条件语句和全表进行比对,搜索出所需的数据,性能极差:避免全表扫描


0:建立user_par
create table user_par(id int, name string, age int) partitioned by(province string, city string)row format delimited fields terminated by ‘\\t‘;

1.1、添加分区
alter table user_par add partition(province=‘beijing‘,city=‘beijing‘);
1.2、将数据加载到指定分区(分区可以不存在)
load data local inpath ‘/home/centos/customers.txt‘ into table user_par partition (province=‘shanxi‘,city=‘taiyuan‘);
1.3、将表清空
truncate table user_par;

or

1.1、插入数据动态指定分区
//设置动态分区非严格模式,无需指定静态分区
set hive.exec.dynamic.partition.mode=nonstrict;
insert into user_par partition(province,city) select * from user_nopar;

2、建立user_nopar
create table user_nopar(id int, name string, age int, province string, city string) row format delimited fields terminated by ‘\\t‘;

 

SQL => struct query language

HQL => hive query language 类sql,和sql语句差别不大

no sql => not only sql 不仅仅是sql,和sql语句差距较大


HQL语句和MR执行过程的对应

select

==============================================join=====partition=====bucket==========================================================================

join:

========================================================================
select a.id, a.name, b.orderno, b.oprice from customers a inner join orders b on a.id=b.cid;

a inner join b //返回行数 a ∩ b

a left [outer] join b //返回行数 a

a right [outer] join b //返回行数 b

a full [outer] join b //返回行数 a+b - (a ∩ b)

a cross join b //返回行数 a * b

 

 

特殊join:
====================================================================================
1.map join

小表+大表 => 将小表加入到分布式缓存,通过迭代大表所有数据进行处理

在老版的hive中(0.7)之前,所有的join操作都是在reduce端执行的(reduce 端join)
想要进行map端join,需要进行以下操作
SET hive.auto.convert.join=true;
声明暗示 a join b , a小表,b大表
/*+ mapjoin(小表) */
SELECT /*+ MAPJOIN(a) */ a.id, a.name, b.orderno, b.oprice from customers a inner join orders b on a.id=b.cid;


在新版hive中,如果想要进行map端join

jdbc:hive2://> SET hive.auto.convert.join=true; //设置自动转换成map端join
jdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000; //设置map端join中小表的最大值

 

2.common join
即reduce端join
  1)、声明暗示,指定大表
  /*+ STREAMTABLE(大表) */

2)、将大表放在右侧

 

 

测试:customers和orders

  //不写任何暗示,观察是map端join还是reduce join
  SELECT a.no, a.name, b.oname, b.oprice from customers a inner join orders b on a.no=b.uid;

//写暗示,观察效果
SELECT /*+ MAPJOIN(a) */ a.no, a.name, b.oname, b.oprice from customers a inner join orders b on a.no=b.uid;

//将自动转换map join 设置成false
SET hive.auto.convert.join=false;

//写reduce端join的暗示,观察结果
SELECT /*+ STREAMTABLE(a) */ a.no, a.name, b.oname, b.oprice from customers a inner join orders b on a.no=b.uid;

 

 

partition:
=============================================================================== 
文件夹形式存在,为了避免查询时的全表扫描

0.1  create table user(id int, name string, age int);

0.2  create table user2(id int, name string, age int) partitioned by (province string, city string);


1、手动创建分区
  alter table user add partition(province=‘beijing‘,city=‘beijing‘);

2、load数据到分区
  load data local inpath ‘‘ into table user partition(province=‘beijing‘,city=‘beijing‘);

3、插入数据动态创建分区
  //设置动态分区非严格模式
  set hive.exec.dynamic.partition.mode=nonstrict
  insert into user2 partition(province, city) select * from user;

  注意:在动态插入分区字段时注意,字段顺序必须要和分区顺序保持一致,和字段名称无关

4、删除分区数据
  alter table user_par2 drop partition(province=‘chengdu‘);

5、insert数据到分区表
  insert into user_par2 partition(province=‘USA‘, city=‘NewYork‘) select 10,‘jerry‘,30;

6、查看指定表的分区列表:
  show partitions user_par2;

7、如何创建分区
  1)以日期或时间进行分区 比如year, month, and day
  2)以位置进行分区 比如Use country, territory, state, and city
  3)以业务逻辑进行分区

 

bucket:桶表
==================================================================
新型数据结构,以‘’文件段‘’的形式在分区表内部按照指定字段进行分隔---把分区表继续分成更小的文件 根据hashcode随机分配

重要特性:优化join的速度,避免全区扫描

//创建分区分桶表   分区在前分桶在后

  create table user_bucket(id int, name string, age int) CLUSTERED BY (id) INTO 2 BUCKETS row format delimited fields terminated by ‘\\t‘;

//在桶表中转储数据
  insert into user_bucket select id, name , age from user_par2;

//查看hdfs中桶表的数据结构-------分区下又分成更小的文件

//将桶表和分区表一同使用建立新表user_new, 分区在前
  create table user_new(id int, name string, age int) partitioned by (province string, city string) CLUSTERED BY (id) INTO 2 BUCKETS row format delimited fields terminated by ‘\\t‘;

//查看hdfs中桶表的数据结构-------探讨

    step0:create table user_new(id int, name string, age int) partitioned by (province string, city string) CLUSTERED BY (id) INTO 2 BUCKETS row format delimited fields terminated by ‘\\t‘;

    step1:load data local inpath ‘/home/centos/customers.txt‘ into table user_new partition(province=‘beijing‘,city=‘beijing‘);

    step2:...查看hdfs...发现在(province=‘beijing‘,city=‘beijing‘)partition文件夹下只有一个文件。。。并没有自动分桶

    结论:load并不会修改表中的数据结构,在桶表中的体现,就是没有将数据进行分段

//insert 数据
  insert into user_new partition(province=‘USA‘, city=‘NewYork‘) select 10,‘jerry‘,30;

 

//如何指定分桶字段
    通过join字段进行桶字段的确定,在以下场景中分桶字段a => no , b => uid

技术图片

 

 

 

内部表&外部表

=======================================================
内部表:
  删除表 => 删除元数据,删除真实数据
  MANAGED_TABLE 也叫托管表
  默认表类型

 


外部表:
  删除表 => 只删除元数据,不删除真实数据
  场景:为了防止drop或者truncate表的时候数据丢失的问题
  external table

create external table user_external as select * from user_par;

drop  table user_externa;

打开hdfs。。发现数据文件还在

 

 


ctas:
=====================================================================
create table user_par2 like user_par; //创建user_par2,与user_par表结构一致,但是没有数据

create table user_par3 as select * from user_par; //创建user_par2,与user_par完全一致,包括数据
//但是分区会被变为字段

 

 

 

查看函数的扩展信息:

=====================================================================

desc function extended xxxxxxx;
   eg:desc function extended format_number;

 

 

 

时间函数:
=====================================================================
select current_database() //当前数据库。。。。无语

select current_date() //当前日期

select current_timestamp() //当前时间戳,精确到毫秒

select date_format( current_timestamp(), ‘yyyy-MM-dd HH:mm:ss‘); //将时间格式化

select unix_timestamp(current_date()) //将日期转换成时间戳,精确到秒

select from_unixtime(153361440000, ‘yyyy-MM-dd‘); //将时间戳转化成日期

select datediff(‘2018-03-01‘,‘2018-02-01‘); //计算两个指定日期相差多少天

 

 


字符串函数
=====================================================================
select split(‘hello world‘,‘ ‘); => array()

select substr(‘hello world‘, 7); => world

select substr(‘hello world‘, 7,12); => world

select trim(‘ world‘); => 去掉前后的空格

format_number()

    //select format_number(1234.345,2); => 1,234.35

  //select format_number(1234.345,‘000000.00‘); => 001234.35
  //select format_number(1234.345,‘########.##‘); => 1234.35

 

select concat(‘hello‘, ‘ world‘);    //拼串操作,返回 hello world

select length(‘helloworld‘)         //10

 

 

 

条件语句
=====================================================================
select if( age > 10 ,‘old‘, ‘young‘) from user_par; //相当于三元运算符。
  //第一个表达式成立返回第二个表达式
  //第一个不成立返回第三个表达式

select case when age<20 then ‘young‘ when age<40 then ‘middle‘ else ‘old‘ end from user_par;

  //小于20,返回young
  //小于40,返回middle

 

select case  age when 20 then ‘20岁‘ when 40 then ‘40岁‘ else ‘其他岁‘ end from user_par; 


 

 

排序:

order by //全排序
sort by //部分排序
distribute by //相当于hash分区,但是没有排序
cluster by //sort by + distribute by

=============================================================================================
//建表--- user_order 无分区
  create table user_order(id int, name string, age int, province string, city string) row format delimited fields terminated by ‘\\t‘;

//设置reduce个数
  set mapreduce.job.reduces=2;


order by
全排序 //强制一个reduce,在真实使用中,需要加limit限制。
  truncate table user_order;
  insert into user_order select * from user_par order by id;

  最后输出文件个数为2 =reduce


sort by
部分排序 //在每个reduce中分别排序
  truncate table user_order;

  set mapreduce.job.reduces=2;
  insert into user_order select * from user_par sort by id;

  最后输出文件个数为2 =reduce


distribute by
hash分区 // 根据id hash分区 输出文件不排序
  truncate table user_order;

  set mapreduce.job.reduces=2;
  insert into user_order select * from user_par distribute by id;

  最后输出文件个数为1 =强制reduce数


cluster by = distribute by + sort by
  truncate table user_order;

  set mapreduce.job.reduces=2;
  insert into user_order select * from user_par cluster by id;

 最后输出文件个数为2 =reduce

 

 

 

Hive的存储格式
================================================================

行级存储

textfile:

sequencefile :二进制的k-v对

  hive-site.xml默认: SET hive.exec.compress.output=true;
             SET io.seqfile.compression.type=BLOCK;

 

 

列级存储
  rcfile
    先将数据进行横切(4M),成为行组,行组内又纵向切割分为多个字段


  orc
    比rc文件更大的块(256M),优化磁盘的线性读取,通过指定的编码器确定数据类型并优化压缩
    还存储了基本统计数据,比如min,max,sum,count。。。


  parquet
    适用范围更广(在hadoop生态系统中)
    适用于嵌套文件格式

 

大小
create table user_seq(id int, name string, pass string, email string, nickname string) stored as SEQUENCEFILE; x
create table user_rc(id int, name string, pass string, email string, nickname string) stored as rcfile;
create table user_orc2(id int, name string, pass string, email string, nickname string) stored as orc tblproperties("orc.compress"="ZLIB");
create table user_parquet2(id int, name string, pass string, email string, nickname string) stored as parquet tblproperties("parquet.compression"="GZIP"); //NOCOMPRESS、SNAPPY、GZIP


insert into user_seq select * from user_nopar;
insert into user_rc select * from user_nopar;
insert into user_orc2 select * from user_nopar;
insert into user_parquet2 select * from user_nopar;

1、比较生成文件大小

text:136.8 MB
seq:222.41 MB 性能差
rc:353.52 KB
orc:97.4 KB //默认有压缩:ZLIB
parquet:9.26 MB //默认无压缩:NO ===GZIP==> 62.5 KB

性能

select count(*) from user_nopar; //3.649
select count(*) from user_seq; //25.368 性能差
select count(*) from user_rc; //2.391
select count(*) from user_orc2; //2.334
select count(*) from user_parquet2; //3.339 √

 

 

本地模式

==============================================================================

//设置hive自动使用本地模式
SET hive.exec.mode.local.auto=true;
//输入文件大小低于此值会进入本地模式
SET hive.exec.mode.local.auto.inputbytes.max=500000000;
//输入文件个数低于此值会进入本地模式
SET hive.exec.mode.local.auto.input.files.max=5;

以上是关于HIVE 必知必会的主要内容,如果未能解决你的问题,请参考以下文章

大数据必知必会 | Hive架构设计和原理

大数据必知必会 | Hive架构设计和原理

MySQL必知必会

大数据必知必会系列——萌新提问怎么定义HiveUDF函数?能否给个示例[新星计划]

大数据面试杀招——Spark高频考点,必知必会!

必知必会的设计原则——合成复用原则