How to read only part of the data in a Hive table

Posted 2023-05-12


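Reference answer A: Hive's INSERT statement can take its data from a query and load it into the target table in the same step. Suppose there is a table staged_employees (the full employee table) that already contains data, and that the country (cnty) and state (st) are two of its attributes. As an experiment, we query the data from this table and insert it into another table, employe... (this answer was accepted by the asker)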

Hive partition operations

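I. Hive partitions
(A) Partition concept:
Why create partitions: as a single table grows, a Hive SELECT query normally scans the whole table,
which wastes a lot of time on unnecessary work. Often only the relevant part of the table needs to be
scanned, so the partition concept is introduced when the table is created.
(1) Difference between Hive and MySQL partitions: MySQL partitions on a column that already belongs to the table,
whereas in Hive the partition column is not one of the table's data columns.
(2) How to partition: choose the partition key from the business scenario, for example id, year, month, day, gender or age group,
ideally a key that spreads the data evenly across files. A poor partitioning choice directly slows down queries.
(3) Partition details:
1. A table can have one or more partitions; each partition is stored as its own folder under the table's directory.
2. Table and column names are case-insensitive.
3. A partition appears as a column in the table schema and is visible with the describe command (it acts as
a pseudo-column), but it stores no actual data in the data files; it only identifies the partition.
4. Partitions can be single-level, two-level, three-level or deeper.
5. Partitions can be populated dynamically, statically or with a mix of both:
Dynamic partitioning: the partition values are taken from the data at insert time.
Static partitioning: the partition values are stated explicitly when the data is loaded.
Mixed partitioning: some partition columns are static and the others dynamic.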
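
As a rough illustration of details 1, 3 and 4 above, the sketch below defines a two-level partitioned table and notes in comments how each partition maps to a folder. The table demo_logs and its columns are hypothetical names made up for this example; they do not come from the original post.

```sql
-- Hypothetical table, used only for illustration.
create table if not exists demo_logs(
  id int,
  msg string
)
partitioned by(year int, month int)
row format delimited fields terminated by ',';

-- Each (year, month) pair becomes a nested folder under the table directory,
-- for example .../demo_logs/year=2016/month=1/
alter table demo_logs add if not exists partition(year=2016, month=1);

-- The partition columns are listed after the data columns.
describe demo_logs;

-- Only the folder for year=2016, month=1 has to be read for this query.
select id, msg from demo_logs where year=2016 and month=1;
```

Filtering on the partition columns is what lets Hive skip the other folders, which is the point of partitioning described at the start of this section.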


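II. Partition examples
Case 1: Use a Hive partitioned table to store the student records of class 1608c by gender;
create table if not exists part_1608c(
sno int,
sname string,
sage int,
saddress string
)
partitioned by(sex string)
row format delimited fields terminated by ',';

Note: creating a partitioned table:
partitioned by(sex string) declares the partition column; it is not one of the table's data columns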

vi part_1608c_nan.txt
10001,laowu,18,daxing
10002,laowang,48,fanshan
10003,laozhang,8,daxing
10004,laoxu,18,daxing

vi part_1608c_nv.txt
10005,xiaowu,28,daxing
10006,xiaowang,18,fanshan
10007,xiaozhang,18,daxing
10008,xiaoxu,18,daxing

*************Loading data into partitions with LOAD DATA**********************

Importing data into a partitioned table:
load data local inpath '/opt/data/part_1608c_nan.txt' into table part_1608c partition(sex='nan');
load data local inpath '/opt/data/part_1608c_nv.txt' into table part_1608c partition(sex='nv');

List the table's partitions:
show partitions part_1608c;
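
To see where the loaded files ended up, the commands below can be run from the Hive CLI. This is only a sketch: the path /user/hive/warehouse is the usual default warehouse location and is an assumption here; describe formatted reports the real Location for your installation.

```sql
-- Shows the table's actual Location along with other metadata.
describe formatted part_1608c;

-- Assumed default warehouse path; replace it with the Location reported above.
dfs -ls /user/hive/warehouse/part_1608c;
dfs -ls /user/hive/warehouse/part_1608c/sex=nan;
```

Each partition should appear as a sex=<value> folder containing the file that was loaded into it.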

Add partitions:
alter table part_1608c add partition(sex='bunan');
or
alter table part_1608c add partition(sex='bunv') partition(sex='bunanbunv');
Rename a partition:
alter table part_1608c partition(sex='bunanbunv') rename to partition(sex='nannv');
Drop partitions:
alter table part_1608c drop partition(sex='bunv');
alter table part_1608c drop partition(sex='bunan');
alter table part_1608c drop partition(sex='nannv');
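
Adding a partition that already exists, or dropping one that does not, raises an error. A minimal sketch of the guarded forms, reusing the example partition value bunan from above, which makes such statements safe to re-run:

```sql
-- Does nothing if the partition is already registered.
alter table part_1608c add if not exists partition(sex='bunan');

-- Does nothing if the partition is missing.
alter table part_1608c drop if exists partition(sex='bunan');

show partitions part_1608c;
```

Keep in mind that part_1608c is a managed (non-external) table, so dropping one of its partitions also removes that partition's data files.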

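Querying a partitioned table:
Note: in strict mode, a query on a partitioned table must include the partition column and a value for it in the WHERE clause!
set hive.mapred.mode=strict;
select * from part_1608c where sex='nan';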
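
The partition column can be combined with ordinary predicates and selected like any other column. The sketch below assumes the part_1608c table and data loaded above; explain is used only as a way to inspect the plan, and its exact output varies by Hive version.

```sql
set hive.mapred.mode=strict;

-- Rejected in strict mode because there is no partition filter:
-- select * from part_1608c;

-- Partition filter plus a normal column filter; sex can be selected as a column.
select sno, sname, sex from part_1608c where sex='nan' and sage >= 18;

-- Inspect the query plan for a single-partition read.
explain select * from part_1608c where sex='nv';
```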

Case 2: Use a Hive partitioned table to store the student records of class 1608c by gender;
Create a regular (non-partitioned) table:
create table if not exists tb_students(
sno int,
sname string,
sage int,
saddress string,
sex string
)
row format delimited fields terminated by ',';

vi tb_students.txt
10001,laowu,18,daxing,nan
10002,laowang,48,fanshan,nv
10003,laozhang,8,daxing,nan
10004,laoxu,18,daxing,nv
10005,xiaowu,28,daxing,nan
10006,xiaowang,18,fanshan,nv
10007,xiaozhang,18,daxing,nv
10008,xiaoxu,18,daxing,nan

load data local inpath '/opt/data/tb_students.txt' into table tb_students;

Create a partitioned table:
create table if not exists part_1608c2(
sno int,
sname string,
sage int,
saddress string
)
partitioned by(sex string)
row format delimited fields terminated by ',';

***************Adding data to a partitioned table with INSERT INTO***************
insert into table part_1608c2 partition(sex='nan')
select sno,sname,sage,saddress from tb_students where sex='nan';

insert into table part_1608c2 partition(sex='nv')
select sno,sname,sage,saddress from tb_students where sex='nv';

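Summary: both ways of loading data shown above (LOAD DATA and INSERT with a fixed partition value) are called static partitioning.
Static partitioning: the partition column and its value are specified explicitly in the statement (sex='nan', sex='nv').
When to use static partitioning: when the partition columns and their values are known in advance and the number of partitions is small.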
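
Running the INSERT INTO statements a second time appends the same rows again. If the load needs to be repeatable, one option is INSERT OVERWRITE, which replaces the contents of the named partition only. A minimal sketch using the tables from this case:

```sql
-- Replaces only the sex='nan' partition; the sex='nv' partition is untouched.
insert overwrite table part_1608c2 partition(sex='nan')
select sno,sname,sage,saddress from tb_students where sex='nan';

insert overwrite table part_1608c2 partition(sex='nv')
select sno,sname,sage,saddress from tb_students where sex='nv';
```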

Dynamic partitioning example
Case 2: Partition the student records by age
Create a regular (non-partitioned) table:
create table if not exists tb_students2(
sno int,
sname string,
saddress string,
sex string,
sage int
)
row format delimited fields terminated by ',';

vi tb_students2.txt
10001,laowu,daxing,nan,18
10002,laowang,fanshan,nv,48
10003,laozhang,daxing,nan,8
10004,laoxu,daxing,nv,18
10005,xiaowu,daxing,nan,28
10006,xiaowang,fanshan,nv,18
10007,xiaozhang,daxing,nv,18
10008,xiaoxu,daxing,nan,18

load data local inpath '/opt/data/tb_students2.txt' into table tb_students2;

Create the partitioned table, partitioned by age:
create table if not exists part_students2(
sno int,
sname string,
saddress string,
sex string
)
partitioned by(sage int)
row format delimited fields terminated by ',';

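Dynamic partitioning: partition by age dynamically
With dynamic partitioning, data can only be written to the partitioned table from a query result set:

To use dynamic partitioning, it must be enabled, and because every partition column here is dynamic, the partition mode must be set to nonstrict!
1. Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;
2. Set the partition mode to nonstrict:
set hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column (sage) must come last in the select list.
insert into table part_students2 partition(sage)
select sno,sname,saddress,sex,sage from tb_students2;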

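Summary: inserting into a partitioned table as shown above is dynamic partitioning.
Dynamic partitioning: use it when the number of partitions is not known at insert time but is not especially large.
With dynamic partitioning, the partition column values are not fixed in the statement; they are taken from the inserted data.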
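
Because a dynamic insert can create many partitions at once, Hive caps how many a single statement may create. The sketch below shows the two related settings with what are, to my knowledge, their usual default values (treat the numbers as assumptions and check your installation), then lists the partitions produced by the insert above.

```sql
-- Upper bound on dynamic partitions created by one statement (assumed default).
set hive.exec.max.dynamic.partitions=1000;
-- Upper bound per mapper/reducer node (assumed default).
set hive.exec.max.dynamic.partitions.pernode=100;

-- One partition per distinct sage value in tb_students2.
show partitions part_students2;
```

With the sample data, the distinct ages are 8, 18, 28 and 48, so four partitions should be listed.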

**************Mixed partitioning****************
Case 3: Partition user records by country and city
Create the user table:
create table if not exists users(
ucard int,
uname string,
contry string,
city string
)
row format delimited fields terminated by '\t';

Load the data:
load data local inpath '/opt/data/city.txt' into table users;

Create a two-level partitioned table:
create table if not exists part_users(
ucard bigint,
uname string
)
partitioned by(contry string,city string)
row format delimited fields terminated by '\t';

Mixed partitioning: static and dynamic partition columns are combined
insert into table part_users partition(contry='USA',city)
select ucard,uname,city from users where contry='USA';

insert into table part_users partition(contry='CH',city)
select ucard,uname,city from users where contry='CH';

insert into table part_users partition(contry='UK',city)
select ucard,uname,city from users where contry='UK';

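Note on mixed partitioning: the leading (parent) partition column must be static; the sub-partition columns that follow it may be dynamic.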
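
When the list of countries is not known in advance, both partition columns can be left dynamic instead of writing one statement per country. This is a sketch of that alternative, not something the original post does; it assumes the users and part_users tables above and requires nonstrict mode because no static partition value is given.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Dynamic partition columns go last, in the same order as in partitioned by(...).
insert into table part_users partition(contry, city)
select ucard, uname, contry, city from users;

-- List only the sub-partitions under one country.
show partitions part_users partition(contry='USA');
```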


Case 4: The data has already landed in a directory on HDFS. How do you create a Hive table to manage it?
Simulate the landed data:
mkdir /opt/data/source
cd /opt/data/source
mkdir -p 2016/01/01
mkdir -p 2016/01/02
mkdir -p 2016/01/03
mkdir -p 2016/01/04
mkdir -p 2016/02/04
mkdir -p 2016/02/03
mkdir -p 2016/02/02
mkdir -p 2016/02/01
mkdir -p 2016/03/01
mkdir -p 2016/03/02
mkdir -p 2016/03/03
mkdir -p 2016/03/04

cd /opt/data
cp flow.log ./source/2016/01/01/
cp flow.log ./source/2016/01/02/
cp flow.log ./source/2016/01/03/
cp flow.log ./source/2016/03/01/
cp flow.log ./source/2016/02/01/

Upload the simulated data to HDFS:

hadoop fs -put ./source /

Create an external partitioned table whose location points at the data directory:
create external table if not exists part_flow(
id string,
phonenumber bigint,
mac string,
ip string,
url string,
tiele string,
colum1 string,
colum2 string,
colum3 string,
upflow int,
downflow int
)
partitioned by(year int,month int,day int)
row format delimited fields terminated by '\t'
location '/source';

Register partitions for the existing source data:
alter table part_flow add partition(year=2016,month=01,day=01)
location 'hdfs:///source/2016/01/01';

alter table part_flow add partition(year=2016,month=03,day=01)
location 'hdfs:///source/2016/03/01/';

alter table part_flow add partition(year=2016,month=01,day=03)
location 'hdfs:///source/2016/01/03/';
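
The directories here are named 2016/01/01 rather than year=2016/month=01/day=01, so msck repair table cannot discover them automatically and each partition has to be registered with an explicit location. As a sketch, the two remaining directories that received a copy of flow.log earlier (2016/01/02 and 2016/02/01) can be added in a single statement and then queried through the partition columns:

```sql
-- Several partitions, each with its own location, in one statement.
alter table part_flow add if not exists
partition(year=2016,month=01,day=02) location 'hdfs:///source/2016/01/02'
partition(year=2016,month=02,day=01) location 'hdfs:///source/2016/02/01';

show partitions part_flow;

-- Reads only the directories registered for January 2016.
select count(*) from part_flow where year=2016 and month=1;
```

Because part_flow is an external table, dropping it or one of its partitions later removes only the metadata; the files under /source stay in place.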
