Big Data: Hive Bucketed Queries and Compression Formats

Posted 2022-12-17 jeff190812


I. Bucketing and sampling queries

1. Creating a bucketed table

---------------------------------------

hive (db_test)> create table stu_buck(id int,name string)
> clustered by(id)
> into 4 buckets
> row format delimited fields terminated by '\t';
OK
Time taken: 0.369 seconds

------------------------------------------------------------------------

hive (db_test)> desc formatted stu_buck;
OK
col_name data_type comment
# col_name data_type comment

id int
name string

# Detailed Table Information
Database: db_test
Owner: root
CreateTime: Thu Oct 03 12:14:15 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://mycluster/db_test.db/stu_buck
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1570076055

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: 4
Bucket Columns: [id]
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.121 seconds, Fetched: 28 row(s)

------------------------------------------------------------------

2. Loading data into the bucketed table

2.1 First, create a regular (non-bucketed) table

------------------------------------------------------------------

hive (db_test)> create table stu_comm(id int,name string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.181 seconds

---------------------------------------------------------------------

2.2 Load local data into the regular table

-------------------------------------------------------------------------

hive (db_test)> load data local inpath '/root/hivetest/stu_buck' into table stu_comm;
Loading data to table db_test.stu_comm
Table db_test.stu_comm stats: [numFiles=1, totalSize=501]
OK
Time taken: 0.654 seconds

hive (db_test)> select * from stu_comm;
OK
stu_comm.id stu_comm.name
1001 张三
1002 李四
1003 王五
1004 赵六
1005 李琪
1006 赵云
1007 黄月英
1008 诸葛亮
1009 司马懿
1010 张飞
1011 关羽
1012 刘备
1013 曹操
1014 曹植
1015 曹丕
1016 嬴政
1017 韩信
1018 孙权
1019 孙尚香
1020 孙斌
1021 大桥
1022 小乔
1023 鲁班
1024 干将
1025 白起
1026 李白
1027 李信
1028 墨菲特
1029 易
1030 亚瑟
1031 安其拉
1032 妲己
1033 吕布
1034 张苞
1035 鲁肃
1036 董卓
1037 马谡
1038 夏侯惇
1039 夏侯渊
1040 黄忠
Time taken: 0.081 seconds, Fetched: 40 row(s)

-----------------------------------------------------------------------------

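2.3 Set Hive's bucketing properties. Bucketing splits the data itself, so the insert needs several reduce tasks, one per bucket (hive.enforce.bucketing applies to Hive 1.x; it was removed in Hive 2.0, where bucketing is always enforced)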

-------------------------------------------------------------------

hive (db_test)> set hive.enforce.bucketing=true;
hive (db_test)> set mapreduce.job.reduces=4;
hive (db_test)> set hive.enforce.bucketing;
hive.enforce.bucketing=true
hive (db_test)> set mapreduce.job.reduces;
mapreduce.job.reduces=4

---------------------------------------------------------

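2.4 Loading data into a bucketed table with load does not split it into bucket files. Load it with an insert statement instead, so that a MapReduce job distributes the rows across the files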

----------------------------------------------------------

hive (db_test)> insert into table stu_buck select id,name from stu_comm;
Query ID = root_20191003122918_48ead4a6-8f19-4f0a-8298-6a57b467bf47
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1570075776894_0001, Tracking URL = http://bigdata112:8088/proxy/application_1570075776894_0001/
Kill Command = /opt/module/hadoop-2.8.4/bin/hadoop job -kill job_1570075776894_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2019-10-03 12:29:29,888 Stage-1 map = 0%, reduce = 0%
2019-10-03 12:29:38,338 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.7 sec
2019-10-03 12:29:45,839 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 3.17 sec
2019-10-03 12:29:47,932 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 4.74 sec
2019-10-03 12:29:52,077 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.54 sec
MapReduce Total cumulative CPU time: 6 seconds 540 msec
Ended Job = job_1570075776894_0001
Loading data to table db_test.stu_buck
Table db_test.stu_buck stats: [numFiles=4, numRows=40, totalSize=501, rawDataSize=461]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 6.54 sec HDFS Read: 15129 HDFS Write: 793 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 540 msec
OK
id name
Time taken: 34.6 seconds


=================================================

3. Sampling queries on the bucketed table

3.1 Inspect the contents of each of the 4 bucket files

-------------------------------------------------------------------------------------

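// Contents of the first file
// Note: in this command and the three that follow, the trailing http://192.168.1.121:50070/ argument is not a valid file system path. That is why cat reports "No FileSystem for scheme: http" and exits with code 1, although the file contents are still printed. Omit that argument to avoid the error.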

hive (db_test)> dfs -cat /db_test.db/stu_buck/000000_0 http://192.168.1.121:50070/;
cat: No FileSystem for scheme: http
1040 黄忠
1036 董卓
1032 妲己
1028 墨菲特
1024 干将
1020 孙斌
1016 嬴政
1012 刘备
1008 诸葛亮
1004 赵六
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null

=======================================

// Contents of the second file

hive (db_test)> dfs -cat /db_test.db/stu_buck/000001_0 http://192.168.1.121:50070/;
cat: No FileSystem for scheme: http
1005 李琪
1029 易
1037 马谡
1017 韩信
1001 张三
1033 吕布
1009 司马懿
1013 曹操
1025 白起
1021 大桥
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null

===========================================

// Contents of the third file

hive (db_test)> dfs -cat /db_test.db/stu_buck/000002_0 http://192.168.1.121:50070/;
cat: No FileSystem for scheme: http
1010 张飞
1038 夏侯惇
1022 小乔
1034 张苞
1002 李四
1026 李白
1018 孙权
1030 亚瑟
1014 曹植
1006 赵云
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null

==============================================

// Contents of the fourth file

hive (db_test)> dfs -cat /db_test.db/stu_buck/000003_0 http://192.168.1.121:50070/;
cat: No FileSystem for scheme: http
1015 曹丕
1007 黄月英
1027 李信
1023 鲁班
1019 孙尚香
1003 王五
1011 关羽
1039 夏侯渊
1035 鲁肃
1031 安其拉
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null
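
The four listings above are consistent with each row being placed in the bucket given by the hash of its id modulo the number of buckets: every id in 000000_0 is divisible by 4, every id in 000001_0 leaves remainder 1, and so on. Below is a minimal Python sketch of that rule. It assumes that the hash of an int column is the int value itself and that the sign bit is masked off before the remainder is taken; the function name bucket_for is made up for illustration and is not part of Hive.

```python
NUM_BUCKETS = 4  # matches "into 4 buckets" in the table definition


def bucket_for(id_value, num_buckets=NUM_BUCKETS):
    """Return the zero-based bucket index for an int bucketing column.

    Assumption: the hash of an int is the int itself, and the sign bit is
    masked off before taking the remainder.
    """
    return (id_value & 0x7FFFFFFF) % num_buckets


# Group the 40 sample ids (1001..1040) by bucket index.
buckets = {b: [] for b in range(NUM_BUCKETS)}
for student_id in range(1001, 1041):
    buckets[bucket_for(student_id)].append(student_id)

for index, ids in buckets.items():
    # Bucket index 0 corresponds to file 000000_0, index 1 to 000001_0, etc.
    print(f"00000{index}_0: {ids}")
```

The sketch prints the ids of each bucket in ascending order; the order of rows inside the real files depends on how the job shuffled them, so only the grouping is comparable.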

====================================================

3.2 Sample two of the four buckets from the bucketed table

---------------------------------------------------------------------

// Returns the contents of files 1 and 3

hive (db_test)> select * from stu_buck tablesample(bucket 1 out of 2 on id);
OK
stu_buck.id stu_buck.name
1040 黄忠
1036 董卓
1032 妲己
1028 墨菲特
1024 干将
1020 孙斌
1016 嬴政
1012 刘备
1008 诸葛亮
1004 赵六
1010 张飞
1038 夏侯惇
1022 小乔
1034 张苞
1002 李四
1026 李白
1018 孙权
1030 亚瑟
1014 曹植
1006 赵云
Time taken: 0.077 seconds, Fetched: 20 row(s)

----------------------------------------------------------------------

// Returns the contents of files 2 and 4

hive (db_test)> select * from stu_buck tablesample(bucket 2 out of 2 on id);
OK
stu_buck.id stu_buck.name
1005 李琪
1029 易
1037 马谡
1017 韩信
1001 张三
1033 吕布
1009 司马懿
1013 曹操
1025 白起
1021 大桥
1015 曹丕
1007 黄月英
1027 李信
1023 鲁班
1019 孙尚香
1003 王五
1011 关羽
1039 夏侯渊
1035 鲁肃
1031 安其拉
Time taken: 0.097 seconds, Fetched: 20 row(s)

---------------------------------------------------------------------

// bucket 4 out of 2 on id: the first number (x) must not be greater than the second (y)

hive (db_test)> select * from stu_buck tablesample(bucket 4 out of 2 on id);
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
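
The three queries above suggest how tablesample(bucket x out of y on id) picks buckets on a table with 4 buckets: it takes 4/y buckets, starting at bucket x and stepping by y. The Python sketch below reproduces that selection. It is an illustration only, under the assumption that y divides the number of buckets evenly (other cases are not covered here); the function name sampled_buckets is hypothetical.

```python
def sampled_buckets(x, y, num_buckets=4):
    """Return the 1-based bucket numbers read by 'bucket x out of y'.

    Assumption: y evenly divides num_buckets, as in the examples above.
    """
    if x < 1 or y < 1:
        raise ValueError("x and y must be positive")
    if x > y:
        # Mirrors the SemanticException shown in the transcript.
        raise ValueError("Numerator should not be bigger than denominator")
    if num_buckets % y != 0:
        raise ValueError("this sketch only covers y dividing num_buckets")
    # A bucket is selected when its zero-based index is congruent to x-1 mod y.
    return [b for b in range(1, num_buckets + 1) if (b - 1) % y == x - 1]


print(sampled_buckets(1, 2))  # buckets behind files 000000_0 and 000002_0
print(sampled_buckets(2, 2))  # buckets behind files 000001_0 and 000003_0
```

Calling sampled_buckets(4, 2) raises ValueError, matching the failed query above.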

----------------------------------------------------------------------

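II. Hive compression formats and whether they are splittable

Compression format	Tool	Algorithm	File extension	Splittable
DEFLATE	none	DEFLATE	.deflate	No
Gzip	gzip	DEFLATE	.gz	No
bzip2	bzip2	bzip2	.bz2	Yes
LZO	lzop	LZO	.lzo	Yes (when indexed)
Snappy	none	Snappy	.snappy	No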

To support multiple compression/decompression algorithms, Hadoop provides codecs (encoder/decoder classes):

Compression format	Codec class
DEFLATE	org.apache.hadoop.io.compress.DefaultCodec
gzip	org.apache.hadoop.io.compress.GzipCodec
bzip2	org.apache.hadoop.io.compress.BZip2Codec
LZO	com.hadoop.compression.lzo.LzopCodec
Snappy	org.apache.hadoop.io.compress.SnappyCodec
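
The two tables above can be combined into a small lookup keyed by file extension. The Python sketch below is only a convenience for reading the tables, not a Hadoop API; the names CODECS and describe are invented for this example, and the splittable flags are copied from the first table.

```python
import os

# extension -> (format name, codec class, splittable according to the table)
CODECS = {
    ".deflate": ("DEFLATE", "org.apache.hadoop.io.compress.DefaultCodec", False),
    ".gz": ("gzip", "org.apache.hadoop.io.compress.GzipCodec", False),
    ".bz2": ("bzip2", "org.apache.hadoop.io.compress.BZip2Codec", True),
    # LZO is listed as splittable; in practice this requires an index file.
    ".lzo": ("LZO", "com.hadoop.compression.lzo.LzopCodec", True),
    ".snappy": ("Snappy", "org.apache.hadoop.io.compress.SnappyCodec", False),
}


def describe(path):
    """Return (format, codec class, splittable) for a file name, or None."""
    extension = os.path.splitext(path)[1].lower()
    return CODECS.get(extension)


print(describe("part-00000.bz2"))
print(describe("part-00000.txt"))  # no known compression extension
```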

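1. Enabling compression of map output (these settings apply only to the current session; to make them permanent, set them in the configuration file)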

------------------------------------------------------------------

1) Enable compression of Hive's intermediate data (default: false)

hive (default)>set hive.exec.compress.intermediate=true;

2) Enable compression of map output in MapReduce (default: false)

hive (default)>set mapreduce.map.output.compress=true;

3) Set the codec used to compress map output in MapReduce

hive (default)>set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
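
To make the three map-side settings permanent they have to go into a configuration file as property entries. As a rough sketch, the Python snippet below renders them in the name/value property layout used by Hadoop and Hive XML configuration files. Which file they belong in (for example hive-site.xml) depends on the deployment and is an assumption here, and the helper name to_properties_xml is made up for this example.

```python
from xml.sax.saxutils import escape

# The three session settings from this section, as property name -> value.
MAP_SIDE_SETTINGS = {
    "hive.exec.compress.intermediate": "true",
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
}


def to_properties_xml(settings):
    """Render settings as a <configuration> block of <property> entries."""
    lines = ["<configuration>"]
    for name, value in settings.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


print(to_properties_xml(MAP_SIDE_SETTINGS))
```

The printed entries would be merged into an existing configuration file rather than replacing it.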

----------------------------------------------------------------

2. Enabling compression of reduce output

---------------------------------------------------------

1) Enable compression of Hive's final output (default: false)

hive (default)>set hive.exec.compress.output=true;

2) Enable compression of MapReduce's final output (default: false)

hive (default)>set mapreduce.output.fileoutputformat.compress=true;

3) Set the codec used to compress MapReduce's final output

hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4) Set the compression type of MapReduce's final output to block compression

hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;

===================================================================
