Data Compression in Hive
Posted by 数据咖啡
1. Data file storage formats
A brief overview of the storage formats Hive supports:
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE -- (Note: Available in Hive 0.6.0 and later)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
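Storage formats fall into two groups: row-oriented and column-oriented.
(1) ORCFile (Optimized Row Columnar File): supported by Hive, Shark, and Spark. ORCFile is a good fit for tables with many columns.
(2) Parquet (open-sourced by Twitter and Cloudera; supported by Hive, Spark, Drill, Impala, Pig, and others). Parquet is more complex and draws mainly on Dremel. Its main strengths are support for nested data structures and a rich set of efficient encoding algorithms that adapt compression to different value distributions.
The examples below load the same data into each format and compare the size on disk.
(1) Stored as TEXTFILE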
create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;
load data local inpath '/opt/datas/page_views.data' into table page_views ;
dfs -du -h /user/hive/warehouse/page_views/ ;
18.1 M /user/hive/warehouse/page_views/page_views.data
(2) Stored as ORC
create table page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc ;
insert into table page_views_orc select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc/ ;
2.6 M /user/hive/warehouse/page_views_orc/000000_0
(3) Stored as Parquet
create table page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET ;
insert into table page_views_parquet select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet/ ;
13.1 M /user/hive/warehouse/page_views_parquet/000000_0
(4) Stored as ORC with Snappy compression
create table page_views_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
insert into table page_views_orc_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_snappy/ ;
3.8 M /user/hive/warehouse/page_views_orc_snappy/000000_0
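Why is this larger than the plain ORC table (3.8 M versus 2.6 M)? Because ORC compresses with ZLIB by default, and ZLIB produces smaller files than Snappy.
(5) Stored as ORC without compression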
create table page_views_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");
insert into table page_views_orc_none select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_none/ ;
7.6 M /user/hive/warehouse/page_views_orc_none/000000_0
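The three ORC tables above share one schema and differ only in the `orc.compress` table property. For readers who want to repeat the comparison, the following Python sketch renders the DDL for each codec. The helper name `orc_ddl` and the `page_views_orc_zlib` table name are hypothetical (the article's default-codec table is simply `page_views_orc`); the codec values NONE, ZLIB, and SNAPPY are the ones ORC accepts for this property.

```python
# Columns shared by every page_views table in this article.
COLUMNS = ["track_time", "url", "session_id", "referer",
           "ip", "end_user_id", "city_id"]


def orc_ddl(codec: str) -> str:
    """Render CREATE TABLE DDL for an ORC table with the given orc.compress codec.

    The table-name suffix is derived from the codec; this naming is an
    assumption made for illustration, not something Hive requires.
    """
    cols = ",\n".join(f"{name} string" for name in COLUMNS)
    return (
        f"create table page_views_orc_{codec.lower()}(\n"
        f"{cols}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        f'STORED AS orc tblproperties ("orc.compress"="{codec.upper()}");'
    )


if __name__ == "__main__":
    for codec in ("ZLIB", "SNAPPY", "NONE"):
        print(orc_ddl(codec))
        print()
```

Each generated statement can be pasted into the Hive CLI and followed by the same `insert into ... select * from page_views` and `dfs -du -h` steps used above.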
(6) Stored as Parquet with Snappy compression
set parquet.compression=SNAPPY ;
create table page_views_parquet_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;
insert into table page_views_parquet_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet_snappy/ ;
2.7 M /user/hive/warehouse/page_views_parquet_snappy/000000_0
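To make the six measurements easier to compare, the sketch below expresses each size as a fraction of the TEXTFILE size. The numbers are copied from the `dfs -du -h` output above; the dictionary labels and the function name `ratio_to_text` are only illustrative.

```python
# Sizes in MB as reported by `dfs -du -h` in the runs above.
SIZES_MB = {
    "TEXTFILE": 18.1,
    "ORC (default codec)": 2.6,
    "PARQUET (default codec)": 13.1,
    "ORC + SNAPPY": 3.8,
    "ORC + NONE": 7.6,
    "PARQUET + SNAPPY": 2.7,
}


def ratio_to_text(sizes):
    """Return each format's size as a fraction of the TEXTFILE size."""
    base = sizes["TEXTFILE"]
    return {name: size / base for name, size in sizes.items()}


ratios = ratio_to_text(SIZES_MB)

if __name__ == "__main__":
    # Print smallest first so the most compact layouts are at the top.
    for name, r in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:<24} {SIZES_MB[name]:>5.1f} M  {r:6.1%} of text")
```

On this dataset the default ORC table and the Snappy-compressed Parquet table are both roughly one seventh the size of the text file. These ratios describe this one 18.1 M sample only and will vary with the data.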
In real projects, Hive tables are usually stored as ORC or Parquet, and Snappy is the usual choice of compression codec.
Reposted from https://blog.csdn.net/gongxifacai_believe/article/details/80833480