hive的几种文件格式

Posted 2023-03-21

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了hive的几种文件格式相关的知识，希望对你有一定的参考价值。

hive文件存储格式包括以下几类：

1、TEXTFILE

2、SEQUENCEFILE

3、RCFILE

4、ORCFILE(0.11以后出现)

其中TEXTFILE为默认格式，建表时不指定默认为这个格式，导入数据时会直接把数据文件拷贝到hdfs上不进行处理；

SEQUENCEFILE，RCFILE，ORCFILE格式的表不能直接从本地文件导入数据，数据要先导入到textfile格式的表中，然后再从表中用insert导入SequenceFile,RCFile,ORCFile表中。

前提创建环境：

hive 0.8

创建一张testfile_table表，格式为textfile。

create table if not exists testfile_table( site string, url string, pv bigint, label string) row format delimited fields terminated by '\\t' stored as textfile;

load data local inpath '/app/weibo.txt' overwrite into table textfile_table;

一、TEXTFILE
默认格式，数据不做压缩，磁盘开销大，数据解析开销大。
可结合Gzip、Bzip2使用(系统自动检查，执行查询时自动解压)，但使用这种方式，hive不会对数据进行切分，
从而无法对数据进行并行操作。
示例：

create table if not exists textfile_table(
site string,
url string,
pv bigint,
label string)
row format delimited
fields terminated by '\\t'stored as textfile;
插入数据操作：set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite table textfile_table select * from textfile_table;

二、SEQUENCEFILE
SequenceFile是Hadoop API提供的一种二进制文件支持，其具有使用方便、可分割、可压缩的特点。
SequenceFile支持三种压缩选择：NONE，RECORD，BLOCK。Record压缩率低，一般建议使用BLOCK压缩。
示例：

create table if not exists seqfile_table(
site string,
url string,
pv bigint,
label string)
row format delimited
fields terminated by '\\t'stored as sequencefile;
插入数据操作：set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;insert overwrite table seqfile_table select * from textfile_table;

三、RCFILE
RCFILE是一种行列存储相结合的存储方式。首先，其将数据按行分块，保证同一个record在一个块上，避免读一个记录需要读取多个block。其次，块数据列式存储，有利于数据压缩和快速的列存取。
RCFILE文件示例：

create table if not exists rcfile_table(
site string,
url string,
pv bigint,
label string)
row format delimited
fields terminated by '\\t'stored as rcfile;
插入数据操作：set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite table rcfile_table select * from textfile_table;

四、ORCFILE()
五、再看TEXTFILE、SEQUENCEFILE、RCFILE三种文件的存储情况：

[hadoop@node3 ~]$ hadoop dfs -dus /user/hive/warehouse/*hdfs://node1:19000/user/hive/warehouse/hbase_table_1 0
hdfs://node1:19000/user/hive/warehouse/hbase_table_2 0
hdfs://node1:19000/user/hive/warehouse/orcfile_table 0
hdfs://node1:19000/user/hive/warehouse/rcfile_table 102638073
hdfs://node1:19000/user/hive/warehouse/seqfile_table 112497695
hdfs://node1:19000/user/hive/warehouse/testfile_table 536799616
hdfs://node1:19000/user/hive/warehouse/textfile_table 107308067
[hadoop@node3 ~]$ hadoop dfs -ls /user/hive/warehouse/*/-rw-r--r-- 2 hadoop supergroup 51328177 2014-03-20 00:42 /user/hive/warehouse/rcfile_table/000000_0-rw-r--r-- 2 hadoop supergroup 51309896 2014-03-20 00:43 /user/hive/warehouse/rcfile_table/000001_0-rw-r--r-- 2 hadoop supergroup 56263711 2014-03-20 01:20 /user/hive/warehouse/seqfile_table/000000_0-rw-r--r-- 2 hadoop supergroup 56233984 2014-03-20 01:21 /user/hive/warehouse/seqfile_table/000001_0-rw-r--r-- 2 hadoop supergroup 536799616 2014-03-19 23:15 /user/hive/warehouse/testfile_table/weibo.txt-rw-r--r-- 2 hadoop supergroup 53659758 2014-03-19 23:24 /user/hive/warehouse/textfile_table/000000_0.gz-rw-r--r-- 2 hadoop supergroup 53648309 2014-03-19 23:26 /user/hive/warehouse/textfile_table/000001_1.gz

总结:
相比TEXTFILE和SEQUENCEFILE，RCFILE由于列式存储方式，数据加载时性能消耗较大，但是具有较好的压缩比和查询响应。数据仓库的特点是一次写入、多次读取，因此，整体来看，RCFILE相比其余两种格式具有较明显的优势。

参考技术A hive支持的存储格式：
　　hive支持的存储格式包括TextFile、SequenceFile、RCFile、Avro Files、ORC Files、Parquet。
TextFile：
　　Hive默认格式，数据不做压缩，磁盘开销大，数据解析开销大。
可结合Gzip、Bzip2、Snappy等使用（系统自动检查，执行查询时自动解压），但使用这种方式，hive不会对数据进行切分，从而无法对数据进行并行操作。
SequenceFile：
　　SequenceFile是Hadoop API 提供的一种二进制文件，它将数据以的形式序列化到文件中。这种二进制文件内部使用Hadoop 的标准的Writable 接口实现序列化和反序列化。它与Hadoop API中的MapFile 是互相兼容的。Hive 中的SequenceFile 继承自Hadoop API 的SequenceFile，不过它的key为空，使用value 存放实际的值，这样是为了避免MR 在运行map 阶段的排序过程。
SequenceFile的文件结构图：

Header通用头文件格式：
SEQ 3BYTE
Nun 1byte数字
keyClassName
ValueClassName
compression （boolean）指明了在文件中是否启用压缩
blockCompression （boolean，指明是否是block压缩）
compression codec
Metadata 文件元数据
Sync 头文件结束标志
Block-Compressed SequenceFile格式

　　RCFile
RCFile是Hive推出的一种专门面向列的数据格式。它遵循“先按列划分，再垂直划分”的设计理念。当查询过程中，针对它并不关心的列时，它会在IO上跳过这些列。需要说明的是，RCFile在map阶段从远端拷贝仍然是拷贝整个数据块，并且拷贝到本地目录后RCFile并不是真正直接跳过不需要的列，并跳到需要读取的列，而是通过扫描每一个row group的头部定义来实现的，但是在整个HDFS Block 级别的头部并没有定义每个列从哪个row group起始到哪个row group结束。所以在读取所有列的情况下，RCFile的性能反而没有SequenceFile高。

以上是关于hive的几种文件格式的主要内容，如果未能解决你的问题，请参考以下文章

hive 存储格式对比

Java 修改编码格式的几种方式

c#如何获取某目录下的几种格式的图片文件。Directory.GetFiles

Java 修改编码格式的几种方式

基于 Hive 的文件格式：RCFile 简介及其应用

PHPWordPHPWord导出PDF格式文件的几种方式以及最优解并附代码