Hive data size in different storage formats, with no duplicate data
Posted by chenzechao
-- Key point: the target tables contain no duplicate rows
-- dbName.num_result contains no duplicate records

-- Insert data
CREATE TABLE dbName.test_textfile(
  `key` string,
  `value` string,
  `p_key` string,
  `p_key2` string)
STORED AS textfile;
insert overwrite table dbName.test_textfile
select * from dbName.num_result where p_key='9' and p_key2='0';

drop table dbName.test_orcfile;
CREATE TABLE dbName.test_orcfile(
  `key` string,
  `value` string,
  `p_key` string,
  `p_key2` string)
STORED AS orc;
insert overwrite table dbName.test_orcfile select * from dbName.test_textfile;

CREATE TABLE dbName.test_rcfile(
  `key` string,
  `value` string,
  `p_key` string,
  `p_key2` string)
STORED AS rcfile;
insert overwrite table dbName.test_rcfile select * from dbName.test_textfile;

CREATE TABLE dbName.test_parquet(
  `key` string,
  `value` string,
  `p_key` string,
  `p_key2` string)
STORED AS parquet;
insert overwrite table dbName.test_parquet select * from dbName.test_textfile;

-- Count rows
select count(1) as cnt from dbName.test_textfile;
select count(1) as cnt from dbName.test_orcfile;
select count(1) as cnt from dbName.test_rcfile;
select count(1) as cnt from dbName.test_parquet;

-- Measure file sizes
dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_text*;
dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_par*;
dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_rc*;
dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_orc*;
1.0 G    3.1 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_textfile
1.1 G    3.3 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_parquet
984.0 M  2.9 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_rcfile
470.0 M  1.4 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_orcfile
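The results show that when the data contains no duplicates, Parquet's compression has nothing to work with: the Parquet table is about 5% larger than the textfile table. RCFile is about 7% smaller than textfile, and ORC compresses best, at roughly 44% of the textfile size. The tables were created without explicit compression properties, so each format is compared at its default settings.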
hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_text*;
1110741501  3332224503  hdfs://nameNode/user/hive/warehouse/dbName.db/test_textfile
hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_par*;
1167366639  3502099917  hdfs://nameNode/user/hive/warehouse/dbName.db/test_parquet
hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_rc*;
1031774688  3095324064  hdfs://nameNode/user/hive/warehouse/dbName.db/test_rcfile
hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_orc*;
492795434   1478386302  hdfs://nameNode/user/hive/warehouse/dbName.db/test_orcfile
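To make the comparison concrete, the byte counts above can be turned into ratios against the textfile table. The following is a minimal Python sketch, not part of the original test: the names `sizes`, `relative_sizes`, and `replication_factors` are illustrative, and it assumes the first column of `dfs -du -s` is the logical size and the second is the space consumed across all replicas.

```python
# Byte counts copied from the `dfs -du -s` output above.
# Assumption: (logical size, space consumed including replicas).
sizes = {
    "textfile": (1110741501, 3332224503),
    "parquet": (1167366639, 3502099917),
    "rcfile": (1031774688, 3095324064),
    "orcfile": (492795434, 1478386302),
}


def relative_sizes(sizes, baseline="textfile"):
    """Return each table's logical size as a fraction of the baseline's."""
    base = sizes[baseline][0]
    return {name: logical / base for name, (logical, _) in sizes.items()}


def replication_factors(sizes):
    """Return consumed space divided by logical size for each table."""
    return {name: consumed / logical for name, (logical, consumed) in sizes.items()}


if __name__ == "__main__":
    ratios = relative_sizes(sizes)
    # Print smallest first so the most compact format is at the top.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:<9} {ratio:6.1%} of textfile")
    print(replication_factors(sizes))
```

Under that reading, the second column is exactly three times the first for every table, which would be consistent with a replication factor of 3, so the format ranking is the same whichever column is compared.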