Does AWS Spectrum really need = in the S3 location to understand it as Hive format?

Posted 2023-03-30

Asked 2021-03-02 15:09:21

I ran some tests with Spectrum.

I created two AWS Glue crawlers.

The first, named hive-tst, scans:

s3://hive-test/type='a'/year='2021'/month='01'
s3://hive-test/type='b'/year='2021'/month='01'
s3://hive-test/type='c'/year='2021'/month='01'
s3://hive-test/type='d'/year='2021'/month='01'
s3://hive-test/type='e'/year='2021'/month='01'

The second scans:

s3://non-hive-test/a/2021/01
s3://non-hive-test/b/2021/01
s3://non-hive-test/c/2021/01
s3://non-hive-test/d/2021/01
s3://non-hive-test/e/2021/01

Each partition in each bucket holds two Parquet files of 50 MB each.

Then I ran a test query against the first partition column of each Spectrum table:

select distinct event from test.hive_tst;

It took 8 s 272 ms.

select distinct partition_0 from test.nonhive_tst;

It took 8 s 66 ms.

So adding = does not seem to improve performance. I also checked that both tables report Hive format for their partitions:

select *
from svv_external_partitions
where schemaname='test'
and tablename='hive_tst';

values	location	input_format	output_format	serialization_lib
["a","2021","01"]	s3://hive-test/event=a/year=2021/month=01/	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

select *
from svv_external_partitions
where schemaname='test'
and tablename='nonhive_tst';

values	location	input_format	output_format	serialization_lib
["a","2021","01"]	s3://hive-test/a/2021/01/	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
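To make the difference between the two layouts concrete, here is a minimal Python sketch of how path segments could map to partition columns. It is only an illustration, not the Glue crawler's actual implementation: the function name partition_columns is hypothetical, and the partition_N fallback simply mimics the partition_0 column name seen in the nonhive_tst query above.

```python
from urllib.parse import urlparse


def partition_columns(s3_uri):
    """Map the path segments of an S3 prefix to partition names and values.

    Hypothetical helper for illustration only; it mimics the naming seen
    in the tables above and is not the Glue crawler's real logic.
    """
    path = urlparse(s3_uri).path
    segments = [segment for segment in path.split("/") if segment]
    columns = {}
    for index, segment in enumerate(segments):
        if "=" in segment:
            # Hive-style segment: the key before '=' becomes the column name.
            key, value = segment.split("=", 1)
            columns[key] = value.strip("'")
        else:
            # Positional segment: fall back to a generated column name.
            columns[f"partition_{index}"] = segment
    return columns


hive_style = partition_columns("s3://hive-test/event=a/year=2021/month=01/")
positional = partition_columns("s3://non-hive-test/a/2021/01/")
print(hive_style)
print(positional)
```

In this sketch both layouts yield the same three partition values; only the column names differ (event, year, month versus partition_0, partition_1, partition_2).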