Athena and S3 Inventory. HIVE_BAD_DATA: Field size's type LONG in ORC is incompatible with type varchar defined in table schema

Posted 2023-02-15


【Posted】: 2018-06-25 07:27:05 【Question】:

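I'm trying to understand how to work with S3 inventory. I'm following this tutorial

After loading the inventory lists into my table, I tried to query it and found two issues.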

1) SELECT key, size FROM table; The size column shows the same magic number for every record: 4923069104295859283

2) select * from table; Query ID: cf07c309-c685-4bf4-9705-8bca69b00b3c.

I receive this error:

HIVE_BAD_DATA: Field size's type LONG in ORC is incompatible with type varchar defined in table schema
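The constant in item 1 is not arbitrary. As a quick check, the sketch below reinterprets the eight ASCII bytes of the storage class name `STANDARD` as a little-endian 64-bit integer and arrives at exactly that value. This is consistent with a string column in the ORC file being read through the table's bigint `size` column. That the column involved is `storage_class` is an inference from the value, not something the question states, and the helper name is hypothetical.

```python
import struct

def string_bytes_as_int64(text: str) -> int:
    """Reinterpret exactly eight ASCII bytes as a little-endian signed 64-bit integer."""
    raw = text.encode("ascii")
    if len(raw) != 8:
        raise ValueError("need exactly 8 bytes")
    return struct.unpack("<q", raw)[0]

magic = string_bytes_as_int64("STANDARD")
print(magic)  # 4923069104295859283
```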

Here is my table schema:

CREATE EXTERNAL TABLE `table`(
`bucket` string, 
`key` string, 
`version_id` string, 
`is_latest` boolean, 
`is_delete_marker` boolean, 
`size` bigint, 
`last_modified_date` timestamp, 
`e_tag` string, 
`storage_class` string)
PARTITIONED BY ( 
`dt` string)
ROW FORMAT SERDE 
'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://......../hive'
TBLPROPERTIES (
'transient_lastDdlTime'='1516093603')

【Comments】:

Same problem here (with a boolean field); I get exactly the same error.

【Answer 1】:

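The mistake in our case was using "Current version only" in the inventory configuration. The difference between the 'current' and 'all' versions configurations is that the former lacks these columns: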

version_id, is_latest, is_delete_marker

Sample hive output for the ORC format:

> hive --orcfiledump ./inventoryexamplefilewithcurrentversiononly.orc
Type: struct<bucket:string,key:string,size:bigint,last_modified_date:timestamp,e_tag:string,storage_class:string,is_multipart_uploaded:boolean,replication_status:string,encryption_status:string,object_lock_retain_until_date:timestamp,object_lock_mode:string,object_lock_legal_hold_status:string>
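To see why a current-version-only inventory triggers this exact error, the sketch below lines up the table columns from the question against the ORC struct above by position. It assumes ORC fields are matched to table columns by index rather than by name, which is what the error message suggests. The function name is illustrative only, and only the first nine ORC fields are listed because the question's table has nine columns.

```python
# Table columns from the question's DDL, in declaration order.
TABLE_COLUMNS = [
    ("bucket", "string"), ("key", "string"), ("version_id", "string"),
    ("is_latest", "boolean"), ("is_delete_marker", "boolean"),
    ("size", "bigint"), ("last_modified_date", "timestamp"),
    ("e_tag", "string"), ("storage_class", "string"),
]

# Leading fields of the ORC struct printed by orcfiledump above.
ORC_FIELDS = [
    ("bucket", "string"), ("key", "string"), ("size", "bigint"),
    ("last_modified_date", "timestamp"), ("e_tag", "string"),
    ("storage_class", "string"), ("is_multipart_uploaded", "boolean"),
    ("replication_status", "string"), ("encryption_status", "string"),
]

def positional_mismatches(table, orc):
    """Return (position, table column, ORC field) wherever the paired types differ."""
    return [
        (i, t, o)
        for i, (t, o) in enumerate(zip(table, orc), start=1)
        if t[1] != o[1]
    ]

for pos, (t_name, t_type), (o_name, o_type) in positional_mismatches(TABLE_COLUMNS, ORC_FIELDS):
    print(f"{pos}: table {t_name} {t_type} <- ORC {o_name} {o_type}")
```

Under that assumption, position 3 pairs the ORC field `size` (bigint) with the table's string column `version_id`, which matches the wording of the HIVE_BAD_DATA message, and position 6 pairs the table's bigint `size` with the ORC string field `storage_class`.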

Create the table for the ORC format:

-- Create table IF USING 'CURRENT VERSION' only in S3 inventory config
CREATE EXTERNAL TABLE your_table_name(
  `bucket` string,
  key string,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string,
  object_lock_retain_until_date timestamp,
  object_lock_mode string,
  object_lock_legal_hold_status string
  )
  PARTITIONED BY (dt string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://.../hive/';

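Also, as a side note, if you use 'SSE-KMS' (i.e. your own KMS key) rather than 'SSE-S3' encryption, you do not need to configure 'has_encrypted_data'.

Note

With SSE-KMS, Athena does not require you to indicate that data is encrypted when creating a table.

Source: https://docs.aws.amazon.com/athena/latest/ug/encryption.html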

【Comments】:

【Answer 2】:

Running the following command on any ORC file from an inventory generated by AWS S3 gives you the actual structure of your inventory:

$> hive --orcfiledump ~/Downloads/017c2014-1205-4431-a30d-2d9ae15492d6.orc
...
Processing data file /tmp/017017c2014-1205-4431-a30d-2d9ae15492d6.orc [length: 4741786]
Structure for /mp/017c2014-1205-4431-a30d-2d9ae15492d6.orc
File Version: 0.12 with ORC_135
Rows: 223473
Compression: ZLIB
Compression size: 262144
Type: struct<bucket:string,key:string,size:bigint,last_modified_date:timestamp,e_tag:string,storage_class:string,is_multipart_uploaded:boolean,replication_status:string,encryption_status:string>
...
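Since the `Type:` line of the dump already lists every field in file order, the column list of the table can be derived from it mechanically. The following is a minimal sketch under two assumptions: the struct is flat (no nested or parameterized types, as in the dump above), and the type names printed by orcfiledump can be used unchanged in the DDL. Both helper names are hypothetical.

```python
import re

def parse_orc_struct(dump_text: str):
    """Extract (name, type) pairs from the 'Type: struct<...>' line of orcfiledump output.

    Handles flat structs only, which is what the inventory dump above shows.
    """
    match = re.search(r"^Type:\s*struct<(.*)>\s*$", dump_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Type: struct<...>' line found")
    fields = []
    for item in match.group(1).split(","):
        name, _, orc_type = item.partition(":")
        fields.append((name.strip(), orc_type.strip()))
    return fields

def column_definitions(fields):
    """Render the fields as the column list of a CREATE EXTERNAL TABLE statement."""
    # Backtick every name so reserved words cannot break the DDL.
    return ",\n".join(f"  `{name}` {orc_type}" for name, orc_type in fields)

dump = """Rows: 223473
Compression: ZLIB
Type: struct<bucket:string,key:string,size:bigint,last_modified_date:timestamp,e_tag:string,storage_class:string,is_multipart_uploaded:boolean,replication_status:string,encryption_status:string>
"""

print(column_definitions(parse_orc_struct(dump)))
```

The printed lines can be pasted between the parentheses of a CREATE EXTERNAL TABLE statement, so the declared column order follows the file rather than a documentation example.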

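It appears that the example provided by AWS here expects your inventory to cover not just the current version but all versions of the objects in your bucket.

The correct table structure for Athena, for an encrypted bucket, is then: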

CREATE EXTERNAL TABLE inventory(
  bucket string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string
  )
  PARTITIONED BY (dt string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://............/hive'
  TBLPROPERTIES ('has_encrypted_data'='true');

【Comments】:

Thank you so much for this answer. I was really struggling. Now we're all set! dropbox.com/s/lts40z1tgtrqpwe/…
