Store multiple elements in json files in AWS Athena
Asked: 2017-06-21 09:44:47
I have some JSON files stored in an S3 bucket, where each file contains multiple elements with the same structure. For example:
[{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}},{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}},{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}]
I want to create a table in Athena that corresponds to the data above.
The query I wrote to create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.elb_logs2 (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'field.delim' = ' '
) LOCATION 's3://<bucketname>/';
But if I run a SELECT query as follows,
SELECT * FROM sampledb.elb_logs4;
I get the following result:
1 {"eventid":"1","eventversion":"1.0","image":{"id":"101","message":"New item!"},"eventsource":"aws:dynamodb","eventname":"INSERT","awsregion":"us-west-2"} {"eventid":"2","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"MODIFY","awsregion":"us-west-2"} {"eventid":"3","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"REMOVE","awsregion":"us-west-2"}
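The entire content of the JSON file is selected as a single entry here.
How can I read each element of the JSON file as a separate entry?
Edit: How can I read each sub-column of image, i.e. each element of the map?
Thanks.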
Answer 1:
Question 1: Storing multiple elements in JSON files in AWS Athena
I needed to rewrite my JSON file as
{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}}
{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
That means: remove the square brackets [ ] and keep each element on its own line:
{.....................}
{.....................}
{.....................}
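The rewrite described above (dropping the enclosing [ ] and putting each element on its own line) can be scripted rather than done by hand. The following is a minimal sketch, not part of the original answer: the function name array_to_json_lines is hypothetical, and downloading from and uploading back to S3 is left out. It only shows the text transformation on a shortened sample.

```python
import json


def array_to_json_lines(text: str) -> str:
    """Turn a JSON array of objects into one JSON object per line."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps escapes newlines inside strings, so every record
    # is guaranteed to occupy exactly one physical line.
    return "".join(
        json.dumps(record, separators=(",", ":"), ensure_ascii=False) + "\n"
        for record in records
    )


# Shortened sample in the same shape as the file from the question.
sample = (
    '[{"eventId":"1","eventName":"INSERT",'
    '"image":{"Message":"New item!","Id":101}},'
    '{"eventId":"2","eventName":"MODIFY",'
    '"image":{"Message":"This item has changed","Id":101}}]'
)

converted = array_to_json_lines(sample)
print(converted, end="")
```

Each output line is a complete JSON object, which is the layout the answer describes for files read through the JsonSerDe.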
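Question 2: Accessing non-linear (nested) JSON attributes
Declare image as a struct instead of a map so that its fields can be addressed by name: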
CREATE EXTERNAL TABLE IF NOT EXISTS <tablename> (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` struct <`Id` : string,
`Message` : string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
"dots.in.keys" = "true"
) LOCATION 's3://exampletablewithstream-us-west-2/';
Query:
select image.Id, image.message from <tablename>;
References:
http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/
https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords
Comments:
I have the same problem as Q1. Since my data comes from SendGrid, I don't have much choice about the data format :( Were you able to tell Kinesis Firehose to put a new line after each entry within one S3 file?