Parquet Data timestamp columns INT96 not yet implemented in Druid Overlord Hadoop task

Posted: 2018-01-21 11:19:05

Context:
I am able to submit a MapReduce job to EMR from the Druid Overlord. My data source is Parquet-format data on S3. The Parquet data has a timestamp column (INT96), which the Avro schema conversion does not support.
The job fails while parsing the timestamp. The stack trace is:
Error: java.lang.IllegalArgumentException: INT96 not yet implemented.
at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264)
at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223)
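For what it's worth, the failure can be reproduced outside Druid with a few lines against parquet-avro (a minimal sketch, assuming a parquet-avro version from this era such as 1.8.x; the one-column schema string is a made-up example): AvroSchemaConverter simply has no Avro mapping for the INT96 physical type, so converting any schema containing such a column throws.

    import org.apache.parquet.avro.AvroSchemaConverter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class Int96Repro {
        public static void main(String[] args) {
            // A Parquet schema with one INT96 column, like the timestamp column "t" above.
            MessageType schema = MessageTypeParser.parseMessageType(
                    "message event { required int96 t; }");
            // Throws java.lang.IllegalArgumentException: INT96 not yet implemented.
            new AvroSchemaConverter().convert(schema);
        }
    }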
Environment:
Druid version: 0.11
EMR version: emr-5.11.0
Hadoop version: Amazon 2.7.3
Druid input JSON:
"type": "index_hadoop",
"spec":
"ioConfig":
"type": "hadoop",
"inputSpec":
"type": "static",
"inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
"paths": "s3://s3_path"
,
"dataSchema":
"dataSource": "parquet_test1",
"granularitySpec":
"type": "uniform",
"segmentGranularity": "DAY",
"queryGranularity": "ALL",
"intervals": ["2017-08-01T00:00:00/2017-08-02T00:00:00"]
,
"parser":
"type": "parquet",
"parseSpec":
"format": "timeAndDims",
"timestampSpec":
"column": "t",
"format": "yyyy-MM-dd HH:mm:ss:SSS zzz"
,
"dimensionsSpec":
"dimensions": [
"dim1","dim2","dim3"
],
"dimensionExclusions": [],
"spatialDimensions": []
,
"metricsSpec": [
"type": "count",
"name": "count"
,
"type" : "count",
"name" : "pid",
"fieldName" : "pid"
]
,
"tuningConfig":
"type": "hadoop",
"partitionsSpec":
"targetPartitionSize": 5000000
,
"jobProperties" :
"mapreduce.job.user.classpath.first": "true",
"fs.s3.awsAccessKeyId" : "KEYID",
"fs.s3.awsSecretAccessKey" : "AccessKey",
"fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
"fs.s3n.awsAccessKeyId" : "KEYID",
"fs.s3n.awsSecretAccessKey" : "AccessKey",
"fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
"io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
,
"leaveIntermediate": true
, "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3", "com.hadoop.gplcompression:hadoop-lzo:0.4.20"]
Possible solutions:
1. Store the data in Parquet without INT96 timestamps, so that no Avro conversion of that type is needed (see the sketch after this list).
2. Fix the Avro schema conversion to support Parquet's INT96 timestamp type.
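As a sketch of option 1: if the Parquet files were produced by Spark (INT96 is Spark's legacy timestamp encoding), they can be rewritten with INT64 timestamps. This assumes Spark 2.3+, where the spark.sql.parquet.outputTimestampType setting controls the physical type used for timestamp columns; the output S3 path is a hypothetical placeholder:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class RewriteInt96Timestamps {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("rewrite-int96-timestamps")
                    .getOrCreate();

            // Write timestamps as INT64 microseconds instead of the legacy INT96.
            spark.conf().set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS");

            Dataset<Row> df = spark.read().parquet("s3://s3_path");
            df.write().mode("overwrite").parquet("s3://s3_path_int64"); // hypothetical output path
            spark.stop();
        }
    }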
Comments:

I saw a post discussing this issue: mail-archives.apache.org/mod_mbox/parquet-dev/201607.mbox/… Any idea whether this is on the roadmap, or whether there are other possible solutions?

Answer 1:

Druid 0.17.0 and later support the Parquet INT96 type via the Parquet Hadoop Parser.
The Parquet Hadoop Parser supports int96 Parquet values, while the Parquet Avro Hadoop Parser does not. There may also be some subtle differences in how the two evaluate JSON path expressions in a flattenSpec.
https://druid.apache.org/docs/0.17.0/ingestion/data-formats.html#parquet-hadoop-parser
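Based on that page, migrating the spec above mostly means switching the input format class and the parser type (both parsers ship in the druid-parquet-extensions module). A sketch of the changed fragments, following the 0.17.0 docs; the "auto" timestamp format is an assumption here, since an int96 value should arrive as a numeric timestamp rather than the formatted string used above:

    "inputSpec": {
      "type": "static",
      "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
      "paths": "s3://s3_path"
    }

    "parser": {
      "type": "parquet",
      "parseSpec": {
        "format": "timeAndDims",
        "timestampSpec": { "column": "t", "format": "auto" },
        "dimensionsSpec": { "dimensions": ["dim1", "dim2", "dim3"] }
      }
    }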