Kinesis Firehose - What is S3 extended destination configuration?
【Title】: Kinesis Firehose - What is S3 extended destination configuration? 【Posted】: 2020-06-24 02:28:39 【Question】:
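What is the S3 extended destination configuration, and where in the AWS documentation is its purpose clearly explained?
As the name suggests, it must be about the S3 destination. However, the S3 destination section of the AWS documentation does not mention it.
Choose Amazon S3 for Your Destination
If there is an article or blog post that explains it clearly, please point me to it.
I have been looking for clues in the documents below but, as is often the case with AWS documentation, they are unclear. It appears to be partly related to input record format conversion or record processing.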
Amazon Kinesis Data Firehose API Reference - ExtendedS3DestinationConfiguration
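Describes the configuration of a destination in Amazon S3.
Amazon Kinesis Data Firehose Developer Guide PDF - Converting Input Record Format (API)
If you want Kinesis Data Firehose to convert the format of your input data from JSON to Parquet or ORC, specify the optional DataFormatConversionConfiguration element in ExtendedS3DestinationConfiguration ...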
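As a rough illustration of the element that passage describes, the Python sketch below builds a DataFormatConversionConfiguration dictionary for JSON-to-Parquet conversion. The field names follow the Firehose API request syntax; the function name, the role ARN, and the database, table, and region values are hypothetical placeholders, not taken from the question.

```python
import json


def parquet_conversion_config(role_arn, database, table, region):
    """Build a DataFormatConversionConfiguration dict (JSON in, Parquet out)."""
    return {
        "Enabled": True,
        # The schema is looked up in an AWS Glue table.
        "SchemaConfiguration": {
            "RoleARN": role_arn,
            "DatabaseName": database,
            "TableName": table,
            "Region": region,
        },
        # Incoming records are parsed as JSON with the OpenX SerDe defaults.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # Outgoing records are written as Snappy-compressed Parquet.
        "OutputFormatConfiguration": {
            "Serializer": {"ParquetSerDe": {"Compression": "SNAPPY"}}
        },
    }


# Hypothetical values, for illustration only.
cfg = parquet_conversion_config(
    role_arn="arn:aws:iam::123456789012:role/firehose_role",
    database="analytics",
    table="events",
    region="us-east-1",
)
print(json.dumps(cfg, indent=2))
```

A dictionary like this would sit under the "DataFormatConversionConfiguration" key of the extended S3 destination configuration; it is not accepted by the plain S3 destination configuration.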
AWS CloudFormation - AWS::KinesisFirehose::DeliveryStream ExtendedS3DestinationConfiguration
The ExtendedS3DestinationConfiguration property type configures an Amazon S3 destination for an Amazon Kinesis Data Firehose delivery stream.
Extended S3 Destination
resource "aws_kinesis_firehose_delivery_stream" "extended_s3_stream" {
  name        = "terraform-kinesis-firehose-extended-s3-test-stream"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = "${aws_iam_role.firehose_role.arn}"
    bucket_arn = "${aws_s3_bucket.bucket.arn}"

    processing_configuration {
      enabled = "true"

      processors {
        type = "Lambda"

        parameters {
          parameter_name  = "LambdaArn"
          parameter_value = "${aws_lambda_function.lambda_processor.arn}:$LATEST"
        }
      }
    }
  }
}
【Comments on the question】:
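【Answer 1】: The Terraform documentation does the best job of showing the difference between the S3 and extended S3 destinations: https://www.terraform.io/docs/providers/aws/r/kinesis_firehose_delivery_stream.html
Extended S3 inherits the S3 destination configuration parameters and adds extra ones, such as data_format_conversion_configuration or error_output_prefix.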
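The "same fields plus extras" relationship can be sketched in Python as a plain dictionary merge. This is only an illustration under stated assumptions: it uses API-style key names rather than the Terraform argument names above, the ARNs are hypothetical, and the set of extended-only fields is an assumption limited to three keys.

```python
# Hypothetical values for illustration; these ARNs do not refer to real resources.
BASE_S3_SETTINGS = {
    "RoleARN": "arn:aws:iam::123456789012:role/firehose_role",
    "BucketARN": "arn:aws:s3:::example-bucket",
    "Prefix": "events/",
    "CompressionFormat": "GZIP",
}

# Fields assumed here to exist only on the extended configuration.
EXTENDED_ONLY_FIELDS = {
    "ProcessingConfiguration",
    "DataFormatConversionConfiguration",
    "S3BackupMode",
}


def extend_s3_settings(base, **extras):
    """Return a new dict: the basic S3 settings plus extended-only fields."""
    unknown = set(extras) - EXTENDED_ONLY_FIELDS
    if unknown:
        raise ValueError(f"not an extended-only field: {sorted(unknown)}")
    # The base settings are copied, not modified.
    return {**base, **extras}


extended = extend_s3_settings(BASE_S3_SETTINGS, S3BackupMode="Disabled")
print(sorted(extended))
```

Every key of the basic settings survives unchanged in the extended dictionary, which is the sense in which the extended configuration is a superset.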
【Comments】:
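【Answer 2】: I'm afraid the Kinesis Firehose documentation is so poorly written that I wonder how anyone can figure out how to use Firehose from the documentation alone.
At first, it seems, Firehose simply relayed data to an S3 bucket with no built-in transformation mechanism, and the S3 destination configuration, as in AWS::KinesisFirehose::DeliveryStream S3DestinationConfiguration, has no processing configuration.
Then, according to Amazon Kinesis Firehose Data Transformation with AWS Lambda, a mechanism for transforming records seems to have been introduced around early 2017, and AWS::KinesisFirehose::DeliveryStream ExtendedS3DestinationConfiguration was added as a result.
Evidently people have had a hard time finding out how to configure it:
Does Amazon Kinesis Firehose support Data Transformations programmatically?
Well, after a great deal of effort and documentation hunting, I figured it out.
Who could figure this out by reading the AWS documentation?
Amazon Kinesis Data Firehose Data Transformation
Firehose extended S3 configuration for Lambda transformation
I could not work it out from the AWS documentation, but after looking at actual implementations on the Internet, the required configuration appears to be as follows.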
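A minimal Python sketch of that kind of Lambda processing configuration follows. It assumes the field names from the Firehose API request syntax (ProcessingConfiguration, Processors, Parameters); the helper name and the Lambda ARN are hypothetical, and this is not the answer author's own configuration.

```python
def lambda_processing_config(lambda_arn, retries=None):
    """Build a ProcessingConfiguration dict with one Lambda processor."""
    parameters = [{"ParameterName": "LambdaArn", "ParameterValue": lambda_arn}]
    if retries is not None:
        # Parameter values are passed as strings, even for numbers.
        parameters.append(
            {"ParameterName": "NumberOfRetries", "ParameterValue": str(retries)}
        )
    return {
        "Enabled": True,
        "Processors": [{"Type": "Lambda", "Parameters": parameters}],
    }


# Hypothetical ARN, for illustration only.
processing = lambda_processing_config(
    "arn:aws:lambda:us-east-1:123456789012:function:transform:$LATEST",
    retries=3,
)
print(processing)
```

The resulting dictionary would go under the "ProcessingConfiguration" key of the extended S3 destination configuration, for example in the ExtendedS3DestinationConfiguration argument of boto3's create_delivery_stream.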
Update
As suggested by Kevin Eid.
Resource: aws_kinesis_firehose_delivery_stream
s3_configuration - (Optional) Required for non-S3 destinations. For S3 destination, use extended_s3_configuration instead.
The extended_s3_configuration object supports the same fields from s3_configuration as well as the following:
data_format_conversion_configuration - (Optional) Nested argument for the serializer, deserializer, and schema for converting data from the JSON format to the Parquet or ORC format before writing it to Amazon S3. More details given below.
error_output_prefix - (Optional) Prefix added to failed records before writing them to S3. This prefix appears immediately following the bucket name.
processing_configuration - (Optional) The data processing configuration. More details are given below.
s3_backup_mode - (Optional) The Amazon S3 backup mode. Valid values are Disabled and Enabled. Default value is Disabled.
s3_backup_configuration - (Optional) The configuration for backup in Amazon S3. Required if s3_backup_mode is Enabled. Supports the same fields as s3_configuration object.
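I believe s3_configuration still exists only for compatibility or legacy reasons, so for an S3 destination extended_s3_configuration is the only one you need to use, but the AWS documentation does not explain this properly. It is a pity that the AWS documentation cannot serve as the source of truth.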
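The rule quoted above, that s3_backup_configuration is required when s3_backup_mode is Enabled, can be expressed as a small check. This is a sketch over a plain Python dictionary that merely mirrors the Terraform argument names; the function name and sample values are hypothetical, and Terraform itself performs its own validation.

```python
def check_backup_settings(extended_s3):
    """Validate the backup-related keys of an extended S3 settings dict."""
    mode = extended_s3.get("s3_backup_mode", "Disabled")  # default is Disabled
    if mode not in ("Disabled", "Enabled"):
        raise ValueError(f"invalid s3_backup_mode: {mode!r}")
    if mode == "Enabled" and not extended_s3.get("s3_backup_configuration"):
        raise ValueError("s3_backup_configuration is required when backup is Enabled")
    return mode


# Hypothetical settings, for illustration only.
ok = check_backup_settings(
    {
        "s3_backup_mode": "Enabled",
        "s3_backup_configuration": {"bucket_arn": "arn:aws:s3:::example-backup"},
    }
)
print(ok)
```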
【Comments】:
【Answer 3】: The ExtendedS3DestinationConfiguration property type configures an Amazon S3 destination for an Amazon Kinesis Data Firehose delivery stream. See: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-extendeds3destinationconfiguration.html
Thanks.
【Comments】:
【Answer 4】: This small screenshot shows the components that are new in ExtendedS3DestinationConfiguration compared with S3DestinationConfiguration:
Also, for what the extended S3 configuration is and how it is defined, see the API documentation:
{
   "RoleARN": "string",
   "BucketARN": "string",
   "Prefix": "string",
   "ErrorOutputPrefix": "string",
   "BufferingHints": {
      "SizeInMBs": integer,
      "IntervalInSeconds": integer
   },
   "CompressionFormat": "UNCOMPRESSED"|"GZIP"|"ZIP"|"Snappy",
   "EncryptionConfiguration": {
      "NoEncryptionConfig": "NoEncryption",
      "KMSEncryptionConfig": {
         "AWSKMSKeyARN": "string"
      }
   },
   "CloudWatchLoggingOptions": {
      "Enabled": true|false,
      "LogGroupName": "string",
      "LogStreamName": "string"
   },
   "ProcessingConfiguration": {
      "Enabled": true|false,
      "Processors": [
         {
            "Type": "Lambda",
            "Parameters": [
               {
                  "ParameterName": "LambdaArn"|"NumberOfRetries"|"RoleArn"|"BufferSizeInMBs"|"BufferIntervalInSeconds",
                  "ParameterValue": "string"
               }
               ...
            ]
         }
         ...
      ]
   },
   "S3BackupMode": "Disabled"|"Enabled",
   "S3BackupUpdate": {
      "RoleARN": "string",
      "BucketARN": "string",
      "Prefix": "string",
      "ErrorOutputPrefix": "string",
      "BufferingHints": {
         "SizeInMBs": integer,
         "IntervalInSeconds": integer
      },
      "CompressionFormat": "UNCOMPRESSED"|"GZIP"|"ZIP"|"Snappy",
      "EncryptionConfiguration": {
         "NoEncryptionConfig": "NoEncryption",
         "KMSEncryptionConfig": {
            "AWSKMSKeyARN": "string"
         }
      },
      "CloudWatchLoggingOptions": {
         "Enabled": true|false,
         "LogGroupName": "string",
         "LogStreamName": "string"
      }
   },
   "DataFormatConversionConfiguration": {
      "SchemaConfiguration": {
         "RoleARN": "string",
         "CatalogId": "string",
         "DatabaseName": "string",
         "TableName": "string",
         "Region": "string",
         "VersionId": "string"
      },
      "InputFormatConfiguration": {
         "Deserializer": {
            "OpenXJsonSerDe": {
               "ConvertDotsInJsonKeysToUnderscores": true|false,
               "CaseInsensitive": true|false,
               "ColumnToJsonKeyMappings": {
                  "string": "string"
                  ...
               }
            },
            "HiveJsonSerDe": {
               "TimestampFormats": ["string", ...]
            }
         }
      },
      "OutputFormatConfiguration": {
         "Serializer": {
            "ParquetSerDe": {
               "BlockSizeBytes": integer,
               "PageSizeBytes": integer,
               "Compression": "UNCOMPRESSED"|"GZIP"|"SNAPPY",
               "EnableDictionaryCompression": true|false,
               "MaxPaddingBytes": integer,
               "WriterVersion": "V1"|"V2"
            },
            "OrcSerDe": {
               "StripeSizeBytes": integer,
               "BlockSizeBytes": integer,
               "RowIndexStride": integer,
               "EnablePadding": true|false,
               "PaddingTolerance": double,
               "Compression": "NONE"|"ZLIB"|"SNAPPY",
               "BloomFilterColumns": ["string", ...],
               "BloomFilterFalsePositiveProbability": double,
               "DictionaryKeyThreshold": double,
               "FormatVersion": "V0_11"|"V0_12"
            }
         }
      },
      "Enabled": true|false
   }
}
【Comments】:
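Thanks for the follow-up. But who needs to use it, and for what?
@mon It gives you a lot of options, such as compression, encryption, an S3 backup bucket, and logging. For example, you can aggregate all the streamed data in a compressed format to save on S3 storage costs. You don't have to use all of these options, but they are there.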
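The compression use case from that comment might look like the following Python sketch. It assumes the BufferingHints and CompressionFormat field names and the allowed compression values from the API request syntax; the helper name and the buffer size and interval are hypothetical illustration values, not recommendations.

```python
ALLOWED_COMPRESSION = {"UNCOMPRESSED", "GZIP", "ZIP", "Snappy"}


def compressed_delivery_settings(size_mb, interval_s, compression="GZIP"):
    """Build the buffering and compression part of an S3 destination config."""
    if compression not in ALLOWED_COMPRESSION:
        raise ValueError(f"unsupported CompressionFormat: {compression!r}")
    return {
        # Firehose buffers records until either hint is reached, then writes
        # one object to S3, so larger buffers mean fewer, bigger objects.
        "BufferingHints": {"SizeInMBs": size_mb, "IntervalInSeconds": interval_s},
        "CompressionFormat": compression,
    }


# Hypothetical buffer size and interval.
settings = compressed_delivery_settings(size_mb=64, interval_s=300)
print(settings)
```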