Redshift Spectrum much slower than Athena?

Posted 2023-02-19

【Title】: Redshift Spectrum much slower than Athena? 【Published】: 2020-03-23 00:28:00 【Question】:

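Our data is stored in S3 as JSON, without partitions. Until today we used only Athena, but now we are trying Redshift Spectrum.

We ran the same query twice, once with Redshift Spectrum and once with Athena. Both read the same data in S3.

With Redshift Spectrum the report takes a very long time (more than 15 minutes) to run, while with Athena it takes only 10 seconds.

The query we ran from the AWS console in both cases is: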

SELECT "events"."persistentid" AS "persistentid",
  SUM(1) AS "sum_number_of_reco"
FROM "analytics"."events" "events"
GROUP BY "events"."persistentid"

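Any idea what is going on? Thanks

【Comments】:

This is the difference between serverless and your own server. Redshift Spectrum uses your Redshift cluster, which you can resize as needed, but it may be smaller than the Athena fleet allocated to your query.

AWS Support said this is because we have a lot of small files (we use Kinesis Firehose, which creates a file in S3 every 5 minutes)...

【Answer 1】: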

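Redshift Spectrum processing capacity is limited by the size of your Redshift cluster.

You can find details in Improving Amazon Redshift Spectrum Query Performance:

The Amazon Redshift query planner pushes predicates and aggregations to the Redshift Spectrum query layer whenever possible. When large amounts of data are returned from Amazon S3, the processing is limited by your cluster's resources. Redshift Spectrum scales automatically to process large requests. Thus, your overall performance improves whenever you can push processing to the Redshift Spectrum layer.

Athena, on the other hand, uses an amount of resources optimized for the query, which may be more than Spectrum can get on a small Redshift cluster.

Our own tests of Redshift Spectrum performance across different Redshift cluster sizes confirmed this.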
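The comments on the question attribute the slowdown to many small files (Kinesis Firehose writing a new object every 5 minutes). One common mitigation is to compact small files into fewer, larger ones before querying. The following is a minimal sketch of that idea on local files only; it assumes one JSON record per line, and the function name `compact_json_lines`, the output naming scheme, and the 128 MB default are illustrative choices, not anything taken from the original thread or from an AWS API.

```python
from pathlib import Path
from typing import Iterable, List


def compact_json_lines(small_files: Iterable[Path], out_dir: Path,
                       target_bytes: int = 128 * 1024 * 1024) -> List[Path]:
    """Concatenate small newline-delimited JSON files into larger ones.

    Hypothetical helper: assumes every input line is one complete JSON record.
    A new output file is started once the current one reaches target_bytes.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    current = None   # currently open output file handle
    written = 0      # bytes written to the current output file
    try:
        for src in sorted(small_files):
            with src.open("rb") as fh:
                for line in fh:
                    if not line.strip():
                        continue  # skip blank lines
                    if current is None or written >= target_bytes:
                        if current is not None:
                            current.close()
                        path = out_dir / f"part-{len(outputs):05d}.json"
                        current = path.open("wb")
                        outputs.append(path)
                        written = 0
                    if not line.endswith(b"\n"):
                        line += b"\n"  # keep records on separate lines
                    current.write(line)
                    written += len(line)
    finally:
        if current is not None:
            current.close()
    return outputs
```

In practice the inputs and outputs would live in S3 and the external table would need to point at the compacted location; that part is deliberately left out here.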

【Discussion】:

Note that the core processing of Redshift Spectrum runs in a layer that is independent of Redshift and is not affected by the Redshift cluster.

From "Improving Amazon S3 query performance with predicate pushdown": The processing that is done in the Amazon Redshift Spectrum layer (the Amazon S3 scan, projection, filtering, and aggregation) is independent from any individual Amazon Redshift cluster.

aws.amazon.com/blogs/big-data/…
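To make the idea of pushing aggregation down to the scan layer concrete, here is a toy model of the question's `GROUP BY "persistentid"` query as a two-phase aggregation: each file is reduced to partial counts where it is scanned, and only those small partial results are merged afterwards. This is a conceptual sketch only, not how Redshift Spectrum is actually implemented; the function names and the sample rows are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def partial_counts(rows: Iterable[dict]) -> Counter:
    """Scan-side step (hypothetical): count rows per persistentid in one file."""
    return Counter(row["persistentid"] for row in rows)


def merge_counts(partials: Iterable[Counter]) -> Dict[str, int]:
    """Merge step (hypothetical): combine the per-file partial counts."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return dict(total)


# Two made-up "files", each a list of already-parsed JSON records.
files = [
    [{"persistentid": "a"}, {"persistentid": "b"}, {"persistentid": "a"}],
    [{"persistentid": "b"}, {"persistentid": "c"}],
]

result = merge_counts(partial_counts(f) for f in files)
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```

The point of the split is that the merge step sees one small counter per file rather than every row, which is why work done in the scan layer reduces what the cluster has to process.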
