Creating dynamic frame issue without the pushdown predicate

Posted 2023-02-16

【Title】: Creating dynamic frame issue without the pushdown predicate 【Posted】: 2021-09-14 01:15:54 【Question】:

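I'm new to AWS Glue, so pardon my question: why do I get an error when I create a dynamic frame without a pushdown predicate? I'm trying to leave the predicate out because I'll be using job bookmarks, so only new files should be processed regardless of the date partition.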

datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(database=db_name, table_name=table1, transformation_ctx="datasourceDyF")
datasourceDyF.toDF().show(20)

vs

datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(database=db_name, table_name=table1, transformation_ctx="datasourceDyF", push_down_predicate="salesdate = '2020-01-01'")
datasourceDyF.toDF().show(20)

The first snippet gives this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,
 most recent failure: Lost task 0.3 in stage 1.0 (TID 4, xxx.xx.xxx.xx, executor 5):
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

【Comments】:

【Answer 1】:

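A pushdown predicate works well when connecting to an RDBMS or a table: it tells Spark which data actually needs to be loaded into memory, since there is no point loading data that the downstream systems don't need. The benefit is that, with less data to read, the job runs faster than a full table load.

Now, in your case, the underlying table is probably partitioned, hence the need for the pushdown predicate.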
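To make the point above concrete, here is a toy sketch in plain Python of how a predicate on the partition column limits which files get opened. It is not the Glue API: the bucket paths, the `partitions` dict and the `files_to_read` helper are all made up for illustration. One possible reading of the question (an assumption, not something confirmed in this thread) is that the first snippet scans every partition and trips over a file that the second snippet never opens.

```python
# Toy model of partition pruning. This is NOT the AWS Glue API.
# Hypothetical catalog: partition value -> files stored under that partition.
partitions = {
    "2020-01-01": ["s3://bucket/sales/salesdate=2020-01-01/part-0.parquet"],
    "2020-01-02": ["s3://bucket/sales/salesdate=2020-01-02/part-0.parquet"],
    "2020-01-03": ["s3://bucket/sales/salesdate=2020-01-03/part-0.parquet"],
}


def files_to_read(partitions, predicate=None):
    """Return the files a reader would open.

    `predicate` is a function applied to the partition value;
    None means no pruning, i.e. a full scan of every partition.
    """
    selected = []
    for value, files in sorted(partitions.items()):
        if predicate is None or predicate(value):
            selected.extend(files)
    return selected


# Without a predicate, every partition is read.
full_scan = files_to_read(partitions)

# With a predicate, only the matching partition is read.
pruned = files_to_read(partitions, lambda salesdate: salesdate == "2020-01-01")

print(len(full_scan))  # 3
print(pruned)
```

With the predicate, only the `salesdate=2020-01-01` file is in the list, so any problem that lives in the other partitions would only show up in the full scan.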

【Comments】:
