What is the catalog_connection param in AWS Glue?

Posted 2023-03-31

【Title】: What is catalog_connection param in aws glue? 【Posted】: 2021-02-24 15:06:30 【Question】:

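I want an ETL job that runs periodically, every 4 hours, and merges (combines) data from an S3 bucket (Parquet format) with data from Redshift, finds the unique records, and then writes the result back to Redshift, replacing the old Redshift data. For writing the DataFrame to Redshift, this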

glueContext.write_dynamic_frame.from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx="")

Writes a DynamicFrame using the specified JDBC connection information.
frame – The DynamicFrame to write.
catalog_connection – A catalog connection to use.
connection_options – Connection options, such as path and database table (optional).
redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
transformation_ctx – A transformation context to use (optional).

seems to be the way to go. But what does catalog_connection mean? Does it refer to the Glue Data Catalog? If so, what in the Glue catalog does it point to?

【Answer 1】:

catalog_connection refers to a Glue connection defined in the Glue Data Catalog.

For example, if a connection named redshift_connection exists among your Glue connections, it would be used like this:

glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df,
               catalog_connection = "redshift_connection",
               connection_options = {"dbtable": df_name, "database": "testdb"},
               redshift_tmp_dir = "s3://glue-sample-target/temp-dir/")

Here are some detailed examples: https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
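Since the question also asks about replacing the old Redshift data, here is a minimal sketch of how the arguments for the call above might be assembled. It assumes the Redshift "preactions" connection option (SQL run before the load, as covered in the linked article) is an acceptable way to empty the target table first. The helper name build_redshift_write_args and the table name public.events are hypothetical, not part of the Glue API.

```python
from typing import Any, Dict


def build_redshift_write_args(
    connection_name: str,
    table: str,
    database: str,
    tmp_dir: str,
    truncate_first: bool = False,
) -> Dict[str, Any]:
    """Assemble keyword arguments for write_dynamic_frame.from_jdbc_conf.

    Hypothetical helper: it only builds a dict and does not talk to AWS.
    """
    connection_options: Dict[str, Any] = {"dbtable": table, "database": database}
    if truncate_first:
        # Assumption: emptying the table via "preactions" is how the old
        # data gets replaced; adjust the SQL to your own needs.
        connection_options["preactions"] = f"TRUNCATE TABLE {table};"
    return {
        # Name of the Glue connection, not the name of a catalog database.
        "catalog_connection": connection_name,
        "connection_options": connection_options,
        "redshift_tmp_dir": tmp_dir,
    }


args = build_redshift_write_args(
    "redshift_connection",
    "public.events",  # hypothetical table name
    "testdb",
    "s3://glue-sample-target/temp-dir/",
    truncate_first=True,
)

# Inside a Glue job, where glueContext and the DynamicFrame m_df already exist:
# glueContext.write_dynamic_frame.from_jdbc_conf(frame=m_df, **args)
```

Keeping the argument assembly separate from the Glue call makes it possible to check the options locally before running the job.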
