aws glue / pyspark - how to create Athena table programmatically using Glue

Posted 2023-04-15

【Title】: aws glue / pyspark - how to create Athena table programmatically using Glue 【Posted】: 2019-05-31 10:59:34 【Question】:

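I am running a script in AWS Glue that loads data from S3, applies some transformations, and saves the results back to S3. I am trying to add one more step to this routine: I want to create a new table in an existing Athena database.

I could not find any similar example in the AWS documentation. In the examples I have come across, the results are only written to S3. Is this possible in Glue?

Here is a code sample. How should it be modified to create an Athena table with the output results?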

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql import SQLContext
from pyspark.sql.types import *


args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)


datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dataset", table_name = "table_1", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "long", "id", "long"), ("description", "string", "description", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://..."}, format = "parquet", transformation_ctx = "datasink4")


# TODO: create Athena table with the output results

job.commit()

【Answer 1】:

I can think of two ways to do this. One is to use the SDK to get a reference to the Athena API and use it to run a query containing a CREATE TABLE statement, as seen at this blog post.
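A minimal sketch of the first approach, assuming the Parquet output from the question and a boto3 Athena client. The table name `table_2`, the `example-bucket` paths, and both helper function names are placeholders invented for illustration; the question elides the real S3 path. The column types are the Athena equivalents of the `long` and `string` fields in the `ApplyMapping` step.

```python
def build_create_table_ddl(database, table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement for Parquet files under `location`.

    `columns` is a list of (name, athena_type) pairs.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


def create_athena_table(athena_client, ddl, database, output_location):
    """Submit the DDL to Athena and return the query execution id.

    `athena_client` is expected to be boto3.client("athena").
    `output_location` is an S3 path where Athena stores query results.
    """
    response = athena_client.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Inside the Glue job this could be called just before `job.commit()`, for example `create_athena_table(boto3.client("athena"), ddl, "dataset", "s3://example-bucket/athena-results/")`. The call only submits the query; it does not wait for it to finish, so a job that depends on the table would still need to poll `get_query_execution`.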

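The other, possibly more interesting, way is to use the Glue API to create a crawler for your S3 bucket and then run the crawler.

With the second approach your table is cataloged, so you can use it not only from Athena but also from EMR or Redshift Spectrum.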
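The crawler approach could be sketched roughly as follows, assuming a boto3 Glue client and an IAM role that the crawler is allowed to assume. The crawler name, role ARN, and S3 path are placeholders, and the function name is made up for this example. The polling loop assumes the crawler reports `RUNNING` once started and returns to `READY` when it has finished.

```python
import time


def create_and_run_crawler(glue_client, name, role_arn, database, s3_path,
                           poll_seconds=15):
    """Create a crawler over `s3_path`, start it, and wait until it is idle again.

    `glue_client` is expected to be boto3.client("glue").
    Tables discovered by the crawler are registered in `database`.
    """
    glue_client.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue_client.start_crawler(Name=name)

    # Poll the crawler state until it goes back to READY.
    while True:
        state = glue_client.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return
        time.sleep(poll_seconds)
```

The crawler infers the schema from the files, so no column list is needed. Calling `create_crawler` a second time with the same name fails, so a job that runs repeatedly would create the crawler once and only call `start_crawler` afterwards.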

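【Comments】:

@Javier Ramirez, creating the table with boto3 in a separate py script would be one option, but I would like to know whether this can be done directly in Glue. The AWS documentation at docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html mentions a create-table function, but there is no example, and it is not clear to me whether it can be used inside a Glue script and, if so, what the call should look like.

You can always do the table creation in the same script; no other script is needed. As you point out, you can also create the table with the Glue API. What you linked is the Web API documentation, which is accessible through boto3 (very similar to the solution I pointed to); you can see a complete example at ***.com/questions/52318838/…. (More in the next comment)

There is another version of the API directly in Glue (so no boto3 involved), documented at docs.aws.amazon.com/glue/latest/dg/…. That one has no examples, but since the parameters are the same as in the Web API, you can probably use it as inspiration. Hope this helps.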
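As a rough illustration of the CreateTable route discussed in these comments, the sketch below builds a `TableInput` for Parquet data and passes it to the boto3 Glue client. The helper names, the table name, and the S3 location are assumptions made for this example; the input/output format and SerDe class names are the standard Hive classes for Parquet.

```python
def build_table_input(table, columns, location):
    """Build the TableInput structure for an external Parquet table.

    `columns` is a list of (name, hive_type) pairs.
    """
    return {
        "Name": table,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                "Parameters": {"serialization.format": "1"},
            },
        },
    }


def create_catalog_table(glue_client, database, table_input):
    """Register the table in the Glue Data Catalog, which Athena reads from.

    `glue_client` is expected to be boto3.client("glue").
    """
    glue_client.create_table(DatabaseName=database, TableInput=table_input)
```

In the question's script this would run after the `write_dynamic_frame` call, for example `create_catalog_table(boto3.client("glue"), "dataset", build_table_input("table_2", [("id", "bigint"), ("description", "string")], "s3://example-bucket/output/"))`. Unlike the crawler, this requires the column list to be kept in sync with the `ApplyMapping` step by hand.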
