AWS Glue - Field with Json structure in Redshift
[Posted]: 2021-07-08 14:58:11
[Question]: Hi, I am using AWS Glue to try to load data from a Json file in S3 into Redshift. I am using a Json crawler with the path $[*], and for some reason one of the fields (grade) arrives in the table as a Json structure:
Any ideas on how to make "grade" show only the value of the grade itself? Do I need to adjust the PySpark script for this job?
Here is the script so far:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import current_date
from awsglue.dynamicframe import DynamicFrame
## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "linkredshift", table_name = "uni3", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "linkredshift", table_name = "uni3", transformation_ctx = "datasource0")
df=datasource0.toDF().withColumn('data_date',current_date())
datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("subject", "string", "subject", "string"), ("grade", "string", "grade", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("subject", "string", "subject", "string"), ("grade", "string", "grade", "string"), ("data_date", "date", "data_date", "date")], transformation_ctx = "applymapping1")
## @type: SelectFields
## @args: [paths = ["subject", "grade", "last_name", "first_name"], transformation_ctx = "selectfields2"]
## @return: selectfields2
## @inputs: [frame = applymapping1]
selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["subject", "grade", "last_name", "first_name", "data_date"], transformation_ctx = "selectfields2")
## @type: ResolveChoice
## @args: [choice = "MATCH_CATALOG", database = "linkredshift", table_name = "dev_public_students", transformation_ctx = "resolvechoice3"]
## @return: resolvechoice3
## @inputs: [frame = selectfields2]
resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "linkredshift", table_name = "dev_public_students", transformation_ctx = "resolvechoice3")
## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice4"]
## @return: resolvechoice4
## @inputs: [frame = resolvechoice3]
resolvechoice4 = ResolveChoice.apply(frame = resolvechoice3, choice = "make_cols", transformation_ctx = "resolvechoice4")
## @type: DataSink
## @args: [database = "linkredshift", table_name = "dev_public_students", redshift_tmp_dir = TempDir, transformation_ctx = "datasink5"]
## @return: datasink5
## @inputs: [frame = resolvechoice4]
datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice4, database = "linkredshift", table_name = "dev_public_students", redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink5")
job.commit()
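[Answer 1]: Since the grade field can be either a string or an integer, the column is ambiguous, and you need a resolveChoice with a cast to make it one or the other. Try replacing your applymapping1 line with these two lines: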
# Resolve the ambiguous grade column (string or int) by casting every value to int.
datasource0 = datasource0.resolveChoice(specs=[("grade", "cast:int")])
# Map grade as int on both sides so it matches the cast above.
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("subject", "string", "subject", "string"), ("grade", "int", "grade", "int"), ("data_date", "date", "data_date", "date")], transformation_ctx = "applymapping1")
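Note that this changes your grade field to type int (that is what I cast to in resolveChoice, so the int values are the ones used). Feel free to convert the grade field to a string afterwards, but int seems like the better choice.
Reference: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-medicaid.html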
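To make the difference between resolution strategies concrete, here is a minimal pure-Python sketch of what a cast-style and a make_cols-style resolution do to a column that holds both ints and strings. It does not call the Glue API: `resolve_cast_int` and `resolve_make_cols` are hypothetical helpers, the sample records are made up, and turning values that cannot be cast into None is an assumption of this sketch.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def resolve_cast_int(records: List[Record], field: str) -> List[Record]:
    """Mimic a cast:int resolution: each value of `field` becomes an int or None."""
    resolved = []
    for rec in records:
        out = dict(rec)
        try:
            out[field] = int(rec.get(field))
        except (TypeError, ValueError):
            # Assumption of this sketch: values that cannot be cast become None.
            out[field] = None
        resolved.append(out)
    return resolved


def resolve_make_cols(records: List[Record], field: str) -> List[Record]:
    """Mimic a make_cols resolution: split `field` into one column per type."""
    resolved = []
    for rec in records:
        out = {k: v for k, v in rec.items() if k != field}
        value = rec.get(field)
        out[f"{field}_int"] = value if isinstance(value, int) else None
        out[f"{field}_string"] = value if isinstance(value, str) else None
        resolved.append(out)
    return resolved


# Made-up sample: the same field arrives once as an int and once as a string.
students = [
    {"first_name": "Ann", "grade": 90},
    {"first_name": "Bob", "grade": "85"},
]

cast_rows = resolve_cast_int(students, "grade")
split_rows = resolve_make_cols(students, "grade")
print(cast_rows)
print(split_rows)
```

With the cast, a single int column remains, which lines up with the ("grade", "int", "grade", "int") mapping. The split variant shows how a make_cols step may produce suffixed columns such as grade_int.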
[Comments]:
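For some reason, when I replace applymapping1 with the two lines above, no data gets loaded, and I am left with an empty table with 2 grade columns: "grade" and "grade_int". Can you confirm?
Oops, this should be a project, not a cast. I updated the code.
Now I ran it with "project" instead. All the other columns were loaded, but not grade. Also, I still get 2 columns: "grade" and "grade_int". Can you confirm? Thanks
Hmm, can you add a datasource0.printSchema() after the resolveChoice line and tell me what the log shows for the schema? Note that I reverted to the cast in the solution, since that should work, and I added an AWS link for some background.
Sure, running it now!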
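Alongside printSchema(), it can help to check the raw input directly. The sketch below is plain Python and independent of Glue: it scans a JSON array of records and reports the fields whose values have more than one type, which is the situation the answer describes for grade. The helper name `mixed_type_fields` and the sample data are made up for illustration.

```python
import json
from collections import defaultdict
from typing import Dict, Set


def mixed_type_fields(json_text: str) -> Dict[str, Set[str]]:
    """Return the fields of a JSON array of objects whose values have more than one type."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for record in json.loads(json_text):
        for key, value in record.items():
            # Record the Python type name of each value seen for this field.
            seen[key].add(type(value).__name__)
    return {key: types for key, types in seen.items() if len(types) > 1}


# Made-up sample: grade is an int in one record and a string in the other.
sample = '''[
  {"first_name": "Ann", "subject": "Math", "grade": 90},
  {"first_name": "Bob", "subject": "Math", "grade": "85"}
]'''

report = mixed_type_fields(sample)
print(report)
```

Any field that appears in the report is a candidate for an explicit resolveChoice spec before the mapping step.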