Add new Column to Target Table in AWS Glue-ETL

【Title】: Add new Column to Target Table in AWS Glue-ETL 【Posted】: 2019-05-06 05:04:21 【Question】:

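I am new to AWS Glue ETL. I am trying to perform a simple calculation and add the derived column to the target table's column list. When I run the query I can see the data, but I am struggling to add it to my final dataset. Please help. Thanks.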

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "stg", table_name = "xyz", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "stg", table_name = "wind_gust", transformation_ctx = "datasource0")
## ==== Transformation ======
datasource0.toDF().createOrReplaceTempView("view_dyf")
sqlDF = spark.sql("select * from view_dyf").show()
## convert units from EU  to US units
us_unit_conv =spark.sql("""SELECT IF (mesurement_type = 'm s-1', round(units * 1.151,2),
                    IF (mesurement_type = 'm', round(units / 1609.344,2),
                      IF (mesurement_type = 'Pa', round(units /6894.757,2),0) )
                      )as new_unit
            from view_dyf""")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("time", "string", "Time", "string"), ("latitude", "double", "Latitude", "double"), ("longitude", "double", "Longitude", "double"), ("units", "double", "EU_Units", "double"), ("mesurement_type", "string", "EU_Unit_Type", "string"), ("variable_name", "string", "Variable_Name", "string")], transformation_ctx = "applymapping1")
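As a quick sanity check on the conversion itself, the nested IF in the query above can be mirrored in plain Python. This is only a sketch: the helper name `to_us_units` is hypothetical, the factors are copied from the query as-is, and Python's `round` may differ from Spark SQL's `round` on exact ties.

```python
def to_us_units(units, mesurement_type):
    """Plain-Python mirror of the nested IF in the query (hypothetical helper)."""
    if mesurement_type == 'm s-1':
        return round(units * 1.151, 2)
    if mesurement_type == 'm':
        return round(units / 1609.344, 2)
    if mesurement_type == 'Pa':
        return round(units / 6894.757, 2)
    # Same fallback as the SQL: any other measurement type yields 0
    return 0


# Print one converted value per branch, plus the fallback
for value, kind in [(10.0, 'm s-1'), (1609.344, 'm'), (6894.757, 'Pa'), (5.0, 'K')]:
    print(kind, to_us_units(value, kind))
```

Running a few known values through a helper like this makes it easier to confirm that the `new_unit` column coming out of the SQL holds what is expected before wiring it into the mapping.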

I added the new derived column, ("us_unit_conv", "double", "US_Units", "double"). Please see below:

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("time", "string", "Time", "string"), ("latitude", "double", "Latitude", "double"), ("longitude", "double", "Longitude", "double"), ("units", "double", "EU_Units", "double"), ("mesurement_type", "string", "EU_Unit_Type", "string"), ("us_unit_conv", "double", "US_Units", "double"), ("variable_name", "string", "Variable_Name", "string")], transformation_ctx = "applymapping1")

【Comments】:

【Answer 1】:

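I think you need to read more about ApplyMapping: link.

You specified the wrong frame: you passed datasource0, but it should be your new frame, us_unit_conv, since that is the frame you created that contains your new variable.

The mapping is also slightly wrong. ("us_unit_conv", "double", "US_Units", "double") should follow the form ("input_name", "input_type", "output_name", "output_type"), so in your case I guess it would be ("new_unit", "double", "US_Units", "double"). You also need to pass the rest of the variables through with SELECT *.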
from awsglue.dynamicframe import DynamicFrame

# SELECT * keeps the original columns; new_unit is the derived column
us_unit_conv = spark.sql("""SELECT *, IF (mesurement_type = 'm s-1', round(units * 1.151,2),
                    IF (mesurement_type = 'm', round(units / 1609.344,2),
                      IF (mesurement_type = 'Pa', round(units /6894.757,2),0) )
                      ) as new_unit
            from view_dyf""")

# spark.sql returns a Spark DataFrame, while ApplyMapping works on a DynamicFrame, so convert first
us_unit_conv_dyf = DynamicFrame.fromDF(us_unit_conv, glueContext, "us_unit_conv_dyf")

applymapping1 = ApplyMapping.apply(frame = us_unit_conv_dyf, mappings = [("new_unit", "double", "US_Units", "double"), ("time", "string", "Time", "string"), ("latitude", "double", "Latitude", "double"), ("longitude", "double", "Longitude", "double"), ("units", "double", "EU_Units", "double"), ("mesurement_type", "string", "EU_Unit_Type", "string"), ("variable_name", "string", "Variable_Name", "string")], transformation_ctx = "applymapping1")
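To make the ("input_name", "input_type", "output_name", "output_type") idea concrete, here is a toy model of such a mapping over plain Python dicts. It is not the Glue implementation: the function `apply_mapping` and the sample row are hypothetical, and skipping unknown source fields is an assumption made for illustration.

```python
def apply_mapping(rows, mappings):
    """Toy model of 4-tuple mappings on a list of dicts (not the Glue implementation)."""
    casts = {"string": str, "double": float, "int": int}
    out = []
    for row in rows:
        new_row = {}
        for src_name, _src_type, dst_name, dst_type in mappings:
            # Assumption for this sketch: a source field missing from the row is skipped
            if src_name in row:
                new_row[dst_name] = casts[dst_type](row[src_name])
        out.append(new_row)
    return out


# Hypothetical row, shaped like the output of the SELECT * query with new_unit
rows = [{"units": 10.0, "mesurement_type": "m s-1", "new_unit": 11.51}]
mappings = [
    ("units", "double", "EU_Units", "double"),
    ("new_unit", "double", "US_Units", "double"),    # names a real column
    ("us_unit_conv", "double", "Oops", "double"),    # names a Python variable, not a column
]
result = apply_mapping(rows, mappings)
print(result)
```

The point of the sketch is that the first element of each tuple has to name a field that actually exists in the frame being mapped. "new_unit" is a column of the query result, whereas "us_unit_conv" is only the name of the Python variable holding that result.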

【Comments】:
