Some columns become null when converting the data type of other columns in AWS Glue

Posted: 2020-10-06 02:39:41

Question:

I am trying to move CSV data from AWS S3 to AWS Redshift using AWS Glue. The data I'm moving records each entry's timestamp in a non-standard format (e.g. 01-JAN-2020 01.02.03), so my Glue crawler picks this column up as a string.
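As a quick sanity check (a minimal standalone sketch, not part of the job below), this format should parse with Spark's to_timestamp using the pattern dd-MMM-yyyy HH.mm.ss; the sample value is the one quoted above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, col

spark = SparkSession.builder.getOrCreate()
sample = spark.createDataFrame([("01-JAN-2020 01.02.03",)], ["rec_open_ts"])
# Should parse to 2020-01-01 01:02:03; a null here would mean the pattern is wrong
sample.withColumn("parsed", to_timestamp(col("rec_open_ts"), "dd-MMM-yyyy HH.mm.ss")).show(truncate=False)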

In my job script I use pyspark's to_timestamp function to convert this column to a timestamp, and that part seems to work fine. However, as a result, the columns with data type 'long' do not make it into Redshift: those columns arrive with all values null.

When I run the script without converting the timestamp column (i.e. just the generated script), the 'long' columns have no such problem and show up correctly in Redshift.

Here is my code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import to_timestamp, col

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "telenors3csvdata", table_name = "gprs_reports", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "telenors3csvdata", table_name = "gprs_reports", transformation_ctx = "datasource0")

# Convert to data frame and perform ETL
dataFrame = datasource0.toDF().withColumn("rec_open_ts", to_timestamp(col("rec_open_ts"),"dd-MMM-yyyy HH.mm.ss"))
# Convert back to a dynamic frame
editedData = DynamicFrame.fromDF(dataFrame, glueContext, "editedData")

## @type: ApplyMapping
## @args: [mapping = [("rec_open_ts", "timestamp", "rec_open_ts", "timestamp"), ("chg_id", "long", "chg_id", "long"), ("rec_seq_num", "long", "rec_seq_num", "long"), ("imsi", "long", "imsi", "long"), ("msisdn", "long", "msisdn", "long"), ("terminal_ip_address", "string", "terminal_ip_address", "string"), ("pdp_type", "long", "pdp_type", "long"), ("ggsn_ip_address", "string", "ggsn_ip_address", "string"), ("sgsn_ip_address", "string", "sgsn_ip_address", "string"), ("country", "string", "country", "string"), ("operator", "string", "operator", "string"), ("apn", "string", "apn", "string"), ("duration", "long", "duration", "long"), ("record_close_cause_code", "long", "record_close_cause_code", "long"), ("uploaded_data(b)", "long", "uploaded_data(b)", "long"), ("downloaded_data(b)", "long", "downloaded_data(b)", "long")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = editedData, mappings = [("rec_open_ts", "timestamp", "rec_open_ts", "timestamp"), ("chg_id", "long", "chg_id", "long"), ("rec_seq_num", "long", "rec_seq_num", "long"), ("imsi", "long", "imsi", "long"), ("msisdn", "long", "msisdn", "long"), ("terminal_ip_address", "string", "terminal_ip_address", "string"), ("pdp_type", "long", "pdp_type", "long"), ("ggsn_ip_address", "string", "ggsn_ip_address", "string"), ("sgsn_ip_address", "string", "sgsn_ip_address", "string"), ("country", "string", "country", "string"), ("operator", "string", "operator", "string"), ("apn", "string", "apn", "string"), ("duration", "long", "duration", "long"), ("record_close_cause_code", "long", "record_close_cause_code", "long"), ("uploaded_data(b)", "long", "uploaded_data(b)", "long"), ("downloaded_data(b)", "long", "downloaded_data(b)", "long")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [catalog_connection = "RedshiftCluster", connection_options = {"dbtable": "gprs_reports", "database": "telenordatasync"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "RedshiftCluster", connection_options = {"dbtable": "gprs_reports", "database": "telenordatasync"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()

Is there something obvious I'm missing here? Many thanks!

Edit:

The schema after running editedData.printSchema() is:

|-- rec_open_ts: timestamp
|-- chg_id: struct
|    |-- long: long
|    |-- string: string
|-- rec_seq_num: struct
|    |-- long: long
|    |-- string: string
|-- imsi: struct
|    |-- long: long
|    |-- string: string
|-- msisdn: struct
|    |-- long: long
|    |-- string: string
|-- terminal_ip_address: string
|-- pdp_type: struct
|    |-- long: long
|    |-- string: string
|-- ggsn_ip_address: string
|-- sgsn_ip_address: string
|-- country: string
|-- operator: string
|-- apn: string
|-- duration: struct
|    |-- long: long
|    |-- string: string
|-- record_close_cause_code: struct
|    |-- long: long
|    |-- string: string
|-- uploaded_data(b): struct
|    |-- long: long
|    |-- string: string
|-- downloaded_data(b): struct
|    |-- long: long
|    |-- string: string

(long is part of a struct?)

After running editedData.show(10), here is the data that should end up in Redshift; an example from one of the long columns:

 "chg_id": "long": 123456789, "string": null

Edit 2:

After running datasource0.printSchema() without the ETL (leaving the timestamp as a string), the schema is:

|-- rec_open_ts: string
|-- chg_id: choice
|    |-- long
|    |-- string
|-- rec_seq_num: choice
|    |-- long
|    |-- string
|-- imsi: choice
|    |-- long
|    |-- string
|-- msisdn: choice
|    |-- long
|    |-- string
|-- terminal_ip_address: string
|-- pdp_type: choice
|    |-- long
|    |-- string
|-- ggsn_ip_address: string
|-- sgsn_ip_address: string
|-- country: string
|-- operator: string
|-- apn: string
|-- duration: choice
|    |-- long
|    |-- string
|-- record_close_cause_code: choice
|    |-- long
|    |-- string
|-- uploaded_data(b): choice
|    |-- long
|    |-- string
|-- downloaded_data(b): choice
|    |-- long
|    |-- string

It seems that when I convert the timestamp column, the long columns turn into structs. Why is that?

Comments:

Can you verify editedData's data types and actual data with editedData.printSchema() and editedData.show(10)? — Yes, I'll add the results.

Answer 1:

For anyone else who runs into this, I've found the solution:

When a type is ambiguous (here, the crawler inferred long but the column contains at least one value that is not a long), the crawler records the type as a choice between the inferred type and string. If that ambiguity is never resolved, the choice becomes a struct when the DynamicFrame is converted to a DataFrame, and the column does not load correctly into Redshift.
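As a minimal sketch of the DynamicFrame call involved (column name taken from the schema above): resolveChoice takes per-column specs, where the action 'cast:long' casts both sides of the choice to long, turning non-castable values into null; Glue also provides actions such as 'project:long' (keep only the long side) and 'make_cols' (split the choice into separate columns):

# Resolve the chg_id choice before converting to a DataFrame;
# values that cannot be cast to long become null
resolved = datasource0.resolveChoice(specs = [("chg_id", "cast:long")])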

So, before doing any ETL, I resolve those choices with the resolveChoice method. Here is my updated code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import to_timestamp, col

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "telenors3csvdata", table_name = "gprs_reports", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "telenors3csvdata", table_name = "gprs_reports", transformation_ctx = "datasource0")

# Resolve type choices
resolvedData = datasource0.resolveChoice(specs = [('chg_id','cast:long')]).resolveChoice(specs = [('rec_seq_num','cast:long')]).resolveChoice(specs = [('imsi','cast:long')]).resolveChoice(specs = [('msisdn','cast:long')]).resolveChoice(specs = [('pdp_type','cast:long')]).resolveChoice(specs = [('duration','cast:long')]).resolveChoice(specs = [('record_close_cause_code','cast:long')]).resolveChoice(specs = [('uploaded_data(b)','cast:long')]).resolveChoice(specs = [('downloaded_data(b)','cast:long')])

# Convert to data frame and perform ETL
dataFrame = resolvedData.toDF().withColumn("rec_open_ts", to_timestamp(col("rec_open_ts"),"dd-MMM-yyyy HH.mm.ss"))
# Convert back to a dynamic frame
editedData = DynamicFrame.fromDF(dataFrame, glueContext, "editedData")

print("Printed Schema")
editedData.printSchema()

## @type: ApplyMapping
## @args: [mapping = [("rec_open_ts", "timestamp", "rec_open_ts", "timestamp"), ("chg_id", "long", "chg_id", "long"), ("rec_seq_num", "long", "rec_seq_num", "long"), ("imsi", "long", "imsi", "long"), ("msisdn", "long", "msisdn", "long"), ("terminal_ip_address", "string", "terminal_ip_address", "string"), ("pdp_type", "long", "pdp_type", "long"), ("ggsn_ip_address", "string", "ggsn_ip_address", "string"), ("sgsn_ip_address", "string", "sgsn_ip_address", "string"), ("country", "string", "country", "string"), ("operator", "string", "operator", "string"), ("apn", "string", "apn", "string"), ("duration", "long", "duration", "long"), ("record_close_cause_code", "long", "record_close_cause_code", "long"), ("uploaded_data(b)", "long", "uploaded_data(b)", "long"), ("downloaded_data(b)", "long", "downloaded_data(b)", "long")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = editedData, mappings = [("rec_open_ts", "timestamp", "rec_open_ts", "timestamp"), ("chg_id", "long", "chg_id", "long"), ("rec_seq_num", "long", "rec_seq_num", "long"), ("imsi", "long", "imsi", "long"), ("msisdn", "long", "msisdn", "long"), ("terminal_ip_address", "string", "terminal_ip_address", "string"), ("pdp_type", "long", "pdp_type", "long"), ("ggsn_ip_address", "string", "ggsn_ip_address", "string"), ("sgsn_ip_address", "string", "sgsn_ip_address", "string"), ("country", "string", "country", "string"), ("operator", "string", "operator", "string"), ("apn", "string", "apn", "string"), ("duration", "long", "duration", "long"), ("record_close_cause_code", "long", "record_close_cause_code", "long"), ("uploaded_data(b)", "long", "uploaded_data(b)", "long"), ("downloaded_data(b)", "long", "downloaded_data(b)", "long")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [catalog_connection = "RedshiftCluster", connection_options = {"dbtable": "gprs_reports", "database": "telenordatasync"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "RedshiftCluster", connection_options = {"dbtable": "gprs_reports", "database": "telenordatasync"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()
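One readability note on the code above: since specs accepts a list of (column, action) tuples, the nine chained resolveChoice calls can be collapsed into a single call that should behave identically:

# Same resolution as the chained calls, in one pass
long_cols = ["chg_id", "rec_seq_num", "imsi", "msisdn", "pdp_type", "duration",
             "record_close_cause_code", "uploaded_data(b)", "downloaded_data(b)"]
resolvedData = datasource0.resolveChoice(specs = [(c, "cast:long") for c in long_cols])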
