AWS Glue Job writes Null to Redshift
【Title】: AWS Glue Job writes Null to Redshift 【Posted】: 2021-12-06 12:56:07 【Question】: I have multiple JSON files in an S3 bucket folder. Every file has the same schema, an array (list) of JSON objects, as in the samples below.
file1
[{"coinRank":1,"coinId":"bitcoin","coinName":"Bitcoin","coinSymbol":"BTC","coinLoc":"bitcoin","coinPrice":53501.08,"coin1hrChange":-0.6,"coin24hrChange":-6.0,"coin7dChange":-9.2,"coin24hrVol":38266934579,"coinMarketCap":1012650219321,"fetchTime":"2021-12-03 23:55:42.654921","rankDate":"2021-12-03","rate":409.98,"coinPriceNaira":21934372.7784000002},{"coinRank":2,"coinId":"ethereum","coinName":"Ethereum","coinSymbol":"ETH","coinLoc":"ethereum","coinPrice":4225.28,"coin1hrChange":-0.3,"coin24hrChange":-7.2,"coin7dChange":-6.4,"coin24hrVol":27395766224,"coinMarketCap":502376237337,"fetchTime":"2021-12-03 23:55:42.655698","rankDate":"2021-12-03","rate":409.98,"coinPriceNaira":1732280.2944},{"coinRank":3,"coinId":"binancecoin","coinName":"Binance Coin","coinSymbol":"BNB","coinLoc":"binance-coin","coinPrice":593.95,"coin1hrChange":-0.7,"coin24hrChange":-4.9,"coin7dChange":-6.9,"coin24hrVol":2379210538,"coinMarketCap":100022794436,"fetchTime":"2021-12-03 23:55:42.656393","rankDate":"2021-12-03","rate":409.98,"coinPriceNaira":243507.621}]
file2
[{"coinRank":1,"coinId":"bitcoin","coinName":"Bitcoin","coinSymbol":"BTC","coinLoc":"bitcoin","coinPrice":52936.1,"coin1hrChange":-1.5,"coin24hrChange":-6.5,"coin7dChange":-1.7,"coin24hrVol":38241025550,"coinMarketCap":998999157967,"fetchTime":"2021-12-04 02:33:23.182164","rankDate":"2021-12-04","rate":409.98,"coinPriceNaira":21702742.2780000009},{"coinRank":2,"coinId":"ethereum","coinName":"Ethereum","coinSymbol":"ETH","coinLoc":"ethereum","coinPrice":4159.85,"coin1hrChange":-1.4,"coin24hrChange":-8.1,"coin7dChange":2.8,"coin24hrVol":28661534477,"coinMarketCap":493429600914,"fetchTime":"2021-12-04 02:33:23.182785","rankDate":"2021-12-04","rate":409.98,"coinPriceNaira":1705455.3030000003},{"coinRank":3,"coinId":"binancecoin","coinName":"Binance Coin","coinSymbol":"BNB","coinLoc":"binance-coin","coinPrice":582.32,"coin1hrChange":-1.9,"coin24hrChange":-5.4,"coin7dChange":-0.6,"coin24hrVol":1059743631,"coinMarketCap":97824378011,"fetchTime":"2021-12-04 02:33:23.183415","rankDate":"2021-12-04","rate":409.98,"coinPriceNaira":238739.5536}]
file3
[{"coinRank":1,"coinId":"bitcoin","coinName":"Bitcoin","coinSymbol":"BTC","coinLoc":"bitcoin","coinPrice":49375.27,"coin1hrChange":-0.7,"coin24hrChange":4.3,"coin7dChange":-9.5,"coin24hrVol":35860857801.0,"coinMarketCap":932932346783,"fetchTime":"2021-12-05 14:34:49.339803","rankDate":"2021-12-05","rate":410.764648,"coinPriceNaira":20281615.4014549591},{"coinRank":2,"coinId":"ethereum","coinName":"Ethereum","coinSymbol":"ETH","coinLoc":"ethereum","coinPrice":4218.99,"coin1hrChange":-0.7,"coin24hrChange":7.1,"coin7dChange":3.3,"coin24hrVol":27778808883.0,"coinMarketCap":500688046117,"fetchTime":"2021-12-05 14:34:49.340495","rankDate":"2021-12-05","rate":410.764648,"coinPriceNaira":1733011.9422655201},{"coinRank":3,"coinId":"binancecoin","coinName":"Binance Coin","coinSymbol":"BNB","coinLoc":"binance-coin","coinPrice":574.23,"coin1hrChange":-0.5,"coin24hrChange":5.2,"coin7dChange":-4.0,"coin24hrVol":2265817636.0,"coinMarketCap":96576091895,"fetchTime":"2021-12-05 14:34:49.341177","rankDate":"2021-12-05","rate":410.764648,"coinPriceNaira":235873.38382104}]
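I used an AWS Glue crawler with a classifier on the JSON path $[*] to split each array into individual JSON objects.
The records are split as expected, and I can confirm that the record count in the Data Catalog matches the record count in the files.
However, when I push the data to Redshift, some columns come through as null. I can also share my Glue script if necessary.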
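As a rough local illustration of what the JSON path $[*] selects, the sketch below treats every element of a top-level array as one record and counts them, which is one way to cross-check the Data Catalog count. The sample string is a trimmed stand-in for one of the files above, and split_records is a hypothetical helper, not part of Glue.

```python
import json

# Trimmed stand-in for one input file: a top-level JSON array of objects.
SAMPLE_FILE = '''
[{"coinRank": 1, "coinId": "bitcoin", "coinPrice": 53501.08},
 {"coinRank": 2, "coinId": "ethereum", "coinPrice": 4225.28},
 {"coinRank": 3, "coinId": "binancecoin", "coinPrice": 593.95}]
'''


def split_records(text):
    """Return each element of a top-level JSON array as its own record,
    mirroring what the JSON path $[*] selects."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    return data


records = split_records(SAMPLE_FILE)
print(len(records))                      # number of records in this file
print([r["coinId"] for r in records])    # one entry per split record
```

Summing these per-file counts gives a number that can be compared with the row count the crawler reports.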
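【Answer 1】: I found the problem in the dataset. The DataFrame inferred two different data types, int64 and float64, for the same columns. In the sample files, for instance, coin24hrVol is an integer in file1 and file2 but a float in file3. When Glue created the table in Redshift, it created the numeric columns as double precision (float64), so the records that held integers were not converted correctly in Redshift.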
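The mismatch described here can be checked locally before running the job. The following is a minimal sketch using only the standard library: it parses each file and reports the columns where both integer and float values occur. The two strings are trimmed stand-ins with values copied from file1 and file3, and mixed_numeric_columns is a hypothetical helper name.

```python
import json
from collections import defaultdict

# Trimmed stand-ins for two input files; values copied from the samples.
FILES = {
    "file1": '[{"coinId": "bitcoin", "coin24hrVol": 38266934579, "coinPrice": 53501.08}]',
    "file3": '[{"coinId": "bitcoin", "coin24hrVol": 35860857801.0, "coinPrice": 49375.27}]',
}


def mixed_numeric_columns(files):
    """Collect the Python types seen per column across all files and
    return only the columns where both int and float appear."""
    seen = defaultdict(set)
    for text in files.values():
        for record in json.loads(text):
            for column, value in record.items():
                seen[column].add(type(value).__name__)
    return {
        column: sorted(types)
        for column, types in seen.items()
        if {"int", "float"} <= types
    }


print(mixed_numeric_columns(FILES))
```

Any column this reports is a candidate for an explicit cast before the data is written.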
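- I used the .astype() function to specify the column types manually in the Pandas DataFrame.
- I dropped the table in Redshift and also deleted the table from the Data Catalog database.
- I re-ran the crawler and then re-ran the job.
Now every data point shows up correctly in Redshift.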
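The answer does not show the actual cast, so the following is only a sketch of the .astype() step under stated assumptions: the target type is float64 (chosen here because the Redshift columns were double precision), and FLOAT_COLUMNS is a hypothetical subset of the numeric columns that would need to be extended to the real schema.

```python
import pandas as pd

# Hypothetical subset of numeric columns; extend to match the real schema.
FLOAT_COLUMNS = ["coinPrice", "coin24hrVol", "coinMarketCap"]


def normalize_types(df):
    """Cast the listed numeric columns to float64 so that every file
    produces the same dtype for the same column."""
    return df.astype({column: "float64" for column in FLOAT_COLUMNS})


# One record from file1 (integer volume) and one from file3 (float volume).
df_a = pd.DataFrame([{"coinId": "bitcoin", "coinPrice": 53501.08,
                      "coin24hrVol": 38266934579, "coinMarketCap": 1012650219321}])
df_b = pd.DataFrame([{"coinId": "bitcoin", "coinPrice": 49375.27,
                      "coin24hrVol": 35860857801.0, "coinMarketCap": 932932346783}])

# Before the cast, pandas infers a different dtype for coin24hrVol per file.
print(df_a.dtypes["coin24hrVol"], df_b.dtypes["coin24hrVol"])

fixed_a, fixed_b = normalize_types(df_a), normalize_types(df_b)
# After the cast, both frames share one schema.
print(fixed_a.dtypes.equals(fixed_b.dtypes))
```

Casting to an explicit dtype map, rather than relying on inference, keeps the schema stable even when a later file happens to contain only whole numbers.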