R - Cleaning Data for Google BigQuery Import
Asked: 2020-03-01 16:43:42
Question: I am cleaning some datasets in R for import into Google BigQuery. The cleaning replaces extreme or incorrect values with NA while keeping the remaining values in each row. After cleaning the data in R with the code below, the data frame contains what I expect, and I then export the results to a CSV file to load into Google BigQuery for analysis.
However, when I try to load the data into BQ, I get the following error message:
Error while reading data, error message: Could not parse 'NA' as double for field Sea_Level_PressureIn (position 8) starting at location 132014
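It appears that BQ cannot handle the NA values written by R. According to the BQ documentation, empty values are replaced with nulls when the file is read into a table. Is there a different way to create "empty" values in R during cleaning so that the file imports into BQ? I tried using empty strings, but that changes the column data type to character rather than the integer or float needed for analysis.
R code: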
osh$TemperatureF[osh$TemperatureF==-9999]<-NA
osh$Dew.PointF[osh$Dew.PointF==-9999]<-NA
osh$Sea.Level.PressureIn[osh$Sea.Level.PressureIn==-9999]<-NA
osh$VisibilityMPH[osh$VisibilityMPH==-9999]<-NA
osh$Wind.SpeedMPH[osh$Wind.SpeedMPH=="Calm"]<-NA
osh$Wind.Direction[osh$Wind.Direction=="Calm"]<-NA
osh$Gust.SpeedMPH[osh$Gust.SpeedMPH=='-']<-NA
osh$PrecipitationIn[osh$PrecipitationIn=='N/A']<-NA
osh$Events[osh$Events=='']<-NA
osh[,c(7,11,12,13)] <- sapply(osh[,c(7,11,12,13)], as.numeric)
summary(osh)
iowa$TemperatureF[iowa$TemperatureF==-9999]<-NA
iowa$Dew.PointF[iowa$Dew.PointF==-9999]<-NA
iowa$Sea.Level.PressureIn[iowa$Sea.Level.PressureIn==-9999]<-NA
iowa$VisibilityMPH[iowa$VisibilityMPH==-9999]<-NA
iowa$Wind.SpeedMPH[iowa$Wind.SpeedMPH=="Calm"]<-NA
iowa$Wind.Direction[iowa$Wind.Direction=="Calm"]<-NA
iowa$Gust.SpeedMPH[iowa$Gust.SpeedMPH=='-']<-NA
iowa$PrecipitationIn[iowa$PrecipitationIn=='N/A']<-NA
iowa$Events[iowa$Events=='']<-NA
iowa[,c(7,11,12,13)] <- sapply(iowa[,c(7,11,12,13)], as.numeric)
summary(iowa)
Summary of data file 1 (after cleaning):
Oshkosh Summary:
Year Month Day TimeCST TemperatureF Dew.PointF Humidity Sea.Level.PressureIn VisibilityMPH Wind.Direction Wind.SpeedMPH Gust.SpeedMPH
Min. :2000 Min. : 1.0 Min. : 1.00 1:53 AM : 5737 Min. :-18.90 Min. :-29.9 Min. : 1.00 Min. :28.79 Min. : 0.100 South :17644 Min. : 1.00 Min. : 2.00
1st Qu.:2004 1st Qu.: 3.0 1st Qu.: 8.00 7:53 PM : 5734 1st Qu.: 30.20 1st Qu.: 23.0 1st Qu.:47.00 1st Qu.:29.83 1st Qu.: 5.000 SSW :17283 1st Qu.: 5.00 1st Qu.: 7.00
Median :2008 Median : 6.0 Median :16.00 12:53 AM: 5726 Median : 46.00 Median : 37.0 Median :61.00 Median :29.98 Median :10.000 West :16789 Median :29.00 Median : 9.00
Mean :2008 Mean : 6.5 Mean :15.74 6:53 PM : 5725 Mean : 45.78 Mean : 37.3 Mean :57.21 Mean :29.98 Mean : 7.766 WNW :14209 Mean :21.78 Mean :10.14
3rd Qu.:2012 3rd Qu.:10.0 3rd Qu.:23.00 3:53 PM : 5718 3rd Qu.: 63.00 3rd Qu.: 54.0 3rd Qu.:71.00 3rd Qu.:30.13 3rd Qu.:10.000 North : 9862 3rd Qu.:37.00 3rd Qu.:13.00
Max. :2015 Max. :12.0 Max. :31.00 4:53 AM : 5714 Max. :100.00 Max. : 80.6 Max. :83.00 Max. :30.91 Max. :10.500 (Other):83876 Max. :39.00 Max. :39.00
(Other) :142441 NA's :224 NA's :449 NA's :247 NA's :181 NA's :17132 NA's :17132 NA's :148781
PrecipitationIn Events Conditions WindDirDegrees
Min. : 1.00 Snow : 12356 Clear :68796 Min. : 0.0
1st Qu.: 1.00 Rain : 11264 Overcast :45823 1st Qu.: 70.0
Median : 1.00 Fog : 2281 Light Snow :12266 Median :190.0
Mean : 3.99 Rain-Thunderstorm: 1835 Mostly Cloudy:12090 Mean :174.3
3rd Qu.: 3.00 Fog-Snow : 766 Light Rain : 8486 3rd Qu.:270.0
Max. :111.00 (Other) : 976 Partly Cloudy: 7525 Max. :360.0
NA's :140721 NA's :147317 (Other) :21809
Summary of data file 2 (after cleaning):
Iowa Summary:
Year Month Day TimeCST TemperatureF Dew.PointF Humidity Sea.Level.PressureIn VisibilityMPH Wind.Direction Wind.SpeedMPH
Min. :2000 Min. : 1.000 Min. : 1.00 5:52 PM : 4532 Min. :-24.00 Min. :-27.90 Min. : 1.0 Min. : 2.96 Min. : 0.200 NW :14823 Min. : 1.0
1st Qu.:2004 1st Qu.: 3.000 1st Qu.: 8.00 6:52 PM : 4520 1st Qu.: 33.10 1st Qu.: 26.10 1st Qu.:45.0 1st Qu.:29.87 1st Qu.: 7.000 South :12234 1st Qu.: 9.0
Median :2008 Median : 6.000 Median :16.00 12:52 AM: 4517 Median : 50.00 Median : 42.10 Median :63.0 Median :30.01 Median :10.000 SE :11264 Median :39.0
Mean :2008 Mean : 6.478 Mean :15.74 3:52 PM : 4509 Mean : 49.31 Mean : 41.05 Mean :56.8 Mean :30.01 Mean : 8.092 West :11239 Mean :30.2
3rd Qu.:2011 3rd Qu.:10.000 3rd Qu.:23.00 11:52 PM: 4507 3rd Qu.: 66.90 3rd Qu.: 59.00 3rd Qu.:74.0 3rd Qu.:30.15 3rd Qu.:10.000 North :10727 3rd Qu.:48.0
Max. :2015 Max. :12.000 Max. :31.00 4:52 PM : 4505 Max. :104.00 Max. : 82.00 Max. :85.0 Max. :30.99 Max. :10.000 (Other):83483 Max. :50.0
(Other) :145739 NA's :354 NA's :953 NA's :96 NA's :579 NA's :29059 NA's :29059
Gust.SpeedMPH PrecipitationIn Events Conditions WindDirDegrees
Min. : 1.00 Min. : 1.00 Rain : 11665 Clear :52936 Min. : 0.0
1st Qu.:17.00 1st Qu.: 2.00 Snow : 8188 Overcast :40694 1st Qu.: 40.0
Median :20.00 Median : 3.00 Rain-Thunderstorm: 3597 Partly Cloudy:29737 Median :150.0
Mean :20.89 Mean : 6.34 Fog : 1709 Mostly Cloudy:12238 Mean :160.2
3rd Qu.:24.00 3rd Qu.: 5.00 Thunderstorm : 1223 Light Rain : 9577 3rd Qu.:280.0
Max. :72.00 Max. :145.00 (Other) : 567 Light Snow : 8177 Max. :360.0
NA's :146525 NA's :139963 NA's :145880 (Other) :19470 NA's :45
Any ideas?
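A side note on the conversion step, offered as a hedged observation rather than a diagnosis: the summaries above show PrecipitationIn maxima of 111 and 145 and a minimum of 1.00 in both files, which is what one would expect if columns 7, 11, 12 and 13 had been read as factors (the read.csv default before R 4.0.0) and as.numeric had returned the internal level codes instead of the readings. The sketch below uses a hypothetical wind vector, not the real data, to show the difference between converting a factor directly and converting it through character first.

```r
# Hypothetical column mimicking Wind.SpeedMPH after being read in as a factor.
wind <- factor(c("9.2", "Calm", "13.8", "5.8"))

# Same cleaning step as in the question: mark "Calm" as missing.
wind[wind == "Calm"] <- NA

# Direct conversion returns the factor's integer level codes.
codes <- as.numeric(wind)

# Converting to character first recovers the numbers that were in the file.
values <- as.numeric(as.character(wind))

print(codes)
print(values)
```

If the columns turn out to be character rather than factor, as.numeric alone is already correct and this caveat does not apply; str(osh) would show which case holds.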
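Answer 1: After working through this, the following steps worked:
- Create an empty table in BigQuery with the desired schema.
- Load the table using the CLI, with the --null_marker flag set to the string that represents missing values in the file, here "NA". The full load command is shown below.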
bq load --null_marker="NA" --skip_leading_rows=1 --source_format=CSV data-set.tableName gs://fileLocation
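An alternative that stays on the R side, sketched under the assumption that the CSV is produced with write.csv (the question does not show the export call): the na argument sets the string written for missing values. The data frame can then keep real NA values, and therefore its numeric column types, while the exported file contains empty fields. The small osh data frame here is a hypothetical stand-in for the cleaned data.

```r
# Hypothetical stand-in for the cleaned weather data frame.
osh <- data.frame(
  TemperatureF = c(30.2, NA, 46.0),
  Sea.Level.PressureIn = c(29.83, 29.98, NA)
)

out_file <- tempfile(fileext = ".csv")

# na = "" writes missing values as empty fields instead of the literal NA;
# row.names = FALSE keeps R's row index out of the file.
write.csv(osh, out_file, na = "", row.names = FALSE)

# Inspect the exported text to confirm that no NA strings remain.
csv_lines <- readLines(out_file)
print(csv_lines)

# The columns in R are still numeric; only the file representation changed.
str(osh)
```

With a file written this way, the load should not need a custom --null_marker, since, as the question notes from the BQ documentation, empty values are read in as nulls.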