R - Cleaning Data for Google BigQuery Import

[Title]: R - Cleaning Data for Google BigQuery Import [Posted]: 2020-03-01 16:43:42 [Question]:

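I am cleaning some datasets in R for import into Google BigQuery. The cleaning involves replacing extreme or incorrect values with NA while keeping the remaining values in each row. After cleaning the data in R with the code below, the data frame contains the results I expect, and I then export it to a CSV file to load into Google BigQuery for analysis.

However, when I try to load the data into BQ, I get the following error message: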

Error while reading data, error message: Could not parse 'NA' as double for field Sea_Level_PressureIn (position 8) starting at location 132014

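BQ appears unable to handle the NA values that R creates. According to the BQ documentation, empty values are replaced with nulls when a table is read in. Is there a different way to create "empty" values in R when cleaning data for import into BQ? I tried using empty strings, but that changes the column's data type to character rather than the integer or float needed for analysis.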

R code:

# Replace sentinel / placeholder values with NA, column by column
osh$TemperatureF[osh$TemperatureF == -9999] <- NA
osh$Dew.PointF[osh$Dew.PointF == -9999] <- NA
osh$Sea.Level.PressureIn[osh$Sea.Level.PressureIn == -9999] <- NA
osh$VisibilityMPH[osh$VisibilityMPH == -9999] <- NA
osh$Wind.SpeedMPH[osh$Wind.SpeedMPH == "Calm"] <- NA
osh$Wind.Direction[osh$Wind.Direction == "Calm"] <- NA
osh$Gust.SpeedMPH[osh$Gust.SpeedMPH == "-"] <- NA
osh$PrecipitationIn[osh$PrecipitationIn == "N/A"] <- NA
osh$Events[osh$Events == ""] <- NA
# Convert the text-coded columns to numeric. Going through as.character()
# avoids getting factor level codes when a column was read in as a factor.
num_cols <- c(7, 11, 12, 13)
osh[, num_cols] <- lapply(osh[, num_cols], function(x) as.numeric(as.character(x)))
summary(osh)

# Same cleaning steps for the Iowa data
iowa$TemperatureF[iowa$TemperatureF == -9999] <- NA
iowa$Dew.PointF[iowa$Dew.PointF == -9999] <- NA
iowa$Sea.Level.PressureIn[iowa$Sea.Level.PressureIn == -9999] <- NA
iowa$VisibilityMPH[iowa$VisibilityMPH == -9999] <- NA
iowa$Wind.SpeedMPH[iowa$Wind.SpeedMPH == "Calm"] <- NA
iowa$Wind.Direction[iowa$Wind.Direction == "Calm"] <- NA
iowa$Gust.SpeedMPH[iowa$Gust.SpeedMPH == "-"] <- NA
iowa$PrecipitationIn[iowa$PrecipitationIn == "N/A"] <- NA
iowa$Events[iowa$Events == ""] <- NA
num_cols <- c(7, 11, 12, 13)
iowa[, num_cols] <- lapply(iowa[, num_cols], function(x) as.numeric(as.character(x)))
summary(iowa)
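
Since the same cleaning steps are applied to both data frames, they could be folded into one helper. The following is only a minimal sketch: `clean_weather`, `sentinels`, and the toy `demo` data frame are hypothetical names introduced here, and the numeric column names are an assumption that positions 7, 11, 12, 13 correspond to Humidity, Wind.SpeedMPH, Gust.SpeedMPH, and PrecipitationIn, as the column order in the summaries suggests.

```r
# Hypothetical mapping of column name -> placeholder value to turn into NA,
# taken from the per-column replacements in the question.
sentinels <- list(
  TemperatureF = -9999, Dew.PointF = -9999, Sea.Level.PressureIn = -9999,
  VisibilityMPH = -9999, Wind.SpeedMPH = "Calm", Wind.Direction = "Calm",
  Gust.SpeedMPH = "-", PrecipitationIn = "N/A", Events = ""
)

# Hypothetical helper: replaces each column's placeholder with NA, then
# converts the named text-coded columns to numeric.
clean_weather <- function(df, sentinels, numeric_cols) {
  for (col in names(sentinels)) {
    if (!col %in% names(df)) next          # skip columns this data frame lacks
    x <- df[[col]]
    if (is.factor(x)) x <- as.character(x) # work on values, not level codes
    x[!is.na(x) & x == sentinels[[col]]] <- NA
    df[[col]] <- x
  }
  for (col in numeric_cols) {
    # as.character() first, so a factor yields its labels rather than codes
    df[[col]] <- as.numeric(as.character(df[[col]]))
  }
  df
}

# Toy data standing in for the real weather files (not the asker's data).
demo <- data.frame(
  TemperatureF = c(30.2, -9999, 46),
  Wind.SpeedMPH = c("Calm", "10.4", "3.5"),
  stringsAsFactors = TRUE
)

cleaned <- clean_weather(demo,
                         sentinels[c("TemperatureF", "Wind.SpeedMPH")],
                         numeric_cols = "Wind.SpeedMPH")
str(cleaned)
```

With the real data this would be called once per data frame, e.g. `osh <- clean_weather(osh, sentinels, c("Humidity", "Wind.SpeedMPH", "Gust.SpeedMPH", "PrecipitationIn"))`. Any non-numeric text that is left in those columns would become NA with a coercion warning.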

Summary of data file 1 (after cleaning):

Oshkosh Summary:
      Year          Month           Day            TimeCST        TemperatureF      Dew.PointF       Humidity     Sea.Level.PressureIn VisibilityMPH    Wind.Direction  Wind.SpeedMPH   Gust.SpeedMPH   
 Min.   :2000   Min.   : 1.0   Min.   : 1.00   1:53 AM :  5737   Min.   :-18.90   Min.   :-29.9   Min.   : 1.00   Min.   :28.79        Min.   : 0.100   South  :17644   Min.   : 1.00   Min.   : 2.00   
 1st Qu.:2004   1st Qu.: 3.0   1st Qu.: 8.00   7:53 PM :  5734   1st Qu.: 30.20   1st Qu.: 23.0   1st Qu.:47.00   1st Qu.:29.83        1st Qu.: 5.000   SSW    :17283   1st Qu.: 5.00   1st Qu.: 7.00   
 Median :2008   Median : 6.0   Median :16.00   12:53 AM:  5726   Median : 46.00   Median : 37.0   Median :61.00   Median :29.98        Median :10.000   West   :16789   Median :29.00   Median : 9.00   
 Mean   :2008   Mean   : 6.5   Mean   :15.74   6:53 PM :  5725   Mean   : 45.78   Mean   : 37.3   Mean   :57.21   Mean   :29.98        Mean   : 7.766   WNW    :14209   Mean   :21.78   Mean   :10.14   
 3rd Qu.:2012   3rd Qu.:10.0   3rd Qu.:23.00   3:53 PM :  5718   3rd Qu.: 63.00   3rd Qu.: 54.0   3rd Qu.:71.00   3rd Qu.:30.13        3rd Qu.:10.000   North  : 9862   3rd Qu.:37.00   3rd Qu.:13.00   
 Max.   :2015   Max.   :12.0   Max.   :31.00   4:53 AM :  5714   Max.   :100.00   Max.   : 80.6   Max.   :83.00   Max.   :30.91        Max.   :10.500   (Other):83876   Max.   :39.00   Max.   :39.00   
                                               (Other) :142441   NA's   :224      NA's   :449                     NA's   :247          NA's   :181      NA's   :17132   NA's   :17132   NA's   :148781  
 PrecipitationIn                Events               Conditions    WindDirDegrees 
 Min.   :  1.00   Snow             : 12356   Clear        :68796   Min.   :  0.0  
 1st Qu.:  1.00   Rain             : 11264   Overcast     :45823   1st Qu.: 70.0  
 Median :  1.00   Fog              :  2281   Light Snow   :12266   Median :190.0  
 Mean   :  3.99   Rain-Thunderstorm:  1835   Mostly Cloudy:12090   Mean   :174.3  
 3rd Qu.:  3.00   Fog-Snow         :   766   Light Rain   : 8486   3rd Qu.:270.0  
 Max.   :111.00   (Other)          :   976   Partly Cloudy: 7525   Max.   :360.0  
 NA's   :140721   NA's             :147317   (Other)      :21809      

Summary of data file 2 (after cleaning):

Iowa Summary:
     Year          Month             Day            TimeCST        TemperatureF      Dew.PointF        Humidity    Sea.Level.PressureIn VisibilityMPH    Wind.Direction  Wind.SpeedMPH  
 Min.   :2000   Min.   : 1.000   Min.   : 1.00   5:52 PM :  4532   Min.   :-24.00   Min.   :-27.90   Min.   : 1.0   Min.   : 2.96        Min.   : 0.200   NW     :14823   Min.   : 1.0   
 1st Qu.:2004   1st Qu.: 3.000   1st Qu.: 8.00   6:52 PM :  4520   1st Qu.: 33.10   1st Qu.: 26.10   1st Qu.:45.0   1st Qu.:29.87        1st Qu.: 7.000   South  :12234   1st Qu.: 9.0   
 Median :2008   Median : 6.000   Median :16.00   12:52 AM:  4517   Median : 50.00   Median : 42.10   Median :63.0   Median :30.01        Median :10.000   SE     :11264   Median :39.0   
 Mean   :2008   Mean   : 6.478   Mean   :15.74   3:52 PM :  4509   Mean   : 49.31   Mean   : 41.05   Mean   :56.8   Mean   :30.01        Mean   : 8.092   West   :11239   Mean   :30.2   
 3rd Qu.:2011   3rd Qu.:10.000   3rd Qu.:23.00   11:52 PM:  4507   3rd Qu.: 66.90   3rd Qu.: 59.00   3rd Qu.:74.0   3rd Qu.:30.15        3rd Qu.:10.000   North  :10727   3rd Qu.:48.0   
 Max.   :2015   Max.   :12.000   Max.   :31.00   4:52 PM :  4505   Max.   :104.00   Max.   : 82.00   Max.   :85.0   Max.   :30.99        Max.   :10.000   (Other):83483   Max.   :50.0   
                                                 (Other) :145739   NA's   :354      NA's   :953                     NA's   :96           NA's   :579      NA's   :29059   NA's   :29059  
 Gust.SpeedMPH    PrecipitationIn                Events               Conditions    WindDirDegrees 
 Min.   : 1.00    Min.   :  1.00   Rain             : 11665   Clear        :52936   Min.   :  0.0  
 1st Qu.:17.00    1st Qu.:  2.00   Snow             :  8188   Overcast     :40694   1st Qu.: 40.0  
 Median :20.00    Median :  3.00   Rain-Thunderstorm:  3597   Partly Cloudy:29737   Median :150.0  
 Mean   :20.89    Mean   :  6.34   Fog              :  1709   Mostly Cloudy:12238   Mean   :160.2  
 3rd Qu.:24.00    3rd Qu.:  5.00   Thunderstorm     :  1223   Light Rain   : 9577   3rd Qu.:280.0  
 Max.   :72.00    Max.   :145.00   (Other)          :   567   Light Snow   : 8177   Max.   :360.0  
 NA's   :146525   NA's   :139963   NA's             :145880   (Other)      :19470   NA's   :45   

Any ideas?

[Answer 1]:

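After working through this, the following steps worked:

    1. Create an empty table in BigQuery with the desired schema.
    2. Load the table using the CLI, with the --null_marker flag set to the string that represents missing values ("NA"). The full load command is shown below.

bq load --null_marker="NA" --skip_leading_rows=1 --source_format=CSV data-set.tableName gs://fileLocation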
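
An alternative on the R side, which is not part of the original answer, is to write the NA values as empty fields at export time, since the question notes that BQ reads empty values as nulls. A minimal sketch, assuming the export is done with `write.csv`; the `cleaned` data frame and `out_file` path are hypothetical stand-ins:

```r
# Toy data frame standing in for the cleaned weather data.
cleaned <- data.frame(
  TemperatureF = c(30.2, NA, 46),
  Sea.Level.PressureIn = c(29.83, 29.98, NA)
)

# Hypothetical output path; a temp file keeps the sketch self-contained.
out_file <- tempfile(fileext = ".csv")

# na = "" writes each NA as an empty field instead of the literal text "NA".
# Only the file contents change; the columns in R remain numeric.
write.csv(cleaned, out_file, na = "", row.names = FALSE)

# Inspect the raw lines that would be uploaded.
print(readLines(out_file))
```

Unlike assigning empty strings inside the data frame, this leaves the column types untouched, so the cleaning and analysis in R are unaffected.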
