Dealing with real numbers wrapped in quotes in ff library for R

【Title】: Dealing with real numbers wrapped in quotes in ff library for R 【Posted】: 2018-06-18 18:56:01 【Question】:

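I'm trying to explore the 2017 HMDA data. The flat file is about 9GB, available here. The CSV is too large to read into memory, so I tried using the ff library. However, I get an error when I try to read the file.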

> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'a real', got '"83.9800033569336"'

When I scan only the first 1,000 rows, the error goes away; it first appears somewhere between row 1,000 and row 10,000:

> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv', nrow = 1000)

I also tried specifying all the column classes, but that returns an error too:

> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv',
                      nrow = 10000,
                      colClasses = c('real', 'real', 'integer', 'real', 'integer', 
                                     'integer', 'integer', 'integer', 'integer', 
                                     'factor', 'factor', 'character', 'character', 
                                     'factor', 'factor', 'factor', 'factor', 'factor', 
                                     'factor', 'factor', 'factor', 'factor', 'factor', 
                                     'factor', 'factor', 'factor', 'factor', 'factor', 
                                     'factor', 'factor', 'factor', 'factor', 'factor', 
                                     'factor', 'character', 'integer', 'character', 
                                     'factor', 'factor', 'factor', 'factor', 'factor', 
                                     'factor', 'factor', 'factor', 'factor', 'factor'))
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'a real', got '"63.5"'

When I change all the integer and real columns to character, I still get an error:

... Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, 
  : vmode 'character' not implemented

The only workaround that works is to specify colClasses = 'factor', which converts every column to a factor.

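Edit: the problem seems to lie in the original CSV file. Some values are wrapped in quotes and some are not. If I export the first 10,000 rows to a CSV and use read.csv(), it works as expected and the data types are numeric. But on that same subset, read.csv.ffdf() gives the error scan() expected 'a real', got '"63.5"'. So part of the problem is the CSV, but ffdf also isn't reading the CSV as expected.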
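The mixed quoting can be reproduced in base R without ff. The sketch below uses made-up rows modeled on the values in the error messages. It is an assumption, not something the post confirms, that read.csv.ffdf() fails for the same reason: once a column's class is fixed as numeric, scan() reads the field directly as a number and does not interpret the quotes.

```r
# Made-up sample modeled on the post: one bare value, one quoted value.
csv_text <- c(
  "tract_to_msamd_income,population",
  "63.5,7067",
  "\"83.9800033569336\",5429"
)

# With type guessing, quotes are stripped before the values are converted.
guessed <- read.csv(text = csv_text)
class(guessed$tract_to_msamd_income)

# With the class forced to numeric, the quoted field is expected to fail;
# the error is captured instead of stopping the script.
forced <- tryCatch(
  read.csv(text = csv_text, colClasses = c("numeric", "integer")),
  error = function(e) e
)
inherits(forced, "error")
```

If this assumption holds, it would also explain why colClasses = 'factor' works: factor columns are read as text first, so the quotes are handled.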

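Since read.csv() works, I tried chunking the file into 15 separate data frames of 1,000,000 rows each. However, it keeps freezing when it reaches the 11th file, probably because it loads everything into memory just to find row 11,000,000.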
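One way to avoid rescanning the file for every chunk is to keep a single connection open and let each read.csv() call continue from where the previous one stopped. The following is a minimal sketch, not the approach used in the post; read_csv_in_chunks and handle_chunk are hypothetical names, and column types are guessed per chunk, so a column could come back as a different type in different chunks.

```r
# Hypothetical helper (not from the original post): read a large CSV in
# fixed-size chunks through one open connection, so each chunk resumes where
# the previous one stopped instead of skipping rows from the top of the file.
read_csv_in_chunks <- function(path, chunk_rows, handle_chunk) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first read consumes the header line and gives us the column names.
  chunk <- read.csv(con, nrows = chunk_rows, stringsAsFactors = FALSE)
  col_names <- names(chunk)
  total <- 0L

  repeat {
    # Stop when the previous read failed or produced no rows.
    if (is.null(chunk) || nrow(chunk) == 0) break
    total <- total + nrow(chunk)
    handle_chunk(chunk)
    # A short chunk means the end of the file was reached.
    if (nrow(chunk) < chunk_rows) break

    # Later reads have no header line, so the names are supplied explicitly.
    # Any error here is treated as "no more rows" (a simplification).
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_rows, header = FALSE,
               col.names = col_names, stringsAsFactors = FALSE),
      error = function(e) NULL
    )
  }
  total
}
```

A call such as read_csv_in_chunks('hmda_lar.csv', 1000000, function(chunk) { ... }) would then hand each block of rows to the callback, which could clean it and write it to its own file, with only one chunk held in memory at a time.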

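So the question is: how do you get ff to handle real numbers that are inconsistently wrapped in quotes? Or how do you clean the raw data to remove the quotes? Or how do you chunk the data in a RAM-efficient way?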

FYI, here is the head of the data:

> glimpse(hmda.ff[,])
Observations: 14,285,496
Variables: 47
$ tract_to_msamd_income          <fct> 63.5, 238.1199951171875, 38.189998626708984, 132.32000732421875, 87.5, 138.16000366210938, 98.43000030517578, 93.04000091552...
$ rate_spread                    <fct> , , , , , , , , , , , , , , , , 01.85, , , , 03.92, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ population                     <fct> 7067, 5429, 6869, 3835, 1960, 7120, 1828, 4643, 16372, 2977, 9630, 3298, 4487, 3324, 4099, 3835, 6003, 5187, 4818, 5849, 422...
$ minority_population            <fct> 72.08000183105469, 6.559999942779541, 30.719999313354492, 65.73999786376953, 55.459999084472656, 23.309999465942383, 13.5699...
$ number_of_owner_occupied_units <fct> 1201, 1611, 236, 1027, 407, 2037, 615, 854, 3292, 317, 3052, 1104, 617, 1099, 1409, 1027, 1122, 1638, 1495, 1508, 1187, 1700...
$ number_of_1_to_4_family_units  <fct> 1303, 1807, 794, 1141, 601, 2431, 725, 1936, 5286, 1174, 3188, 1175, 1120, 1404, 1522, 1141, 1520, 2162, 1989, 2080, 1421, 2...
$ loan_amount_000s               <fct> 400, 525, 225, 621, 181, 70, 123, 5, 100, 34, 302, 680, 108, 99, 100, 100, 171, 443, 420, 50, 75, 361, 179, 338, 300, 544, 3...
$ hud_median_family_income       <fct> 107600, 77500, 61800, 75200, 50000, 68800, 79600, 75200, 58400, 70800, 79600, 107600, 79300, 83900, 108300, 75200, 63200, 72...
$ applicant_income_000s          <fct> 90, 300, , 255, 109, 238, 84, 75, 44, 195, 62, 159, 50, 84, 70, 124, 80, 264, 177, 214, 181, 57, 86, 157, 64, 96, , 30, 50, ...
$ state_name                     <fct> Virginia, Illinois, Michigan, California, California, South Carolina, Michigan, California, Florida, Pennsylvania, Michigan,...
$ state_abbr                     <fct> VA, IL, MI, CA, CA, SC, MI, CA, FL, PA, MI, VA, CA, CO, CT, CA, CA, WI, NY, CA, CA, CA, NE, VA, NY, CA, CA, FL, SC, CA, VA, ...
$ sequence_number                <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ respondent_id                  <fct> 7442300004, 0000852218, 0000146672, 0000852218, 86-0860478, 0000617677, 7197000003, 0000504713, 39-2001010, 3027509990, 0000...
$ purchaser_type_name            <fct> Life insurance company, credit union, mortgage bank, or finance company, Loan was not originated or was not sold in calendar...
$ property_type_name             <fct> One-to-four family dwelling (other than manufactured housing), One-to-four family dwelling (other than manufactured housing)...
$ preapproval_name               <fct> Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicab...
$ owner_occupancy_name           <fct> Owner-occupied as a principal dwelling, Owner-occupied as a principal dwelling, Not owner-occupied as a principal dwelling, ...
$ msamd_name                     <fct> Washington, Arlington, Alexandria - DC, VA, MD, WV, Chicago, Naperville, Arlington Heights - IL, Kalamazoo, Portage - MI, Sa...
$ loan_type_name                 <fct> Conventional, Conventional, Conventional, Conventional, FHA-insured, Conventional, FHA-insured, Conventional, Conventional, ...
$ loan_purpose_name              <fct> Home purchase, Refinancing, Home purchase, Refinancing, Home purchase, Home improvement, Refinancing, Home improvement, Home...
$ lien_status_name               <fct> Secured by a first lien, Secured by a first lien, Secured by a first lien, Secured by a first lien, Secured by a first lien,...
$ hoepa_status_name              <fct> Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan...
$ edit_status_name               <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ denial_reason_name_3           <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Insufficient cash (downpayment, closing costs), , , , , ...
$ denial_reason_name_2           <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Debt-to-income ratio, , Debt-to-income ratio, , , , , , ...
$ denial_reason_name_1           <fct> , , , Debt-to-income ratio, , Credit history, , Credit history, , Credit application incomplete, , , , , , , , , , , , Credi...
$ county_name                    <fct> Fairfax County, Cook County, Kalamazoo County, Sacramento County, Fresno County, Charleston County, Macomb County, Sacrament...
$ co_applicant_sex_name          <fct> Male, No co-applicant, No co-applicant, Female, Female, Information not provided by applicant in mail, Internet, or telephon...
$ co_applicant_race_name_5       <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ co_applicant_race_name_4       <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ co_applicant_race_name_3       <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ co_applicant_race_name_2       <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ co_applicant_race_name_1       <fct> White, No co-applicant, No co-applicant, White, Asian, Information not provided by applicant in mail, Internet, or telephone...
$ co_applicant_ethnicity_name    <fct> Not Hispanic or Latino, No co-applicant, No co-applicant, Not Hispanic or Latino, Not Hispanic or Latino, Information not pr...
$ census_tract_number            <fct> 4522.00, 8198.01, 0015.07, 0093.30, 0049.02, 0046.09, 2515.00, 0018.00, 0432.04, 0007.00, 2234.00, 4703.00, 0017.00, 0102.10...
$ as_of_year                     <fct> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017...
$ application_date_indicator     <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ applicant_sex_name             <fct> Female, Male, Male, Male, Male, Information not provided by applicant in mail, Internet, or telephone application, Female, F...
$ applicant_race_name_5          <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ applicant_race_name_4          <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ applicant_race_name_3          <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ applicant_race_name_2          <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ applicant_race_name_1          <fct> White, White, White, White, Asian, Information not provided by applicant in mail, Internet, or telephone application, White,...
$ applicant_ethnicity_name       <fct> Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Info...
$ agency_name                    <fct> Department of Housing and Urban Development, Consumer Financial Protection Bureau, Consumer Financial Protection Bureau, Con...
$ agency_abbr                    <fct> HUD, CFPB, CFPB, CFPB, HUD, CFPB, HUD, CFPB, FDIC, HUD, CFPB, FRS, CFPB, NCUA, CFPB, NCUA, HUD, FRS, HUD, CFPB, NCUA, HUD, C...
$ action_taken_name              <fct> Loan originated, Loan originated, Loan originated, Application denied by financial institution, Loan originated, Application...

【Answer 1】:

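I created a function that converts the data from factor to numeric. For some reason, the following two statements behave differently when working with the virtual data frame in ff: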

hmda[1] <- as.numeric(paste0(hmda[1]))
hmda$first_col <- as.numeric(paste0(hmda$first_col))

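The first line returns a bunch of NAs (though quite inconsistently), while the second actually works as expected. So here is the script that works: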
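A plausible explanation for the difference, shown here on an ordinary data frame with made-up values rather than on the ff object: single-bracket indexing returns a one-column data frame (a list), so paste0() collapses the whole column into one string built from the factor's integer codes, whereas $ returns the factor itself and paste0() yields its labels.

```r
# Made-up factor column holding numeric-looking labels.
df <- data.frame(x = factor(c("63.5", "238.12", "38.19")))

# df[1] is a one-column data frame, so paste0() gives a single string that
# describes the integer codes; converting it to numeric yields NA.
by_bracket <- suppressWarnings(as.numeric(paste0(df[1])))

# df$x is the factor itself, so paste0() gives one label per row.
by_dollar <- as.numeric(paste0(df$x))

# Converting a factor directly returns its integer codes, not its labels,
# which is why the detour through text is needed at all.
codes <- as.numeric(df$x)
```

as.numeric(as.character(df$x)) is the more common spelling of the second form and gives the same result.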

library(ff)

# function that converts all numeric-looking fields to numeric
hmda_cleanup <- function(hmda) {
  hmda$tract_to_msamd_income <- as.numeric(paste0(hmda$tract_to_msamd_income))
  hmda$rate_spread <- as.numeric(paste0(hmda$rate_spread))
  hmda$population <- as.numeric(paste0(hmda$population))
  hmda$minority_population <- as.numeric(paste0(hmda$minority_population))
  hmda$number_of_owner_occupied_units <- as.numeric(paste0(hmda$number_of_owner_occupied_units))
  hmda$number_of_1_to_4_family_units <- as.numeric(paste0(hmda$number_of_1_to_4_family_units))
  hmda$loan_amount_000s <- as.numeric(paste0(hmda$loan_amount_000s))
  hmda$hud_median_family_income <- as.numeric(paste0(hmda$hud_median_family_income))
  hmda$applicant_income_000s <- as.numeric(paste0(hmda$applicant_income_000s))
  hmda$as_of_year <- as.numeric(paste0(hmda$as_of_year))
  return(hmda)
}

# read in large csv with all values as factors
hmda.ff <- read.csv.ffdf(file ='hmda_lar_2017.csv', 
                         colClasses = 'factor')

# pull the whole ffdf into an ordinary in-memory data frame
hmda.ff.df <- hmda.ff[,]

# run user-defined function on the data
hmda.ff.df <- hmda_cleanup(hmda.ff.df)
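The per-column assignments in hmda_cleanup can also be written as a loop over column names. This is a sketch on made-up data; convert_factor_columns is a hypothetical helper, not part of ff, and it uses as.character() in place of paste0(), which gives the same labels for a factor.

```r
# Hypothetical helper: convert the named factor columns to numeric by going
# through their text labels; all other columns are left untouched.
convert_factor_columns <- function(df, cols) {
  for (col in cols) {
    df[[col]] <- as.numeric(as.character(df[[col]]))
  }
  df
}

# Made-up sample shaped like a few of the HMDA columns.
sample_df <- data.frame(
  population = factor(c("7067", "5429", "6869")),
  rate_spread = factor(c("", "01.85", "")),
  state_abbr = factor(c("VA", "IL", "MI"))
)

# Empty strings become NA; state_abbr stays a factor.
cleaned <- convert_factor_columns(sample_df, c("population", "rate_spread"))
str(cleaned)
```

Like the script above, this operates on a data frame that is already in memory, so it does not reduce the RAM needed for hmda.ff[,].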
