Dealing with real numbers wrapped in quotes in the ff library for R
Posted: 2018-06-18 18:56:01
Question: I'm trying to explore the 2017 HMDA data. The flat file is about 9 GB and is available here. The CSV is too large to read into memory, so I tried the ff library. However, I get an error when I try to read the file.
> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'a real', got '"83.9800033569336"'
When I scan only the first 1,000 rows, the error goes away, so it first appears somewhere between row 1,000 and row 10,000:
> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv', nrow = 1000)
I also tried specifying all of the column classes, but that returns an error as well:
> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv',
nrow = 10000,
colClasses = c('real', 'real', 'integer', 'real', 'integer',
'integer', 'integer', 'integer', 'integer',
'factor', 'factor', 'character', 'character',
'factor', 'factor', 'factor', 'factor', 'factor',
'factor', 'factor', 'factor', 'factor', 'factor',
'factor', 'factor', 'factor', 'factor', 'factor',
'factor', 'factor', 'factor', 'factor', 'factor',
'factor', 'character', 'integer', 'character',
'factor', 'factor', 'factor', 'factor', 'factor',
'factor', 'factor', 'factor', 'factor', 'factor'))
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'a real', got '"63.5"'
When I change all of the integer and real columns to character, I still get an error:
... Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered,
: vmode 'character' not implemented
The only workable solution was to specify colClasses = 'factor', which converts every column to a factor.
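EDIT: The problem seems to lie in the original CSV file. Some values are wrapped in quotes and some are not. If I export the first 10,000 rows to a CSV and use read.csv(), it works as expected and the data types are numeric. But on that same subset, read.csv.ffdf() gives the error scan() expected 'a real', got '"63.5"'. So part of the problem is the CSV, but ffdf also does not read the CSV as expected.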
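The same scan() error can be reproduced in base R without ff, which may help narrow things down. The sketch below uses two made-up rows (one quoted, one not). It shows that read.csv() copes when it infers the type itself, fails when the column is forced to numeric, and works again when the column is read as character and converted afterwards. Whether read.csv.ffdf() forces numeric types on later chunks in the same way is an assumption here, not something checked against the ff source.

```r
# Made-up sample: a header plus one quoted and one unquoted value.
csv_text <- c('tract_to_msamd_income', '"63.5"', '238.12')

# Default read: fields are read as text (quotes stripped), then the type is inferred.
inferred <- read.csv(text = csv_text)

# Forcing the column to numeric makes scan() parse the raw field,
# quotes included, so this call is expected to fail.
forced <- tryCatch(
  read.csv(text = csv_text, colClasses = "numeric"),
  error = function(e) e
)

# Workaround: read as character, then convert explicitly.
as_text <- read.csv(text = csv_text, colClasses = "character")
converted <- as.numeric(as_text$tract_to_msamd_income)
```

Since ff has no character vmode, the character step would have to happen outside ff, or the column would have to be read as a factor and converted later.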
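Because read.csv() works, I tried chunking the file into 15 separate data frames of 1,000,000 rows each. However, it keeps freezing when it reaches the 11th file, probably because it loads everything into memory just to find row 11,000,000.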
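One way to chunk without re-reading the file from the top each time is to keep a single connection open and pull a fixed number of lines per iteration. The following is a minimal sketch, not a tested solution for the HMDA file: process_csv_in_chunks and handle_chunk are hypothetical names, and it assumes no quoted field contains an embedded line break.

```r
# Hypothetical helper: stream a CSV through read.csv() in fixed-size chunks.
# Assumes every record occupies exactly one line of the file.
process_csv_in_chunks <- function(path, chunk_rows, handle_chunk) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)
  header <- readLines(con, n = 1)            # keep the header for every chunk
  chunk_index <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)  # continues where the last read stopped
    if (length(lines) == 0) break            # end of file
    chunk_index <- chunk_index + 1
    chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    handle_chunk(chunk, chunk_index)
  }
  invisible(chunk_index)
}

# Usage on a small temporary file with inconsistently quoted numbers.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,income", '1,"63.5"', "2,238.12", '3,"38.19"', "4,87.5", "5,98.43"), tmp)
row_counts <- integer(0)
income_total <- 0
n_chunks <- process_csv_in_chunks(tmp, 2, function(chunk, i) {
  row_counts[i] <<- nrow(chunk)
  income_total <<- income_total + sum(chunk$income)
})
```

Only one chunk is held in memory at a time, so handle_chunk should aggregate or write each chunk out rather than accumulate the raw rows.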
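So the question is: how do you get ff to handle real numbers that are inconsistently wrapped in quotes? Or how do you clean the raw data to remove the quotes? Or how do you chunk the data in a RAM-efficient way?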
FYI, here is the head of the data:
> glimpse(hmda.ff[,])
Observations: 14,285,496
Variables: 47
$ tract_to_msamd_income <fct> 63.5, 238.1199951171875, 38.189998626708984, 132.32000732421875, 87.5, 138.16000366210938, 98.43000030517578, 93.04000091552...
$ rate_spread <fct> , , , , , , , , , , , , , , , , 01.85, , , , 03.92, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ population <fct> 7067, 5429, 6869, 3835, 1960, 7120, 1828, 4643, 16372, 2977, 9630, 3298, 4487, 3324, 4099, 3835, 6003, 5187, 4818, 5849, 422...
$ minority_population <fct> 72.08000183105469, 6.559999942779541, 30.719999313354492, 65.73999786376953, 55.459999084472656, 23.309999465942383, 13.5699...
$ number_of_owner_occupied_units <fct> 1201, 1611, 236, 1027, 407, 2037, 615, 854, 3292, 317, 3052, 1104, 617, 1099, 1409, 1027, 1122, 1638, 1495, 1508, 1187, 1700...
$ number_of_1_to_4_family_units <fct> 1303, 1807, 794, 1141, 601, 2431, 725, 1936, 5286, 1174, 3188, 1175, 1120, 1404, 1522, 1141, 1520, 2162, 1989, 2080, 1421, 2...
$ loan_amount_000s <fct> 400, 525, 225, 621, 181, 70, 123, 5, 100, 34, 302, 680, 108, 99, 100, 100, 171, 443, 420, 50, 75, 361, 179, 338, 300, 544, 3...
$ hud_median_family_income <fct> 107600, 77500, 61800, 75200, 50000, 68800, 79600, 75200, 58400, 70800, 79600, 107600, 79300, 83900, 108300, 75200, 63200, 72...
$ applicant_income_000s <fct> 90, 300, , 255, 109, 238, 84, 75, 44, 195, 62, 159, 50, 84, 70, 124, 80, 264, 177, 214, 181, 57, 86, 157, 64, 96, , 30, 50, ...
$ state_name <fct> Virginia, Illinois, Michigan, California, California, South Carolina, Michigan, California, Florida, Pennsylvania, Michigan,...
$ state_abbr <fct> VA, IL, MI, CA, CA, SC, MI, CA, FL, PA, MI, VA, CA, CO, CT, CA, CA, WI, NY, CA, CA, CA, NE, VA, NY, CA, CA, FL, SC, CA, VA, ...
$ sequence_number <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ respondent_id <fct> 7442300004, 0000852218, 0000146672, 0000852218, 86-0860478, 0000617677, 7197000003, 0000504713, 39-2001010, 3027509990, 0000...
$ purchaser_type_name <fct> Life insurance company, credit union, mortgage bank, or finance company, Loan was not originated or was not sold in calendar...
$ property_type_name <fct> One-to-four family dwelling (other than manufactured housing), One-to-four family dwelling (other than manufactured housing)...
$ preapproval_name <fct> Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicab...
$ owner_occupancy_name <fct> Owner-occupied as a principal dwelling, Owner-occupied as a principal dwelling, Not owner-occupied as a principal dwelling, ...
$ msamd_name <fct> Washington, Arlington, Alexandria - DC, VA, MD, WV, Chicago, Naperville, Arlington Heights - IL, Kalamazoo, Portage - MI, Sa...
$ loan_type_name <fct> Conventional, Conventional, Conventional, Conventional, FHA-insured, Conventional, FHA-insured, Conventional, Conventional, ...
$ loan_purpose_name <fct> Home purchase, Refinancing, Home purchase, Refinancing, Home purchase, Home improvement, Refinancing, Home improvement, Home...
$ lien_status_name <fct> Secured by a first lien, Secured by a first lien, Secured by a first lien, Secured by a first lien, Secured by a first lien,...
$ hoepa_status_name <fct> Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan...
$ edit_status_name <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ denial_reason_name_3 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Insufficient cash (downpayment, closing costs), , , , , ...
$ denial_reason_name_2 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Debt-to-income ratio, , Debt-to-income ratio, , , , , , ...
$ denial_reason_name_1 <fct> , , , Debt-to-income ratio, , Credit history, , Credit history, , Credit application incomplete, , , , , , , , , , , , Credi...
$ county_name <fct> Fairfax County, Cook County, Kalamazoo County, Sacramento County, Fresno County, Charleston County, Macomb County, Sacrament...
$ co_applicant_sex_name <fct> Male, No co-applicant, No co-applicant, Female, Female, Information not provided by applicant in mail, Internet, or telephon...
$ co_applicant_race_name_5 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ co_applicant_race_name_4 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ co_applicant_race_name_3 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ co_applicant_race_name_2 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ co_applicant_race_name_1 <fct> White, No co-applicant, No co-applicant, White, Asian, Information not provided by applicant in mail, Internet, or telephone...
$ co_applicant_ethnicity_name <fct> Not Hispanic or Latino, No co-applicant, No co-applicant, Not Hispanic or Latino, Not Hispanic or Latino, Information not pr...
$ census_tract_number <fct> 4522.00, 8198.01, 0015.07, 0093.30, 0049.02, 0046.09, 2515.00, 0018.00, 0432.04, 0007.00, 2234.00, 4703.00, 0017.00, 0102.10...
$ as_of_year <fct> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017...
$ application_date_indicator <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ applicant_sex_name <fct> Female, Male, Male, Male, Male, Information not provided by applicant in mail, Internet, or telephone application, Female, F...
$ applicant_race_name_5 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ applicant_race_name_4 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ applicant_race_name_3 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ applicant_race_name_2 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ applicant_race_name_1 <fct> White, White, White, White, Asian, Information not provided by applicant in mail, Internet, or telephone application, White,...
$ applicant_ethnicity_name <fct> Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Info...
$ agency_name <fct> Department of Housing and Urban Development, Consumer Financial Protection Bureau, Consumer Financial Protection Bureau, Con...
$ agency_abbr <fct> HUD, CFPB, CFPB, CFPB, HUD, CFPB, HUD, CFPB, FDIC, HUD, CFPB, FRS, CFPB, NCUA, CFPB, NCUA, HUD, FRS, HUD, CFPB, NCUA, HUD, C...
$ action_taken_name <fct> Loan originated, Loan originated, Loan originated, Application denied by financial institution, Loan originated, Application...
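Answer 1: I created a function that converts the data from factors to numeric. For some reason, these two statements behave differently on the virtual data frame in ff: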
hmda[1] <- as.numeric(paste0(hmda[1]))
hmda$first_col <- as.numeric(paste0(hmda$first_col))
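The first statement returns a bunch of NAs (though very inconsistently), while the second actually works as expected. So here is the script that worked: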
library(ff)
# convert the numeric-looking columns (read in as factors) to numeric
hmda_cleanup <- function(hmda) {
  hmda$tract_to_msamd_income <- as.numeric(paste0(hmda$tract_to_msamd_income))
  hmda$rate_spread <- as.numeric(paste0(hmda$rate_spread))
  hmda$population <- as.numeric(paste0(hmda$population))
  hmda$minority_population <- as.numeric(paste0(hmda$minority_population))
  hmda$number_of_owner_occupied_units <- as.numeric(paste0(hmda$number_of_owner_occupied_units))
  hmda$number_of_1_to_4_family_units <- as.numeric(paste0(hmda$number_of_1_to_4_family_units))
  hmda$loan_amount_000s <- as.numeric(paste0(hmda$loan_amount_000s))
  hmda$hud_median_family_income <- as.numeric(paste0(hmda$hud_median_family_income))
  hmda$applicant_income_000s <- as.numeric(paste0(hmda$applicant_income_000s))
  hmda$as_of_year <- as.numeric(paste0(hmda$as_of_year))
  return(hmda)
}
# read in the large csv with all values as factors
hmda.ff <- read.csv.ffdf(file = 'hmda_lar_2017.csv',
                         colClasses = 'factor')
# hmda.ff[,] pulls the whole ffdf into an ordinary in-memory data.frame
hmda.ff.df <- hmda.ff[,]
# run the user-defined function on the data
hmda.ff.df <- hmda_cleanup(hmda.ff.df)
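The difference between indexing with [1] and with $ can be reproduced on an ordinary data.frame, which is what hmda.ff[,] returns. A small sketch with made-up values follows; whether an ffdf object itself behaves identically is an assumption. With [1] the result is still a one-column data.frame, so paste0() collapses the whole column into a single string that cannot be parsed as a number. With $ the result is the factor itself, so its labels are converted element by element.

```r
# Made-up factor column that mixes numeric-looking labels and an empty string.
df <- data.frame(income = factor(c("63.5", "238.12", "", "87.5")))

# df[1] is a one-column data.frame: paste0() yields ONE string for the
# whole column, and as.numeric() of that string is NA.
by_index <- suppressWarnings(as.numeric(paste0(df[1])))

# df$income is the factor itself: convert its labels, not its integer codes.
by_name <- as.numeric(as.character(df$income))

# Equivalent idiom that converts each level only once.
by_levels <- as.numeric(levels(df$income))[df$income]
```

Empty strings become NA in both of the working variants, which matches the blank entries visible in columns such as rate_spread.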