Request for help converting an odd data frame in R
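【Posted】: 2018-10-08 19:13:33 【Question】: I received some data from a company I work with, but I can't figure out how to convert it into a wide format that I can use for analysis. The data frame has 15,800,000 rows and only 5 variables. However, the 4th and 5th variables hold the name of, and the response to, one of the (roughly 90) variables I have to work with. To complicate things further, the questions were asked more than once, so there are multiple answers.
In addition, when a question has more than one possible response, the response spills over onto the next line (see below).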
id date answer_instance pdl_variable_name answer_option
1 25839 2014-02-01 4 discretspend (25228) 14
2 25839 2014-02-05 11 legal_services (25495) [99]
3 25839 2014-12-07 6 comppen_company (706) [97]
4 25837 2014-12-15 2 Affluence_V2_P_2014 (34264) 8
5 25837 2015-01-20 5 study_qualification_children (35100) [98]
6 25837 2015-08-05 4 overall_debt (27281) [99]
7 25837 2015-09-03 3 benefits_received (25465) [98]
8 25834 2015-09-13 5 privpen_company (707) [96]
9 25834 2015-11-12 3 pocket_money_frequency (27076) 10
10 25835 2016-01-18 4 unemployment_status (21922) 6
11 25835 2016-02-05 8 legal_services (25495) [99]
12 25822 2016-02-11 3 assets_total_investable (26413) 3
13 25822 2016-03-03 2 disability_benefits_received (25055) [99]
14 25822 2018-04-01 1 insurance_held_2018 (58085) [1
15 4]
16 25811 2018-04-13 1 insurance_held (615) [1
17 4 11 20]
18 25811 2018-04-26 2 profile_work_stat (25617) 5
Ideally, I would like to convert this into a long/wide format that I can use for analysis.
【Comments】:
Have a look at library(reshape2), in particular reshape2::dcast().
Without a reproducible example, it is hard to help.
【Answer 1】:
library(readr)
library(tidyverse)
library(splitstackshape)

# read the file line by line
txt <- read_lines(file = "file_path/test.txt")

# a continuation line ends with "]" but contains no "["; it belongs to the previous row
idx <- which(grepl("\\]\\s*$", txt) & !grepl("\\[", txt))
txt <- gsub("^\\d+\\s+", "", txt)  # remove the row number from each row
txt[idx - 1] <- paste(txt[idx - 1], trimws(txt[idx]))
txt <- txt[-c(1, idx)]  # drop the header and the continuation lines that were merged

# add a ";" separator between the columns of each row
txt <- gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+\\s+\\S+)\\s+(.*)", "\\1;\\2;\\3;\\4;\\5", txt)

# prepare the data for analysis
df <- data.frame(txt = txt, stringsAsFactors = FALSE) %>%
  cSplit("txt", sep = ";") %>%
  `colnames<-`(c("id", "date", "answer_instance", "pdl_variable_name", "answer_option")) %>%
  # strip the brackets only; the individual answers stay separated by spaces
  mutate(answer_option = trimws(gsub("\\[|\\]", "", answer_option))) %>%
  separate_rows(answer_option, sep = "\\s+")
df
The output is:
id date answer_instance pdl_variable_name answer_option
1 25839 2014-02-01 4 discretspend (25228) 14
2 25839 2014-02-05 11 legal_services (25495) 99
3 25839 2014-12-07 6 comppen_company (706) 97
4 25837 2014-12-15 2 Affluence_V2_P_2014 (34264) 8
5 25837 2015-01-20 5 study_qualification_children (35100) 98
6 25837 2015-08-05 4 overall_debt (27281) 99
7 25837 2015-09-03 3 benefits_received (25465) 98
8 25834 2015-09-13 5 privpen_company (707) 96
9 25834 2015-11-12 3 pocket_money_frequency (27076) 10
10 25835 2016-01-18 4 unemployment_status (21922) 6
11 25835 2016-02-05 8 legal_services (25495) 99
12 25822 2016-02-11 3 assets_total_investable (26413) 3
13 25822 2016-03-03 2 disability_benefits_received (25055) 99
14 25822 2018-04-01 1 insurance_held_2018 (58085) 1
15 25822 2018-04-01 1 insurance_held_2018 (58085) 4
16 25811 2018-04-13 1 insurance_held (615) 1
17 25811 2018-04-13 1 insurance_held (615) 4
18 25811 2018-04-13 1 insurance_held (615) 11
19 25811 2018-04-13 1 insurance_held (615) 20
20 25811 2018-04-26 2 profile_work_stat (25617) 5
Sample data:
id date answer_instance pdl_variable_name answer_option
1 25839 2014-02-01 4 discretspend (25228) 14
2 25839 2014-02-05 11 legal_services (25495) [99]
3 25839 2014-12-07 6 comppen_company (706) [97]
4 25837 2014-12-15 2 Affluence_V2_P_2014 (34264) 8
5 25837 2015-01-20 5 study_qualification_children (35100) [98]
6 25837 2015-08-05 4 overall_debt (27281) [99]
7 25837 2015-09-03 3 benefits_received (25465) [98]
8 25834 2015-09-13 5 privpen_company (707) [96]
9 25834 2015-11-12 3 pocket_money_frequency (27076) 10
10 25835 2016-01-18 4 unemployment_status (21922) 6
11 25835 2016-02-05 8 legal_services (25495) [99]
12 25822 2016-02-11 3 assets_total_investable (26413) 3
13 25822 2016-03-03 2 disability_benefits_received (25055) [99]
14 25822 2018-04-01 1 insurance_held_2018 (58085) [1
15 4]
16 25811 2018-04-13 1 insurance_held (615) [1
17 4 11 20]
18 25811 2018-04-26 2 profile_work_stat (25617) 5
The data above is stored in test.txt.
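The line-merging step can be tried without the file. The following is a minimal base R sketch, assuming (as in the sample) that a response spills onto at most one extra line; the input vector is a hypothetical stand-in for a few lines of test.txt.

```r
# Hypothetical in-memory stand-in for a few lines of test.txt
txt <- c(
  "id date answer_instance pdl_variable_name answer_option",
  "13 25822 2016-03-03 2 disability_benefits_received (25055) [99]",
  "14 25822 2018-04-01 1 insurance_held_2018 (58085) [1",
  "15 4]",
  "16 25811 2018-04-13 1 insurance_held (615) [1",
  "17 4 11 20]"
)

# A continuation line closes a bracket that it did not open
idx <- which(grepl("\\]", txt) & !grepl("\\[", txt))

# Drop the leading row number, then glue each continuation onto the line before it
txt <- sub("^\\d+\\s+", "", txt)
txt[idx - 1] <- paste(txt[idx - 1], trimws(txt[idx]))

# Remove the header and the continuation lines that have been merged
merged <- txt[-c(1, idx)]
print(merged)
```

Each element of merged now holds one complete record, so the later column-splitting steps see the bracketed answers in a single field.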
【Comments】:
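【Answer 2】: The main problem with the data provided by the OP appears to be that a single record spills over onto the next line. Once the rows are arranged properly, the data can easily be converted into any form needed for analysis.
A positive lookahead for ], written ^(?=.*]), and a negative lookahead for [, written (?!.*\\[), are used to determine whether a line is a partial line that forms the second part of the previous one.
In the pdl_variable_name column, the space before ( is changed to _( so that read.table reads the value as a single column.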
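The lookahead test can be checked in isolation before running the full pipeline. This is a small sketch on hypothetical sample lines (taken from the shape of the OP's data); the vector name x and the variable is_partial are illustrative only.

```r
# Hypothetical sample lines: a complete row, a row that opens a bracket,
# and two spill-over lines
x <- c(
  "25839 2014-02-05 11 legal_services (25495) [99]",
  "25822 2018-04-01 1 insurance_held_2018 (58085) [1",
  "4]",
  "4 11 20]"
)

# TRUE only when the line contains "]" somewhere but "[" nowhere
is_partial <- grepl("^(?=.*\\])(?!.*\\[)", x, perl = TRUE)
print(is_partial)
```

Lookaheads need perl = TRUE, because R's default regular expression engine does not support them.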
library(tidyverse)
library(splitstackshape)

# Read the text file line by line
df_line <- data.frame(fileText = readLines("Answer.txt"), stringsAsFactors = FALSE)

tidy_text <- df_line %>%
  mutate(rn = row_number()) %>%  # row index, used to merge partial rows
  # a partial row contains "]" but no "[": give it the index of the previous row
  mutate(rn = ifelse(grepl("^(?=.*])(?!.*\\[)", fileText, perl = TRUE), lag(rn), rn)) %>%
  group_by(rn) %>%
  summarise(fileText = paste0(trimws(fileText), collapse = " ")) %>%
  ungroup() %>%
  mutate(fileText = gsub("\\s(\\()", "_\\1", fileText)) %>%  # " (" becomes "_("
  mutate(fileText = gsub("\\[|]", "\\'", fileText))  # [1 4] becomes '1 4'

# Join the prepared rows with "\n" so that read.table can parse them as a data frame
tidy_data <- read.table(text = paste0(trimws(tidy_text$fileText), collapse = "\n"), header = TRUE, stringsAsFactors = FALSE)

# Use cSplit to split the answers into multiple columns
tidy_data <- tidy_data %>%
  mutate(pdl_variable_name = gsub("_(\\()", " \\1", pdl_variable_name)) %>%
  cSplit("answer_option", sep = " ")
Result:
tidy_data
# id date answer_instance pdl_variable_name answer_option_1 answer_option_2 answer_option_3 answer_option_4
# 1: 25839 2014-02-01 4 discretspend (25228) 14 NA NA NA
# 2: 25839 2014-02-05 11 legal_services (25495) 99 NA NA NA
# 3: 25839 2014-12-07 6 comppen_company (706) 97 NA NA NA
# 4: 25837 2014-12-15 2 Affluence_V2_P_2014 (34264) 8 NA NA NA
# 5: 25837 2015-01-20 5 study_qualification_children (35100) 98 NA NA NA
# 6: 25837 2015-08-05 4 overall_debt (27281) 99 NA NA NA
# 7: 25837 2015-09-03 3 benefits_received (25465) 98 NA NA NA
# 8: 25834 2015-09-13 5 privpen_company (707) 96 NA NA NA
# 9: 25834 2015-11-12 3 pocket_money_frequency (27076) 10 NA NA NA
# 10: 25835 2016-01-18 4 unemployment_status (21922) 6 NA NA NA
# 11: 25835 2016-02-05 8 legal_services (25495) 99 NA NA NA
# 12: 25822 2016-02-11 3 assets_total_investable (26413) 3 NA NA NA
# 13: 25822 2016-03-03 2 disability_benefits_received (25055) 99 NA NA NA
# 14: 25822 2018-04-01 1 insurance_held_2018 (58085) 1 4 NA NA
# 15: 25811 2018-04-13 1 insurance_held (615) 1 4 11 20
# 16: 25811 2018-04-26 2 profile_work_stat (25617) 5 NA NA NA
Raw data:
The contents of answer.txt as provided by the OP:
id date answer_instance pdl_variable_name answer_option
25839 2014-02-01 4 discretspend (25228) 14
25839 2014-02-05 11 legal_services (25495) [99]
25839 2014-12-07 6 comppen_company (706) [97]
25837 2014-12-15 2 Affluence_V2_P_2014 (34264) 8
25837 2015-01-20 5 study_qualification_children (35100) [98]
25837 2015-08-05 4 overall_debt (27281) [99]
25837 2015-09-03 3 benefits_received (25465) [98]
25834 2015-09-13 5 privpen_company (707) [96]
25834 2015-11-12 3 pocket_money_frequency (27076) 10
25835 2016-01-18 4 unemployment_status (21922) 6
25835 2016-02-05 8 legal_services (25495) [99]
25822 2016-02-11 3 assets_total_investable (26413) 3
25822 2016-03-03 2 disability_benefits_received (25055) [99]
25822 2018-04-01 1 insurance_held_2018 (58085) [1
4]
25811 2018-04-13 1 insurance_held (615) [1
4 11 20]
25811 2018-04-26 2 profile_work_stat (25617) 5
【Comments】:
【Answer 3】: If we ignore the formatting of the rows with multiple answers, you can dcast your data like this:
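library(data.table)

# df is assumed to be the cleaned data frame, one row per answer
dt <- data.table(df)

# one column per pdl_variable_name; add id to the left-hand side of the
# formula if each respondent must stay on a separate row
dt.wide <- dcast(
  formula = date + answer_instance ~ pdl_variable_name,
  data = dt,
  value.var = "answer_option"
)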
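When a question does have several answers, dcast falls back to counting them unless an aggregation function is supplied. The sketch below is one possible way to keep the values, assuming a hypothetical long-format table named long and adding id to the formula; it collapses repeated answers into one space-separated cell.

```r
library(data.table)

# Hypothetical long-format rows, one row per selected answer
long <- data.table(
  id = rep(25811L, 5),
  date = c(rep("2018-04-13", 4), "2018-04-26"),
  answer_instance = c(1L, 1L, 1L, 1L, 2L),
  pdl_variable_name = c(rep("insurance_held (615)", 4), "profile_work_stat (25617)"),
  answer_option = c("1", "4", "11", "20", "5")
)

# Collapse repeated answers into a single cell instead of counting them;
# cells with no answer are filled with NA
wide <- dcast(
  long,
  id + date + answer_instance ~ pdl_variable_name,
  value.var = "answer_option",
  fun.aggregate = function(x) paste(x, collapse = " "),
  fill = NA_character_
)
print(wide)
```

Whether a collapsed string or one indicator column per option is more useful depends on the analysis; this sketch only shows the former.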
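For the questions with multiple options, if you want to use R you need to ask for the data in a format that can be read into a data.frame.
Spreading a single cell over multiple lines is not a good way to exchange data. If that is the only way the data can be delivered, you can ask for the values to be enclosed in quotes.
Since your file is fairly large, I recommend data.table, which has the fast fread function.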
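As an illustration of why quoting helps, here is a minimal sketch that assumes a hypothetical comma-separated export in which each multi-value answer arrives as one quoted field. The file layout and the names csv_text and long are assumptions, not something the OP's supplier is known to provide.

```r
library(data.table)

# Hypothetical export in which multi-value answers arrive as one quoted field
csv_text <- paste(
  'id,date,answer_instance,pdl_variable_name,answer_option',
  '25822,2018-04-01,1,insurance_held_2018 (58085),"1 4"',
  '25811,2018-04-13,1,insurance_held (615),"1 4 11 20"',
  sep = "\n"
)

# fread() normally takes a file path; text = is used here only to keep the
# sketch self-contained
dt <- fread(text = csv_text, sep = ",")

# Expand to one row per selected option
long <- dt[, .(answer_option = unlist(strsplit(answer_option, " "))),
           by = .(id, date, answer_instance, pdl_variable_name)]
print(long)
```

With the quotes in place, every record stays on one line and no line-merging step is needed before reading.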
【Comments】: