Combining rows + Concatenating values for a large dataset (Converting SQL export to multivalued)
Posted: 2014-08-11 09:41:15
Question:
I have a SQL export of 5+ million records in CSV format. I want to combine the rows that share the same PDP_ID field and concatenate their values from one column into a new column.
I am using the following functions, but they take far too long to run and do not seem to make any progress:
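library(plyr)  # provides aaply
PDP_ID <- unique(data$PDP_ID)
# For one ID, scan the whole data frame and join its DETAIL_NUMBER values with "@"
getDetailNumbers <- function(i)(paste(data$DETAIL_NUMBER[data$PDP_ID==i],collapse="@"))
DETAIL_NUMBERS <- aaply(PDP_ID,1,getDetailNumbers,.expand=FALSE,.progress="text")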
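Once I have a (PDP_ID, DETAIL_NUMBERS) data frame, my plan is to merge it with the original data frame.
PDP_ID (the vector of unique IDs) contains about 4.1 million entries. What is the fastest way to handle this? Splitting the file? The "data" data frame is sorted by PDP_ID. I also tried the snowfall package to use two CPU cores, but to no avail.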
Sample data:
"PDP_ID","STREETNAME_DUTCH","ACTUAL_BOX_NUMBER","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12
231313,"Street two",15
231313,"Street two",17
467626,"a third entry",1
467626,"a third entry",2
638676,"another which wont be combined",
Desired result:
"PDP_ID","STREETNAME_DUTCH","ACTUAL_BOX_NUMBER","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12@15@17
467626,"a third entry",1@2
638676,"another which wont be combined",
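For context on why the code in the question stalls: getDetailNumbers compares the full PDP_ID column against a single ID, and it is called once per unique ID, so millions of rows are rescanned millions of times. A minimal base R sketch of doing the grouping in one pass instead, assuming a small made-up data frame in place of the real export (the name detail_numbers is only illustrative):

```r
# Toy data standing in for the real export (values copied from the sample above)
data <- data.frame(
  PDP_ID        = c(111115L, 231313L, 231313L, 231313L, 467626L, 467626L),
  DETAIL_NUMBER = c(NA, 12L, 15L, 17L, 1L, 2L)
)

# tapply groups the whole column once, instead of filtering the data per ID
detail_numbers <- tapply(data$DETAIL_NUMBER, data$PDP_ID, paste, collapse = "@")

# Turn the named result into the (PDP_ID, DETAIL_NUMBERS) data frame the question wants
result <- data.frame(
  PDP_ID         = as.integer(names(detail_numbers)),
  DETAIL_NUMBERS = as.vector(detail_numbers),
  stringsAsFactors = FALSE
)
print(result)
```

This is only meant to show the difference in approach; on millions of rows a dedicated grouping package is likely to be faster than tapply.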
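Comments:
Please provide some sample data and the desired output.
Thanks, I added sample data and the desired result.
I don't understand your data. You have 4 column names but only 3 columns.
Answer 1:
Your data is a bit odd: it has 4 column names but only 3 columns, so I removed one column name.
Anyway, use data.table, which should be fast.
First, your data: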
df <- read.csv(text = '"PDP_ID","STREETNAME_DUTCH","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12
231313,"Street two",15
231313,"Street two",17
467626,"a third entry",1
467626,"a third entry",2
638676,"another which wont be combined",')
Solution:
library(data.table)
setDT(df)[ , list(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
DETAIL_NUMBER = paste(DETAIL_NUMBER, collapse = "@")), by = PDP_ID]
Result:
# PDP_ID STREETNAME_DUTCH DETAIL_NUMBER
# 1: 111115 An entry which wont be combined NA
# 2: 231313 Street two 12@15@17
# 3: 467626 a third entry 1@2
# 4: 638676 another which wont be combined NA
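One detail worth noting: paste() converts a missing value to the string "NA", while the desired result in the question leaves the field empty for rows without a detail number. A minimal sketch of one way to get the empty field, assuming it is acceptable to drop missing values before concatenating (collapse_values is a hypothetical helper name, not part of any package):

```r
# Hypothetical helper: join the non-missing values of x with a separator.
# A group that contains only NA yields an empty string instead of "NA".
collapse_values <- function(x, sep = "@") {
  paste(x[!is.na(x)], collapse = sep)
}

collapse_values(c(12, 15, 17))  # "12@15@17"
collapse_values(NA_integer_)    # ""
```

In the grouped call it would take the place of the paste() expression, for example DETAIL_NUMBER = collapse_values(DETAIL_NUMBER).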
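Alternatively, you could try dplyr (also fast).
Important: detach the plyr package first, using detach("package:plyr", unload=TRUE), since plyr also defines summarise and can mask the dplyr version.
Solution: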
library(dplyr)
df %>%
group_by(PDP_ID) %>%
summarise(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
DETAIL_NUMBER = paste(DETAIL_NUMBER, collapse = "@"))
Result:
# Source: local data frame [4 x 3]
#
# PDP_ID STREETNAME_DUTCH DETAIL_NUMBER
# 1 111115 An entry which wont be combined NA
# 2 231313 Street two 12@15@17
# 3 467626 a third entry 1@2
# 4 638676 another which wont be combined NA
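Comments:
Thanks a lot, the dplyr example works as expected in a reasonable time! I let it run for about 20 minutes before trying dplyr. However, I did include about 10 additional columns in the setDT example (COL2 = COL2[1], etc.), which I had left out of the sample data for simplicity; I have now included them in the dplyr version as well.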
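The last comment mentions carrying about 10 extra columns by writing COL2 = COL2[1] for each one. A possible way to avoid listing them by hand in data.table is to take the first value of every carried column through .SD. This is a sketch under assumptions: COL2 and COL3 are made-up stand-ins for the real extra columns, and every carried column is assumed to be constant within a PDP_ID.

```r
library(data.table)

# Toy data; COL2 and COL3 are hypothetical stand-ins for the extra columns
dt <- data.table(
  PDP_ID           = c(111115L, 231313L, 231313L, 231313L, 467626L, 467626L),
  STREETNAME_DUTCH = c("An entry which wont be combined",
                       "Street two", "Street two", "Street two",
                       "a third entry", "a third entry"),
  COL2             = c("a", "b", "b", "b", "c", "c"),
  COL3             = c(1, 2, 2, 2, 3, 3),
  DETAIL_NUMBER    = c(NA, 12L, 15L, 17L, 1L, 2L)
)

# Every column except the grouping key and the column being concatenated
carry_cols <- setdiff(names(dt), c("PDP_ID", "DETAIL_NUMBER"))

result <- dt[, c(lapply(.SD, function(x) x[1L]),  # first value of each carried column
                 list(DETAIL_NUMBER = paste(DETAIL_NUMBER, collapse = "@"))),
             by = PDP_ID, .SDcols = carry_cols]
print(result)
```

Because .SDcols only restricts what .SD contains, DETAIL_NUMBER can still be referenced by name in the same expression.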