How can I merge two data frames based on a condition?

Posted 2023-03-29

Original title: How can I merge two dataframes based on a condition?
Asked: 2020-12-21 03:08:33

Question:

This is a follow-up to my question here.

Here is my transaction data:

data 

id          from    to          date        amount  
<int>       <fctr>  <fctr>      <date>      <dbl>
19521       6644    6934        2005-01-01  700.0
19524       6753    8456        2005-01-01  600.0
19523       9242    9333        2005-01-01  1000.0
…           …       …           …           …
1055597     9866    9736        2010-12-31  278.9
1053519     9868    8644        2010-12-31  242.8
1052790     9869    8399        2010-12-31  372.2

Now, for each account in the from column, I want to compute the total transaction amount it received over the last 6 months. To do this:

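library(dplyr)
library(data.table)  # attached last, so its between() is the one used below

df <- as.data.table(data)  # df is an independent copy of "data"

# for each receiving account ("to"), sum the amounts received in the
# 180 days up to and including each transaction date
df[, total_trx_amount_received_in_last_6month := sapply(date, function(x)
       sum(amount[between(date, x - 180, x)])), by = to]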

# since I want to merge "df" and "data" based on the columns "from" and "date", I change the name of the column "to" and make it "from"
df <- select(df, to,date,total_trx_amount_received_in_last_6month) %>% rename(from=to)

df

from    date        total_trx_amount_received_in_last_6month
<fctr>  <date>      <dbl>
7468    2005-01-04  700.0       
6213    2005-01-08  12032.0     
7517    2005-01-10  1000.0      
6143    2005-01-12  4976.0      
6254    2005-01-14  200.0       
6669    2005-01-20  200.0       
6934    2005-01-24  72160.0     
9240    2005-01-26  21061.0     
6374    2005-01-30  1000.0      
6143    2005-01-31  4989.4
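
To make the 180-day window concrete, here is a minimal sketch of the same per-account rolling sum on made-up data. The table `toy` and the column `received_180d` are hypothetical names, not part of the original post, and plain comparisons stand in for `between()`.

```r
library(data.table)

# Hypothetical toy transactions (not from the original post)
toy <- data.table(
  id     = 1:4,
  from   = c("A", "B", "C", "A"),
  to     = c("X", "X", "Y", "X"),
  date   = as.Date(c("2005-01-01", "2005-03-01", "2005-03-15", "2005-12-01")),
  amount = c(100, 50, 70, 30)
)

# For every row, sum the amounts its receiving account ("to") got in the
# 180 days up to and including that row's date
toy[, received_180d := sapply(date, function(x)
      sum(amount[date >= x - 180 & date <= x])), by = to]

print(toy)
```

Account X receives 100 on 2005-01-01 and 50 on 2005-03-01, so its second row sums both; by 2005-12-01 both have fallen out of the window and only the 30 received that day remains.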

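Now I want to add this new column, total_trx_amount_received_in_last_6month, to the original data. So I need to merge the two data frames data and df on the from and date columns, but the matching criterion for date is a range of values rather than a single value. For example, for account 7468: if the original data contains a transaction from 7468 whose date falls in the interval "2004-07-08" to "2005-01-04" (that is, the 6-month period counting back from "2005-01-04"), then the corresponding value 700.0 from df$total_trx_amount_received_in_last_6month should be added to data$total_trx_amount_received_in_last_6month.

How can I do this?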

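Comments:

It looks like you "factorized" several data columns but failed to recognize that error.

I don't really understand what you mean. What error are you talking about?

The from and to columns are both factors. It seems pretty clear you want them to be dates.

Answer 1:

I don't have enough data to test this, but you can join the two data frames and then replace total_trx_amount_received_in_last_6month with NA wherever the difference between the two dates is greater than 180 days.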

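library(dplyr)

# After the join, date.x is the date from "data" and date.y is the date from "df".
# Only differences above 180 days are blanked; negative differences are left as-is.
data %>%
  left_join(df, by = 'from') %>%
  mutate(total_trx_amount_received_in_last_6month = replace(
    total_trx_amount_received_in_last_6month,
    (date.y - date.x) > 180, NA))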
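
As a rough illustration of what this join produces, the sketch below runs the same pattern on made-up inputs. `data_toy`, `df_toy`, and `joined` are hypothetical names, and the date difference is wrapped in `as.numeric()` only to make the comparison in days explicit.

```r
library(dplyr)

# Hypothetical toy inputs (not from the original post)
data_toy <- tibble(
  id   = 1:3,
  from = c("A", "A", "B"),
  date = as.Date(c("2005-01-01", "2004-01-01", "2005-02-01"))
)
df_toy <- tibble(
  from = c("A", "B"),
  date = as.Date(c("2005-01-04", "2005-02-10")),
  total_trx_amount_received_in_last_6month = c(700, 1000)
)

joined <- data_toy %>%
  left_join(df_toy, by = "from") %>%   # the clashing "date" columns become date.x / date.y
  mutate(total_trx_amount_received_in_last_6month = replace(
    total_trx_amount_received_in_last_6month,
    as.numeric(date.y - date.x) > 180, NA))

print(joined)
```

Here the second row lies more than 180 days before the matching date in `df_toy`, so its value becomes NA while the other two rows keep theirs. Note that joining on from alone pairs every row of one table with every same-account row of the other, which is where large inputs can become expensive.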

With data.table, you can do:

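library(data.table)
setDT(data)
setDT(df)  # no-op if df is already a data.table

# join on "from" only: "date" is df's date, "i.date" is the date coming from data
df1 <- df[data, on = 'from']

# blank out the value where df's date is more than 180 days after data's date
df1[, total_trx_amount_received_in_last_6month := replace(
  total_trx_amount_received_in_last_6month,
  (date - i.date) > 180, NA)]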
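
A different technique from the one in this answer, a data.table non-equi update join, can express the date range directly in the join condition instead of joining on from and filtering afterwards. The following is only a sketch on made-up data: `data_toy`, `df_toy`, `win_start`, and `win_end` are hypothetical names, and it assumes that when several windows contain a transaction date, keeping the last match is acceptable.

```r
library(data.table)

# Hypothetical toy inputs (not from the original post)
data_toy <- data.table(
  id   = 1:3,
  from = c("A", "A", "B"),
  date = as.Date(c("2005-01-01", "2004-01-01", "2005-02-01"))
)
df_toy <- data.table(
  from = c("A", "B"),
  date = as.Date(c("2005-01-04", "2005-02-10")),
  total_trx_amount_received_in_last_6month = c(700, 1000)
)

# Each df_toy row covers the 180 days up to and including its own date
df_toy[, `:=`(win_start = date - 180, win_end = date)]

# Update join: copy the value onto every data_toy row with the same "from"
# whose date falls inside the window; rows without a match stay NA
data_toy[df_toy,
         on = .(from, date >= win_start, date <= win_end),
         total_trx_amount_received_in_last_6month :=
           i.total_trx_amount_received_in_last_6month]

print(data_toy)
```

Because only matching pairs are produced, this may need less memory than materialising every same-account pair first, though that has not been measured on the data from the question.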

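Comments:

It should be less than 180. But it doesn't work. I guess something is wrong, because the computation never finishes and it gives a memory allocation error: Error: cannot allocate vector of size 1.2 Gb

The code keeps all values where the difference is less than 180 and turns the others into NA. The error means you don't have enough memory for processing this large; see this post ***.com/questions/5171593/… . I have updated the answer with a data.table solution, check whether that helps.

It gives the error: object 'i.date' not found. Also, shouldn't date be included, as in df1 <- df[data, on = c('from','date')]?

I don't think merging the two data sets like this gives a correct result, because some information may be lost. I posted it as a new question, could you take a look? @Ronak Shah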
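
One possible explanation for the 'i.date' not found error, sketched on made-up data: in a data.table join `x[i, on = ...]`, a column of `i` gets the `i.` prefix only when it clashes with a column of `x` and is not itself a join key. The tables `x` and `i` and the results `by_from` and `by_both` are hypothetical names used for illustration.

```r
library(data.table)

# Hypothetical toy tables (not from the original post)
x <- data.table(
  from  = c("A", "B"),
  date  = as.Date(c("2005-01-04", "2005-02-10")),
  total = c(700, 1000)
)
i <- data.table(
  id   = 1:2,
  from = c("A", "B"),
  date = as.Date(c("2005-01-01", "2005-02-01"))
)

by_from <- x[i, on = "from"]              # "date" clashes, so i's copy is named i.date
by_both <- x[i, on = c("from", "date")]   # "date" is a join key, so no i.date column

print(names(by_from))
print(names(by_both))
```

So if the join is changed to use both from and date as keys, a later expression that refers to `i.date` has nothing to refer to.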
