R Dataframe filtering: Using unique or duplicate function based on time factor

【Title】: R Dataframe filtering: Using unique or duplicate function based on time factor 【Posted】: 2021-11-17 21:21:47 【Question】:

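I'm trying to filter a data frame of loan data. Each monthly report repeats a loan for as long as it is still outstanding and drops it once it has been paid off, so I can't simply use the latest monthly report. For each lender, I want to find the unique loan maturity dates, remove the duplicates, and keep only the rows with the latest reporting date. Here is a sample of the data: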

df <- data.frame(Reporting.date=c("6/30/2020","6/30/2020","6/30/2020","8/31/2021","8/31/2021"
                                  ,"8/31/2021","6/30/2020","7/31/2021","5/31/2020","12/31/2020")
                 , Lender.name=c("Lender1","Lender1","Lender1","Lender1","Lender1","Lender1"
                                 ,"Lender1","Lender1","Lender2","Lender2")
                 , Date.of.maturity=c("6/20/2025","6/20/2025","6/20/2025","6/20/2025","6/20/2025"
                                      ,"6/20/2025","6/30/2022","6/30/2022","5/15/2024","5/15/2024")
                 , Loan.amount=c(13129474,14643881,44935677,13129474,14643881,44935677
                                 ,150000,150000,2750000,2750000))

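As the sample data shows, Lender1 has 2 unique maturity dates. The first maturity date has 3 loans that are repeated across 2 reporting dates, and the second has 1 loan that is repeated. I want to drop the duplicates and keep the data from the latest report. I expect to end up with a data frame that looks like this: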

Reporting.date Lender.name Date.of.maturity Loan.amount
8/31/2021 Lender1 6/20/2025 13129474
8/31/2021 Lender1 6/20/2025 14643881
8/31/2021 Lender1 6/20/2025 44935677
7/31/2021 Lender1 6/30/2022 150000
12/31/2020 Lender2 5/15/2024 2750000

【Question comments】:

【Answer 1】:

You need to convert Reporting.date to a date format first, either in mutate (as I did here) or directly inside filter.

library(tidyverse)

df %>%
  mutate(Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')) %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  filter(Reporting.date == max(Reporting.date)) %>%
  ungroup()
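
Answer 1 notes that the conversion can also happen directly inside filter. A minimal sketch of that variant, assuming the question's sample data; `to_date` is a hypothetical helper introduced here for illustration and is not part of the original answer. Because the conversion happens only inside the comparison, Reporting.date stays a character column in the result.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Reporting.date = c("6/30/2020", "6/30/2020", "6/30/2020", "8/31/2021", "8/31/2021",
                     "8/31/2021", "6/30/2020", "7/31/2021", "5/31/2020", "12/31/2020"),
  Lender.name = c(rep("Lender1", 8), "Lender2", "Lender2"),
  Date.of.maturity = c(rep("6/20/2025", 6), "6/30/2022", "6/30/2022",
                       "5/15/2024", "5/15/2024"),
  Loan.amount = c(13129474, 14643881, 44935677, 13129474, 14643881, 44935677,
                  150000, 150000, 2750000, 2750000)
)

# Hypothetical helper: parse month/day/year strings into Date values
to_date <- function(x) as.Date(x, format = "%m/%d/%Y")

latest <- df %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  # Compare parsed dates, but leave the stored column as character
  filter(to_date(Reporting.date) == max(to_date(Reporting.date))) %>%
  ungroup()

latest
```

Note that `==` against the group maximum keeps every row that ties for the latest date, so a loan reported twice on the same day would appear twice.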

【Comments】:

【Answer 2】:

We can also achieve this with arrange:

library(dplyr)
library(lubridate)
df %>%
  arrange(Lender.name, Date.of.maturity, Loan.amount, 
         desc(mdy(Reporting.date))) %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  slice_head(n = 1) %>%
  ungroup()

-output

# A tibble: 5 x 4
  Reporting.date Lender.name Date.of.maturity Loan.amount
  <chr>          <chr>       <chr>                  <dbl>
1 8/31/2021      Lender1     6/20/2025           13129474
2 8/31/2021      Lender1     6/20/2025           14643881
3 8/31/2021      Lender1     6/20/2025           44935677
4 7/31/2021      Lender1     6/30/2022             150000
5 12/31/2020     Lender2     5/15/2024            2750000
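
The arrange plus slice_head pattern can probably be collapsed into a single step with `slice_max`, which is available in the same dplyr releases that provide `slice_head`. This is a sketch under that assumption, using the question's sample data and base `as.Date` instead of lubridate. Setting `with_ties = FALSE` guarantees one row per group even if two reports share the latest date.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Reporting.date = c("6/30/2020", "6/30/2020", "6/30/2020", "8/31/2021", "8/31/2021",
                     "8/31/2021", "6/30/2020", "7/31/2021", "5/31/2020", "12/31/2020"),
  Lender.name = c(rep("Lender1", 8), "Lender2", "Lender2"),
  Date.of.maturity = c(rep("6/20/2025", 6), "6/30/2022", "6/30/2022",
                       "5/15/2024", "5/15/2024"),
  Loan.amount = c(13129474, 14643881, 44935677, 13129474, 14643881, 44935677,
                  150000, 150000, 2750000, 2750000)
)

latest <- df %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  # Keep the single row with the most recent parsed reporting date per group
  slice_max(as.Date(Reporting.date, format = "%m/%d/%Y"),
            n = 1, with_ties = FALSE) %>%
  ungroup()

latest
```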

【Comments】:

【Answer 3】:

A base R option using subset, transform and ave -

subset(transform(df, Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')), 
       Reporting.date == ave(Reporting.date, Lender.name, Date.of.maturity, FUN = max))

#   Reporting.date Lender.name Date.of.maturity Loan.amount
#4      2021-08-31     Lender1        6/20/2025    13129474
#5      2021-08-31     Lender1        6/20/2025    14643881
#6      2021-08-31     Lender1        6/20/2025    44935677
#8      2021-07-31     Lender1        6/30/2022      150000
#10     2020-12-31     Lender2        5/15/2024     2750000
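
Since the question title asks about unique or duplicated, here is a hedged base R sketch that uses `duplicated()` directly: sort the rows so the newest report comes first, then drop every later occurrence of the same loan. Treating the combination of Lender.name, Date.of.maturity and Loan.amount as the loan identifier is an assumption borrowed from the grouping in the dplyr answers; the names `parsed`, `sorted`, `keys` and `latest` are illustrative only.

```r
# Sample data from the question
df <- data.frame(
  Reporting.date = c("6/30/2020", "6/30/2020", "6/30/2020", "8/31/2021", "8/31/2021",
                     "8/31/2021", "6/30/2020", "7/31/2021", "5/31/2020", "12/31/2020"),
  Lender.name = c(rep("Lender1", 8), "Lender2", "Lender2"),
  Date.of.maturity = c(rep("6/20/2025", 6), "6/30/2022", "6/30/2022",
                       "5/15/2024", "5/15/2024"),
  Loan.amount = c(13129474, 14643881, 44935677, 13129474, 14643881, 44935677,
                  150000, 150000, 2750000, 2750000)
)

# Parse the reporting dates and sort newest first
parsed <- as.Date(df$Reporting.date, format = "%m/%d/%Y")
sorted <- df[order(parsed, decreasing = TRUE), ]

# Columns assumed to identify one loan
keys <- c("Lender.name", "Date.of.maturity", "Loan.amount")

# duplicated() flags every occurrence after the first, so the newest row survives
latest <- sorted[!duplicated(sorted[keys]), ]

latest
```

Unlike the `ave` call above, which takes the maximum per lender and maturity date, this variant decides per individual loan amount as well, so a loan missing from the newest report would still be kept with its own latest row.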

【Comments】:
