R Dataframe filtering: Using unique or duplicate function based on time factor
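【Posted】: 2021-11-17 21:21:47 【Question】: I am trying to filter a dataframe of loan data. In each monthly report a loan is repeated if it is still outstanding and dropped once it has been paid, so I can't just use the latest monthly report. I would like to filter for the unique maturity dates of the loans by lender, drop the duplicates, and keep only the latest data by reporting date. Here is a sample of the data: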
df <- data.frame(Reporting.date=c("6/30/2020","6/30/2020","6/30/2020","8/31/2021","8/31/2021"
,"8/31/2021","6/30/2020","7/31/2021","5/31/2020","12/31/2020")
, Lender.name=c("Lender1","Lender1","Lender1","Lender1","Lender1","Lender1"
,"Lender1","Lender1","Lender2","Lender2")
, Date.of.maturity=c("6/20/2025","6/20/2025","6/20/2025","6/20/2025","6/20/2025"
,"6/20/2025","6/30/2022","6/30/2022","5/15/2024","5/15/2024")
, Loan.amount=c(13129474,14643881,44935677,13129474,14643881,44935677
,150000,150000,2750000,2750000))
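As the sample data shows, Lender1 has 2 unique maturity dates. The first maturity date has 3 loans that are duplicated across 2 reporting dates, and the second maturity date has 1 loan that is duplicated. I would like to drop the duplicates and keep the most recently reported data. I'm hoping to end up with a dataframe that looks like this: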
Reporting.date | Lender.name | Date.of.maturity | Loan.amount |
---|---|---|---|
8/31/2021 | Lender1 | 6/20/2025 | 13129474 |
8/31/2021 | Lender1 | 6/20/2025 | 14643881 |
8/31/2021 | Lender1 | 6/20/2025 | 44935677 |
7/31/2021 | Lender1 | 6/30/2022 | 150000 |
12/31/2020 | Lender2 | 5/15/2024 | 2750000 |
【Answer 1】: You need to convert Reporting.date to a date format, either with mutate (as I do here) or directly inside filter.
library(tidyverse)
df %>%
mutate(Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')) %>%
group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
filter(Reporting.date == max(Reporting.date)) %>%
ungroup()
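The second option mentioned in this answer, converting directly inside filter, is not shown in the original. A minimal sketch of how it might look, assuming the same sample data and the same three grouping columns (the name `latest` is only illustrative); Reporting.date stays a character column in the result:

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Reporting.date = c("6/30/2020", "6/30/2020", "6/30/2020", "8/31/2021", "8/31/2021",
                     "8/31/2021", "6/30/2020", "7/31/2021", "5/31/2020", "12/31/2020"),
  Lender.name = c(rep("Lender1", 8), rep("Lender2", 2)),
  Date.of.maturity = c(rep("6/20/2025", 6), "6/30/2022", "6/30/2022",
                       "5/15/2024", "5/15/2024"),
  Loan.amount = c(13129474, 14643881, 44935677, 13129474, 14643881, 44935677,
                  150000, 150000, 2750000, 2750000)
)

# Parse the date only for the comparison; the stored column is left untouched
latest <- df %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  filter(as.Date(Reporting.date, format = "%m/%d/%Y") ==
           max(as.Date(Reporting.date, format = "%m/%d/%Y"))) %>%
  ungroup()

latest
```

The trade-off is that the date is parsed twice per group, so the mutate version is likely preferable when the column will be used as a date again later.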
【Answer 2】: We can also do this with arrange
library(dplyr)
library(lubridate)
df %>%
arrange(Lender.name, Date.of.maturity, Loan.amount,
desc(mdy(Reporting.date))) %>%
group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
slice_head(n = 1) %>%
ungroup
-output
# A tibble: 5 x 4
Reporting.date Lender.name Date.of.maturity Loan.amount
<chr> <chr> <chr> <dbl>
1 8/31/2021 Lender1 6/20/2025 13129474
2 8/31/2021 Lender1 6/20/2025 14643881
3 8/31/2021 Lender1 6/20/2025 44935677
4 7/31/2021 Lender1 6/30/2022 150000
5 12/31/2020 Lender2 5/15/2024 2750000
【Answer 3】: A base R option using subset, transform and ave -
subset(transform(df, Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')),
Reporting.date == ave(Reporting.date, Lender.name, Date.of.maturity, FUN = max))
# Reporting.date Lender.name Date.of.maturity Loan.amount
#4 2021-08-31 Lender1 6/20/2025 13129474
#5 2021-08-31 Lender1 6/20/2025 14643881
#6 2021-08-31 Lender1 6/20/2025 44935677
#8 2021-07-31 Lender1 6/30/2022 150000
#10 2020-12-31 Lender2 5/15/2024 2750000
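Since the question title asks about unique or duplicated, here is a minimal base R sketch built on duplicated. It is not taken from any of the answers above; it assumes a loan is identified by Lender.name, Date.of.maturity and Loan.amount (the same key as in answers 1 and 2), and the names `parsed`, `sorted`, `key_cols` and `latest` are illustrative only:

```r
# Sample data from the question
df <- data.frame(
  Reporting.date = c("6/30/2020", "6/30/2020", "6/30/2020", "8/31/2021", "8/31/2021",
                     "8/31/2021", "6/30/2020", "7/31/2021", "5/31/2020", "12/31/2020"),
  Lender.name = c(rep("Lender1", 8), rep("Lender2", 2)),
  Date.of.maturity = c(rep("6/20/2025", 6), "6/30/2022", "6/30/2022",
                       "5/15/2024", "5/15/2024"),
  Loan.amount = c(13129474, 14643881, 44935677, 13129474, 14643881, 44935677,
                  150000, 150000, 2750000, 2750000)
)

# Sort so the newest report comes first
parsed <- as.Date(df$Reporting.date, format = "%m/%d/%Y")
sorted <- df[order(parsed, decreasing = TRUE), ]

# duplicated() flags every occurrence of a key after the first one,
# so negating it keeps only the newest row for each loan
key_cols <- c("Lender.name", "Date.of.maturity", "Loan.amount")
latest <- sorted[!duplicated(sorted[key_cols]), ]

latest
```

Unlike the filter-based approaches, this keeps exactly one row per key even if two reports share the same latest date.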