Pandas: Keep only the rows only if certain column value appears N times in past N months
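【Title】: Pandas: Keep only the rows only if certain column value appears N times in past N months
【Posted】: 2021-10-29 20:29:57
【Question】: I want to keep the rows in which a single recharge number (recharge_number) occurs more than 2 times within the last 3 months (month_n = 4 to 6). Assume the current month is 7. If a recharge number satisfies this condition, keep all information associated with that number.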
account_no recharge_number year month_n
52 1300002 2021 6
52 1300002 2021 5
52 1300002 2021 4
52 1300002 2021 1
52 1644460 2021 6
52 1644460 2021 5
52 1644460 2021 2
70 1553984 2020 12
70 1553984 2020 11
91 1915689 2021 6
91 1915689 2021 5
91 1915689 2021 4
91 1915689 2020 12
91 1915689 2020 11
91 1915689 2020 10
52 1300002 2020 9
Output:
account_no recharge_number year month_n
52 1300002 2021 6
52 1300002 2021 5
52 1300002 2021 4
91 1915689 2021 6
91 1915689 2021 5
91 1915689 2021 4
I tried it with the following code. Is this the right approach, or is there a better solution?
df = df.groupby(['account_no','recharge_number','year','month_n']).recharge_number.agg('count').to_frame('recharge_count').reset_index()
df[((df.month_n >=4) & (df.year ==2021) & (df.recharge_count>=3))]
【Comments】:
Why is the last row, with year 2020, kept?
@Chris, my bad! Edited now! Thanks.
【Answer 1】:
The idea is to use monthly periods, so you can simply subtract n periods and filter with Series.between:
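import pandas as pd

per = pd.Period('2021-07')
# both bounds are exclusive, so 2021-04, 2021-05 and 2021-06 are tested
prev = per - 4
print(prev)
2021-03
# build a datetime from year/month_n (day fixed to 1), then convert to monthly periods
dat = pd.to_datetime(df[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1))
df['per'] = dat.dt.to_period('M')
# inclusive='neither' needs pandas >= 1.3; older versions used inclusive=False
mask1 = df['per'].between(prev, per, inclusive='neither')
df = df[mask1]
# keep only groups with more than 2 rows inside the window
df = df[df.groupby(['account_no', 'recharge_number']).recharge_number.transform('size').gt(2)]
print(df)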
account_no recharge_number year month_n per
0 52 1300002 2021 6 2021-06
1 52 1300002 2021 5 2021-05
2 52 1300002 2021 4 2021-04
9 91 1915689 2021 6 2021-06
10 91 1915689 2021 5 2021-05
11 91 1915689 2021 4 2021-04
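The steps above can be collected into one self-contained sketch. The function name `keep_recent_frequent` and its parameters are hypothetical, chosen only for illustration; the sample frame is the data from the question, and pandas 1.3 or later is assumed for `inclusive='neither'`.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'account_no': [52] * 7 + [70] * 2 + [91] * 6 + [52],
    'recharge_number': [1300002] * 4 + [1644460] * 3 + [1553984] * 2
                       + [1915689] * 6 + [1300002],
    'year': [2021] * 7 + [2020] * 2 + [2021] * 3 + [2020] * 4,
    'month_n': [6, 5, 4, 1, 6, 5, 2, 12, 11, 6, 5, 4, 12, 11, 10, 9],
})


def keep_recent_frequent(frame, current, months, min_count):
    """Keep rows from the `months` periods before `current` whose
    (account_no, recharge_number) group has more than `min_count` rows there."""
    cur = pd.Period(current, freq='M')
    # Build month-start dates from the year/month_n columns.
    dates = pd.to_datetime(
        frame[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1)
    )
    per = dates.dt.to_period('M')
    # Both bounds are exclusive: for months=3 and 2021-07 this tests 2021-04..2021-06.
    in_window = per.between(cur - months - 1, cur, inclusive='neither')
    recent = frame[in_window]
    # Number of in-window rows per group, aligned to the rows of `recent`.
    size = recent.groupby(['account_no', 'recharge_number'])['month_n'].transform('size')
    return recent[size.gt(min_count)]


result = keep_recent_frequent(df, '2021-07', months=3, min_count=2)
print(result)
```

Passing the reference month and thresholds as arguments avoids hard-coding `per - 4`, so the same helper could be reused for other window lengths.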
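Edit: Here is an alternative that uses helper columns for the between masks and counts the matching rows with sum; finally the masks are chained with & for bitwise AND and | for bitwise OR: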
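cols = df.columns
per = pd.Period('2021-07')
# build monthly periods from year/month_n (day fixed to 1)
dat = pd.to_datetime(df[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1))
df['per'] = dat.dt.to_period('M')
# bounds are exclusive, so 2021-04, 2021-05 and 2021-06 are tested for prev3
# (inclusive='neither' needs pandas >= 1.3; older versions used inclusive=False)
df['prev3'] = df['per'].between(per - 4, per, inclusive='neither')
df['prev6'] = df['per'].between(per - 7, per, inclusive='neither')
# summing the boolean helper columns counts the matching rows per group
prev3_groups = df.groupby(['account_no', 'recharge_number']).prev3.transform('sum').gt(2)
prev6_groups = df.groupby(['account_no', 'recharge_number']).prev6.transform('sum').gt(5)
# keep rows that satisfy either rule and restore the original columns
df = df.loc[(df['prev3'] & prev3_groups) | (df['prev6'] & prev6_groups), cols]
print(df)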
account_no recharge_number year month_n
0 52 1300002 2021 6
1 52 1300002 2021 5
2 52 1300002 2021 4
9 91 1915689 2021 6
10 91 1915689 2021 5
11 91 1915689 2021 4
Another test data set:
print (df)
account_no recharge_number year month_n
0 52 1300002 2021 6
1 52 1644460 2021 5
2 52 1644460 2021 4
3 52 1644460 2021 1
4 52 1644460 2021 6
5 52 1644460 2021 5
6 52 1644460 2021 2
7 70 1553984 2020 12
8 70 1553984 2020 11
9 91 1915689 2021 6
10 91 1915689 2021 5
11 91 1915689 2021 4
12 91 1915689 2020 12
13 91 1915689 2020 11
14 91 1915689 2020 10
15 52 1300002 2020 9
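cols = df.columns
per = pd.Period('2021-07')
# build monthly periods from year/month_n (day fixed to 1)
dat = pd.to_datetime(df[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1))
df['per'] = dat.dt.to_period('M')
# bounds are exclusive, so 2021-04, 2021-05 and 2021-06 are tested for prev3
# (inclusive='neither' needs pandas >= 1.3; older versions used inclusive=False)
df['prev3'] = df['per'].between(per - 4, per, inclusive='neither')
df['prev6'] = df['per'].between(per - 7, per, inclusive='neither')
# summing the boolean helper columns counts the matching rows per group
prev3_groups = df.groupby(['account_no', 'recharge_number']).prev3.transform('sum').gt(2)
prev6_groups = df.groupby(['account_no', 'recharge_number']).prev6.transform('sum').gt(5)
# keep rows that satisfy either rule and restore the original columns
df = df.loc[(df['prev3'] & prev3_groups) | (df['prev6'] & prev6_groups), cols]
print(df)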
account_no recharge_number year month_n
1 52 1644460 2021 5
2 52 1644460 2021 4
3 52 1644460 2021 1
4 52 1644460 2021 6
5 52 1644460 2021 5
6 52 1644460 2021 2
9 91 1915689 2021 6
10 91 1915689 2021 5
11 91 1915689 2021 4
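The OR combination can be generalized to any number of (window, threshold) rules. The following is a minimal sketch, assuming pandas 1.3 or later; the helper `filter_by_rules` and the `rules` argument are hypothetical names, and the frame is the second test data set shown above.

```python
import pandas as pd

# Second test data set from the answer above.
df = pd.DataFrame({
    'account_no': [52] * 7 + [70] * 2 + [91] * 6 + [52],
    'recharge_number': [1300002] + [1644460] * 6 + [1553984] * 2
                       + [1915689] * 6 + [1300002],
    'year': [2021] * 7 + [2020] * 2 + [2021] * 3 + [2020] * 4,
    'month_n': [6, 5, 4, 1, 6, 5, 2, 12, 11, 6, 5, 4, 12, 11, 10, 9],
})


def filter_by_rules(frame, current, rules):
    """rules: iterable of (months, min_count) pairs. A row is kept if, for any
    rule, it lies in the `months` periods before `current` and its
    (account_no, recharge_number) group has more than `min_count` rows there."""
    cur = pd.Period(current, freq='M')
    dates = pd.to_datetime(
        frame[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1)
    )
    per = dates.dt.to_period('M')
    keys = [frame['account_no'], frame['recharge_number']]
    keep = pd.Series(False, index=frame.index)
    for months, min_count in rules:
        # Both bounds are exclusive, as in the answer above.
        in_window = per.between(cur - months - 1, cur, inclusive='neither')
        # Summing a boolean mask per group counts the matching rows.
        enough = in_window.groupby(keys).transform('sum').gt(min_count)
        keep |= in_window & enough
    return frame[keep]


result = filter_by_rules(df, '2021-07', rules=[(3, 2), (6, 5)])
print(result)
```

Because the masks are kept as standalone Series, no helper columns are added to the frame and the original columns do not need to be restored afterwards.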
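【Comments】:
Thanks. It works. Can I set an "or" condition here? For example, keep the rows of a recharge number (recharge_number) that occurs more than 2 times within the last 3 months (month_n = 4 to 6) or more than 5 times within the last 6 months (month_n = 1 to 6).
@asifabdullah - please give me some time.
【Answer 2】:
You can compute a time delta relative to your reference date and use groupby.filter and query to subset your rows: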
# make datetime
date = pd.to_datetime(df[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1))
# flag dates newer than 90 days before the reference (2021-07-01)
df['gt_3months'] = date.gt(pd.to_datetime('2021-07') - pd.Timedelta('90days'))
# group by account_no and filter
(df.groupby('account_no')
   .filter(lambda g: g['gt_3months'].sum() >= 2)  # check that there are 2 or more occurrences in the last 3 months
   .query('gt_3months == True')  # keep only the recent occurrences
   .drop('gt_3months', axis=1)  # drop temporary column
)
Output:
account_no recharge_number year month_n
0 52 1300002 2021 6
1 52 1300002 2021 5
4 52 1644460 2021 6
5 52 1644460 2021 5
9 91 1915689 2021 6
10 91 1915689 2021 5
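This output differs from the desired output in the question: the snippet groups by account_no alone, accepts 2 occurrences, and its 90-day cutoff falls on 2021-04-02, so rows dated 2021-04-01 are excluded. A hedged sketch of one possible variant follows; it groups by both keys, requires more than 2 occurrences, and uses a calendar-month cutoff through `pd.DateOffset`. The variable names `ref` and `recent` are illustrative, and the frame is the question's data.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'account_no': [52] * 7 + [70] * 2 + [91] * 6 + [52],
    'recharge_number': [1300002] * 4 + [1644460] * 3 + [1553984] * 2
                       + [1915689] * 6 + [1300002],
    'year': [2021] * 7 + [2020] * 2 + [2021] * 3 + [2020] * 4,
    'month_n': [6, 5, 4, 1, 6, 5, 2, 12, 11, 6, 5, 4, 12, 11, 10, 9],
})

ref = pd.Timestamp('2021-07-01')
date = pd.to_datetime(
    df[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1)
)
# True for dates in the half-open range [2021-04-01, 2021-07-01).
recent = date.ge(ref - pd.DateOffset(months=3)) & date.lt(ref)

result = (
    df.assign(recent=recent)
      .groupby(['account_no', 'recharge_number'])
      .filter(lambda g: g['recent'].sum() > 2)  # more than 2 recent occurrences
      .query('recent == True')                  # keep only the recent rows
      .drop(columns='recent')                   # drop the temporary column
)
print(result)
```

Subtracting `pd.DateOffset(months=3)` moves back by calendar months instead of a fixed number of days, which keeps the April rows inside the window regardless of month lengths.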