Pandas: Keep rows only if a column value appears N times in the past N months

Posted 2023-03-11


[Original title]: Pandas: Keep only the rows only if certain column value appears N times in past N months
[Published]: 2021-10-29 20:29:57
[Question]:

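I want to keep the rows where a single recharge number (recharge_number) appears more than 2 times in the last 3 months (month_n = 4 to 6). Assume the current month is 7. If a recharge number meets that condition, keep all the information associated with that number.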

account_no  recharge_number     year    month_n       
52          1300002            2021    6        
52          1300002            2021    5
52          1300002            2021    4
52          1300002            2021    1
52          1644460            2021    6
52          1644460            2021    5
52          1644460            2021    2
70          1553984            2020    12
70          1553984            2020    11
91          1915689            2021    6
91          1915689            2021    5
91          1915689            2021    4
91          1915689            2020    12
91          1915689            2020    11
91          1915689            2020    10
52          1300002            2020    9

Expected output:

account_no  recharge_number     year    month_n       
52          1300002            2021    6        
52          1300002            2021    5
52          1300002            2021    4
91          1915689            2021    6
91          1915689            2021    5
91          1915689            2021    4

I tried the following code. Is this the right approach, or is there a better solution?

df = df.groupby(['account_no','recharge_number','year','month_n']).recharge_number.agg('count').to_frame('recharge_count').reset_index()

df[((df.month_n >=4) & (df.year ==2021) & (df.recharge_count>=3))]

[Comments]:

Why is the last row, from 2020, kept?
@Chris, my mistake! Edited now, thanks!

[Answer 1]:

The idea is to work with monthly periods, so you can simply subtract n periods and filter with Series.between:

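import pandas as pd

per = pd.Period('2021-07')
# bounds are exclusive, so 2021-04, 2021-05 and 2021-06 are tested
prev = per - 4
print(prev)
2021-03

# build a datetime from year/month_n (first day of the month), then convert to monthly periods
dat = pd.to_datetime(df[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1))
df['per'] = dat.dt.to_period('M')

# inclusive='neither' needs pandas >= 1.3 (older versions used inclusive=False)
mask1 = df['per'].between(prev, per, inclusive='neither')
df = df[mask1]

# keep only (account_no, recharge_number) groups with more than 2 rows in the window
df = df[df.groupby(['account_no', 'recharge_number']).recharge_number.transform('size').gt(2)]
print(df)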
    account_no  recharge_number  year  month_n      per
0           52          1300002  2021        6  2021-06
1           52          1300002  2021        5  2021-05
2           52          1300002  2021        4  2021-04
9           91          1915689  2021        6  2021-06
10          91          1915689  2021        5  2021-05
11          91          1915689  2021        4  2021-04
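
As a self-contained check of the approach above, here is a minimal sketch that rebuilds the question's sample data and wraps the steps in a function. The name `keep_recent` and its parameters are hypothetical, introduced only for illustration. It uses plain comparisons on the period column instead of `Series.between`, which sidesteps the change in the `inclusive` argument across pandas versions.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "account_no": [52] * 7 + [70] * 2 + [91] * 6 + [52],
    "recharge_number": [1300002] * 4 + [1644460] * 3 + [1553984] * 2
                       + [1915689] * 6 + [1300002],
    "year": [2021] * 7 + [2020] * 2 + [2021] * 3 + [2020] * 3 + [2020],
    "month_n": [6, 5, 4, 1, 6, 5, 2, 12, 11, 6, 5, 4, 12, 11, 10, 9],
})


def keep_recent(df, current="2021-07", months=3, min_count=3):
    """Hypothetical helper: keep rows from the `months` months before `current`
    whose (account_no, recharge_number) pair occurs at least `min_count` times
    within that window."""
    per = pd.Period(current, freq="M")
    dates = pd.to_datetime(
        df[["year", "month_n"]].rename(columns={"month_n": "month"}).assign(day=1)
    )
    periods = dates.dt.to_period("M")
    # window is [current - months, current), i.e. the current month is excluded
    in_window = (periods >= per - months) & (periods < per)
    recent = df[in_window]
    counts = recent.groupby(["account_no", "recharge_number"])["recharge_number"].transform("size")
    return recent[counts >= min_count]


result = keep_recent(df)
print(result)
```

With the sample data this should keep the rows with index 0, 1, 2, 9, 10 and 11, the same rows as the expected output in the question.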

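EDIT: Here is an alternative that stores the between masks in helper columns and uses sum to count the True values (matching rows). The final masks are chained with & for bitwise AND and | for bitwise OR: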

cols = df.columns

per = pd.Period('2021-07')

dat = pd.to_datetime(df[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1))
df['per'] = dat.dt.to_period('M')

# bounds are exclusive, so 2021-04, 2021-05 and 2021-06 are tested for prev3
# inclusive='neither' needs pandas >= 1.3 (older versions used inclusive=False)
df['prev3'] = df['per'].between(per - 4, per, inclusive='neither')
df['prev6'] = df['per'].between(per - 7, per, inclusive='neither')

prev3_groups = df.groupby(['account_no','recharge_number']).prev3.transform('sum').gt(2)
prev6_groups = df.groupby(['account_no','recharge_number']).prev6.transform('sum').gt(5)

df = df.loc[(df['prev3'] & prev3_groups) | (df['prev6'] & prev6_groups), cols]
print (df)
    account_no  recharge_number  year  month_n
0           52          1300002  2021        6
1           52          1300002  2021        5
2           52          1300002  2021        4
9           91          1915689  2021        6
10          91          1915689  2021        5
11          91          1915689  2021        4

Another set of test data:

print (df)
    account_no  recharge_number  year  month_n
0           52          1300002  2021        6
1           52          1644460  2021        5
2           52          1644460  2021        4
3           52          1644460  2021        1
4           52          1644460  2021        6
5           52          1644460  2021        5
6           52          1644460  2021        2
7           70          1553984  2020       12
8           70          1553984  2020       11
9           91          1915689  2021        6
10          91          1915689  2021        5
11          91          1915689  2021        4
12          91          1915689  2020       12
13          91          1915689  2020       11
14          91          1915689  2020       10
15          52          1300002  2020        9

cols = df.columns

per = pd.Period('2021-07')

dat = pd.to_datetime(df[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1))
df['per'] = dat.dt.to_period('M')

# bounds are exclusive, so 2021-04, 2021-05 and 2021-06 are tested for prev3
# inclusive='neither' needs pandas >= 1.3 (older versions used inclusive=False)
df['prev3'] = df['per'].between(per - 4, per, inclusive='neither')
df['prev6'] = df['per'].between(per - 7, per, inclusive='neither')

prev3_groups = df.groupby(['account_no','recharge_number']).prev3.transform('sum').gt(2)
prev6_groups = df.groupby(['account_no','recharge_number']).prev6.transform('sum').gt(5)

df = df.loc[(df['prev3'] & prev3_groups) | (df['prev6'] & prev6_groups), cols]
print (df)
    account_no  recharge_number  year  month_n
1           52          1644460  2021        5
2           52          1644460  2021        4
3           52          1644460  2021        1
4           52          1644460  2021        6
5           52          1644460  2021        5
6           52          1644460  2021        2
9           91          1915689  2021        6
10          91          1915689  2021        5
11          91          1915689  2021        4
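
The two-window OR rule above can be generalized to any list of (window length, minimum count) pairs. The following is only a sketch under that assumption: `filter_by_rules` and its `rules` parameter are hypothetical names, not part of pandas or of the original answer. The data is the second test set shown above.

```python
import pandas as pd

# Second test data set from the answer
df = pd.DataFrame({
    "account_no": [52] * 7 + [70] * 2 + [91] * 6 + [52],
    "recharge_number": [1300002] + [1644460] * 6 + [1553984] * 2
                       + [1915689] * 6 + [1300002],
    "year": [2021] * 7 + [2020] * 2 + [2021] * 3 + [2020] * 3 + [2020],
    "month_n": [6, 5, 4, 1, 6, 5, 2, 12, 11, 6, 5, 4, 12, 11, 10, 9],
})


def filter_by_rules(df, current, rules, keys=("account_no", "recharge_number")):
    """Hypothetical helper: keep a row if, for at least one (months, min_count)
    rule, the row lies in the `months` months before `current` and its group
    has at least `min_count` rows in that window."""
    per = pd.Period(current, freq="M")
    dates = pd.to_datetime(
        df[["year", "month_n"]].rename(columns={"month_n": "month"}).assign(day=1)
    )
    periods = dates.dt.to_period("M")
    keep = pd.Series(False, index=df.index)
    for months, min_count in rules:
        # window is [current - months, current)
        in_window = (periods >= per - months) & (periods < per)
        # number of in-window rows per group, broadcast back to every row
        hits = in_window.groupby([df[k] for k in keys]).transform("sum")
        keep |= in_window & (hits >= min_count)
    return df[keep]


# more than 2 times in the last 3 months OR more than 5 times in the last 6 months
result = filter_by_rules(df, "2021-07", rules=[(3, 3), (6, 6)])
print(result)
```

On this data the sketch should select the rows with index 1 to 6 and 9 to 11, matching the output printed above. Passing thresholds as "at least" values (3 and 6) is a design choice here; the answer expresses the same conditions as `.gt(2)` and `.gt(5)`.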

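[Comments]:

Thanks, that works. Can I add an OR condition here? For example, keep the rows where a single recharge number (recharge_number) appears more than 2 times in the last 3 months (month_n = 4 to 6) or more than 5 times in the last 6 months (month_n = 1 to 6).
@asifabdullah - please give me some time.

[Answer 2]: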

You can compute a time delta relative to your reference date, then use groupby.filter and query to subset your rows:

# make datetime (first day of each month)
date = pd.to_datetime(df[['year', 'month_n']].rename(columns={'month_n': 'month'}).assign(day=1))

# check dates not older than 90 days from reference (2021-07)
# note: 2021-07-01 minus 90 days is 2021-04-02, so rows dated 2021-04-01 are excluded
df['gt_3months'] = date.gt(pd.to_datetime('2021-07')-pd.Timedelta('90days'))

# group by account_no and filter
(df.groupby('account_no')
   .filter(lambda g: g['gt_3months'].sum()>=2) # check that there are 2 or more occurrences in the last 3 months
   .query('gt_3months == True') # keep only the recent occurrences
   .drop('gt_3months', axis=1)  # drop temporary column
)

Output:

    account_no  recharge_number  year  month_n
0           52          1300002  2021        6
1           52          1300002  2021        5
4           52          1644460  2021        6
5           52          1644460  2021        5
9           91          1915689  2021        6
10          91          1915689  2021        5
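
This output differs from the expected output in the question: the 90-day cutoff falls on 2021-04-02, so the April rows are dropped, and the grouping is by account_no alone with a threshold of 2. The following is a possible variant, not the answer author's code, that assumes a calendar-month cutoff via `pd.DateOffset(months=3)`, groups by both account_no and recharge_number, and requires at least 3 occurrences. The column name `recent` is an arbitrary temporary name.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "account_no": [52] * 7 + [70] * 2 + [91] * 6 + [52],
    "recharge_number": [1300002] * 4 + [1644460] * 3 + [1553984] * 2
                       + [1915689] * 6 + [1300002],
    "year": [2021] * 7 + [2020] * 2 + [2021] * 3 + [2020] * 3 + [2020],
    "month_n": [6, 5, 4, 1, 6, 5, 2, 12, 11, 6, 5, 4, 12, 11, 10, 9],
})

# make datetime (first day of each month)
date = pd.to_datetime(
    df[["year", "month_n"]].rename(columns={"month_n": "month"}).assign(day=1)
)

ref = pd.Timestamp("2021-07-01")
cutoff = ref - pd.DateOffset(months=3)  # calendar months: 2021-04-01

# True for rows dated in [2021-04-01, 2021-07-01)
df["recent"] = date.ge(cutoff) & date.lt(ref)

result = (
    df.groupby(["account_no", "recharge_number"])
      .filter(lambda g: g["recent"].sum() >= 3)  # 3 or more occurrences in the window
      .loc[lambda d: d["recent"]]                # keep only the recent occurrences
      .drop(columns="recent")                    # drop temporary column
)
print(result)
```

Under these assumptions the result should contain the rows with index 0, 1, 2, 9, 10 and 11, which is what the question asks for.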
