How to groupby multiple columns and create a new column in Python based on thresholds
[Posted]: 2020-06-24 16:14:11
[Question]: I have a dataframe like the one below.
Input
Invoice No Date Text Vendor Days
1000001 1/1/2020 Rent Payment A 0
1000003 2/1/2020 Rent Payment A 1
1000005 4/1/2020 Rent Payment A 2
1000007 6/1/2020 Water payment A 2
1000008 9/2/2020 Rep Payment A 34
1000010 9/2/2020 Car Payment A 0
1000011 10/2/2020 Car Payment A 1
1000012 15/2/2020 Car Payment A 5
1000013 16/2/2020 Car Payment A 1
1000015 17/2/2020 Car Payment A 1
1000002 1/1/2020 Rent Payment B -47
1000004 4/1/2020 Con Payment B 3
1000006 6/1/2020 Con Payment B 2
1000009 9/2/2020 Water payment B 34
1000014 17/2/2020 Test Payment B 8
1000016 19/2/2020 Test Payment B 2
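Condition
How do I write a Python condition that checks the description, vendor name, and days columns: if the description and vendor name are the same and the days value is ...
Expected output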
Invoice No Date Text Vendor Days Group
1000001 1/1/2020 Rent Payment A 0 G1
1000003 2/1/2020 Rent Payment A 1 G1
1000005 4/1/2020 Rent Payment A 2 G1
1000007 6/1/2020 Water payment A 2 G2
1000008 9/2/2020 Rep Payment A 34 G3
1000010 9/2/2020 Car Payment A 0 G4
1000011 10/2/2020 Car Payment A 1 G4
1000012 15/2/2020 Car Payment A 5 G5
1000013 16/2/2020 Car Payment A 1 G5
1000015 17/2/2020 Car Payment A 1 G5
1000002 1/1/2020 Rent Payment B -47 G6
1000004 4/1/2020 Con Payment B 3 G7
1000006 6/1/2020 Con Payment B 2 G7
1000009 9/2/2020 Water payment B 34 G8
1000014 17/2/2020 Test Payment B 8 G9
1000016 19/2/2020 Test Payment B 2 G9
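[Answer 1]: You need to groupby on three things: 'Text', 'Vendor', and a key derived from a boolean flag marking whether 'Days' changes by more than 2 within the groups defined by ['Text', 'Vendor'] alone.
After that, you need to name the unique groups. Below I provide two ways of doing so.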
ngroup
f = lambda x: x.diff().fillna(0).gt(2).cumsum()
d = df.groupby(['Text', 'Vendor']).Days.transform(f)
g = df.groupby(['Text', 'Vendor', d], sort=False).ngroup()
df.assign(Group=g.add(1).astype(str).radd('G'))
Invoice No Date Text Vendor Days Group
0 1000001 1/1/2020 Rent Payment A 0 G1
1 1000003 2/1/2020 Rent Payment A 1 G1
2 1000005 4/1/2020 Rent Payment A 2 G1
3 1000007 6/1/2020 Water payment A 2 G2
4 1000008 9/2/2020 Rep Payment A 34 G3
5 1000010 9/2/2020 Car Payment A 0 G4
6 1000011 10/2/2020 Car Payment A 1 G4
7 1000012 15/2/2020 Car Payment A 5 G5
8 1000013 16/2/2020 Car Payment A 1 G5
9 1000015 17/2/2020 Car Payment A 1 G5
10 1000002 1/1/2020 Rent Payment B -47 G6
11 1000004 4/1/2020 Con Payment B 3 G7
12 1000006 6/1/2020 Con Payment B 2 G7
13 1000009 9/2/2020 Water payment B 34 G8
14 1000014 17/2/2020 Test Payment B 8 G9
15 1000016 19/2/2020 Test Payment B 2 G9
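To try the ngroup approach end to end, here is a minimal self-contained sketch. It rebuilds a subset of the question's rows by hand, and wraps the steps in a helper named `label_groups` with a `threshold` parameter; both the helper name and the parameter are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# A subset of the rows from the question (typed in by hand for this sketch).
df = pd.DataFrame({
    "Invoice No": [1000001, 1000003, 1000005, 1000007, 1000010,
                   1000011, 1000012, 1000013, 1000004, 1000006],
    "Text": ["Rent Payment"] * 3 + ["Water payment"]
            + ["Car Payment"] * 4 + ["Con Payment"] * 2,
    "Vendor": ["A"] * 8 + ["B"] * 2,
    "Days": [0, 1, 2, 2, 0, 1, 5, 1, 3, 2],
})


def label_groups(frame, threshold=2):
    """Hypothetical helper: the ngroup approach with a configurable threshold."""
    keys = ["Text", "Vendor"]
    # Within each (Text, Vendor) block, start a new sub-block whenever Days
    # rises by more than `threshold` relative to the previous row.
    jumps = frame.groupby(keys)["Days"].transform(
        lambda s: s.diff().fillna(0).gt(threshold).cumsum()
    )
    # Number the (Text, Vendor, sub-block) combinations in order of appearance.
    codes = frame.groupby(keys + [jumps], sort=False).ngroup()
    return frame.assign(Group="G" + (codes + 1).astype(str))


result = label_groups(df)
print(result)
```

`sort=False` matters here: without it, the group numbers would follow the sorted order of the keys rather than the order in which the groups first appear.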
factorize
f = lambda x: x.diff().fillna(0).gt(2).cumsum()
d = df.groupby(['Text', 'Vendor']).Days.transform(f)
g = pd.factorize([*zip(df.Text, df.Vendor, d)])[0]
df.assign(Group=[f'G{i + 1}' for i in g])
Invoice No Date Text Vendor Days Group
0 1000001 1/1/2020 Rent Payment A 0 G1
1 1000003 2/1/2020 Rent Payment A 1 G1
2 1000005 4/1/2020 Rent Payment A 2 G1
3 1000007 6/1/2020 Water payment A 2 G2
4 1000008 9/2/2020 Rep Payment A 34 G3
5 1000010 9/2/2020 Car Payment A 0 G4
6 1000011 10/2/2020 Car Payment A 1 G4
7 1000012 15/2/2020 Car Payment A 5 G5
8 1000013 16/2/2020 Car Payment A 1 G5
9 1000015 17/2/2020 Car Payment A 1 G5
10 1000002 1/1/2020 Rent Payment B -47 G6
11 1000004 4/1/2020 Con Payment B 3 G7
12 1000006 6/1/2020 Con Payment B 2 G7
13 1000009 9/2/2020 Water payment B 34 G8
14 1000014 17/2/2020 Test Payment B 8 G9
15 1000016 19/2/2020 Test Payment B 2 G9
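As a variation on the factorize approach, the row keys can be stored as a Series of tuples and factorized through the Series method. This is only a sketch on a hand-typed subset of the question's rows; the variable names `keys`, `codes`, and `out` are illustrative assumptions.

```python
import pandas as pd

# A few rows from the question, typed in by hand for this sketch.
df = pd.DataFrame({
    "Text": ["Car Payment"] * 5 + ["Test Payment"] * 2,
    "Vendor": ["A"] * 5 + ["B"] * 2,
    "Days": [0, 1, 5, 1, 1, 8, 2],
})

f = lambda x: x.diff().fillna(0).gt(2).cumsum()
d = df.groupby(["Text", "Vendor"])["Days"].transform(f)

# One hashable tuple per row; the Series keeps each tuple as a single object.
keys = pd.Series(list(zip(df["Text"], df["Vendor"], d)), index=df.index)

# factorize assigns integer codes in order of first appearance.
codes, uniques = keys.factorize()

# The f-string needs braces around the expression to interpolate it.
out = df.assign(Group=[f"G{i + 1}" for i in codes])
print(out)
```

Because factorize numbers values in the order they are first seen, no `sort=False` argument is needed to get appearance-ordered labels.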
Some details
# x.diff()    : adjacent differences
# .fillna(0)  : the first element of each group will get NaN, so we fill it in with 0
# .gt(2)      : should be obvious (True where the difference is greater than 2)
# .cumsum()   : cumulatively summing True/False will create a new value every
#               time we see a True. This creates groups
f = lambda x: x.diff().fillna(0).gt(2).cumsum()
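To make the chain concrete, the sketch below applies each step separately to the Days values of the 'Car Payment' rows for vendor A from the question. The intermediate names `step1` through `step4` are assumptions introduced only for this illustration.

```python
import pandas as pd

# Days for the 'Car Payment' / vendor 'A' rows from the question.
days = pd.Series([0, 1, 5, 1, 1], name="Days")

step1 = days.diff()       # difference from the previous row; first row is NaN
step2 = step1.fillna(0)   # treat the first row as "no change"
step3 = step2.gt(2)       # True where Days rose by more than 2
step4 = step3.cumsum()    # running count of True values labels the sub-blocks

print(pd.DataFrame({
    "Days": days,
    "diff": step1,
    "filled": step2,
    "jump": step3,
    "block": step4,
}))
```

Note that `diff` is signed, so only an increase of more than 2 starts a new block; a drop such as 8 to 2 does not. If decreases should also split groups, adding `.abs()` after `diff()` would be one option, though the thread does not say whether that is wanted.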
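[Discussion]:
@piSquared, there was a small error in the expected input and output I provided; I have just corrected it. I meant to ask for a check on the vendor, description, and days columns: if the vendor and description are the same and the difference in days between adjacent rows is ...
[Answer 2]: You can combine your conditions into a groupby and use ngroup.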
df['Group'] = 'G' + (df.groupby([df['Description'].ne(df['Description'].shift()).cumsum(),
                                 df['Vendor'].ne(df['Vendor'].shift()).cumsum(),
                                 df['Days'] <= 2]).ngroup() + 1).astype(str)
# same description : df['Description'].ne(df['Description'].shift()).cumsum()
# same vendor : df['Vendor'].ne(df['Vendor'].shift()).cumsum()
# Days<=2 : df['Days']<=2
Output:
Invoice No Date Description Vendor Days Group
0 123456 2020-01-01 Rent Payment A 0 G1
1 123457 2020-02-01 Rent Payment A 1 G1
2 123458 2020-04-01 Rent Payment A 2 G1
3 123459 2020-06-01 Water Payment A 2 G2
4 123460 2020-09-02 Rent Payment A 34 G3
5 123461 2020-09-02 Rep Payment A 0 G4
6 123462 2020-10-02 Rep Payment A 1 G4
7 123463 2020-11-02 Rep Payment A 2 G4
8 123464 2020-02-20 Water Payment A 11 G5
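The `ne(shift()).cumsum()` idiom used for the Description and Vendor keys labels consecutive runs of equal values. A minimal sketch on a made-up Description column (the values below are assumptions chosen for illustration, not the data from the thread):

```python
import pandas as pd

# Made-up descriptions; note that 'Rent Payment' appears in two separate runs.
desc = pd.Series([
    "Rent Payment", "Rent Payment", "Water Payment",
    "Rent Payment", "Rep Payment", "Rep Payment",
])

# True wherever the value differs from the previous row.
# The first row is compared with NaN, so it always counts as a change.
changed = desc.ne(desc.shift())

# Running count of changes: each consecutive run gets its own integer label.
run_id = changed.cumsum()

print(pd.DataFrame({"Description": desc, "changed": changed, "run_id": run_id}))
```

Unlike grouping on the column values directly, this gives the two separate 'Rent Payment' runs different labels, which is what makes the grouping depend on row adjacency.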
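[Discussion]:
Up to row 4 it is assigning groups correctly; from row 5 onward it is not correct.
Why are you treating Rent Payment and Rep Payment as the same Description? @Rahulrajan
Sorry, that was a typo. I meant to ask for a check on the vendor, description, and days columns: if the vendor and description are the same and the difference in days between adjacent rows is ...
That is exactly what it does. @Rahulrajan There seems to be an error in your example: Rent Payment and Rep Payment are in the same group even though they are not the same.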