Frequency calculations on subgroups of pandas-groupby, insertion of new rows and rearrangement of columns
Posted: 2020-08-22 19:48:56
Question:
I need some help performing a few operations on subgroups, and I am really confused. I will try to describe the operations and the desired output briefly, using comments.
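(1) Calculate the percentage frequency of occurrences within each subgroup
(2) Insert a record that does not exist, filled with 0
(3) Rearrange the order of the records and the columns
Assume the df below is the raw data: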
import numpy as np
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
                   'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories', 'clothes', 'bags', 'bags', 'clothes', 'clothes']})
The grouped_df below is close to what I have in mind, but I cannot get the desired output from it.
grouped_df = df.groupby(['store', 'branch', 'products']).size().unstack('products').replace({np.nan: 0})
# output:
products      accessories  bags  clothes  shoes
store branch
1     A               0.0   0.0      1.0    1.0
      C               0.0   0.0      1.0    0.0
2     C               1.0   0.0      1.0    1.0
3     A               0.0   2.0      1.0    0.0
      C               0.0   0.0      1.0    0.0
# desired output: if (1), (2) and (3) take place somehow...
products      clothes  shoes  accessories  bags
store branch
1     B             0      0            0     0  # store 1 has 3 records in total (A: clothes, shoes; C: clothes), so each count of 1 becomes 33.3%
      A          33.3   33.3            0     0
      C          33.3    0.0            0     0
2     B             0      0            0     0
      A             0      0            0     0
      C          33.3   33.3         33.3     0
3     B             0      0            0     0  # store 3 has 4 records in total (A: 2 bags, 1 clothes; C: 1 clothes), so the 2 bags become 50% and so on
      A            25      0            0    50
      C            25      0            0     0
# (3) columns rearranged, with "clothes" and "shoes" going first
# (3)+(2) branch B has appeared and the order of branches has changed to B, A, C
# (1) percentages of the occurrences are calculated within each store group, as the comments above hopefully make clear
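I tried working on each group separately, but i) it does not take the replaced NaN values into account, and ii) I should avoid processing each group on its own, because I would then have to concatenate a lot of groups (this df is only an example), since I need to plot the whole thing later.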
grouped_df.loc[[1]].transform(lambda x: x*100/sum(x)).round(0)
>>>
products      accessories  bags  clothes  shoes
store branch
1     A               NaN   NaN     50.0  100.0  # why has it transformed on axis='columns'?
      C               NaN   NaN     50.0    0.0
I hope my question makes sense. Any insight into what I am trying to do would be much appreciated, thank you!
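A note on the `axis='columns'` puzzle in the output above: `DataFrame.transform` with its default `axis=0` hands the function one column at a time, so `sum(x)` is a column total, and an all-zero column gives 0/0, which is NaN. The sketch below rebuilds the store 1 block by hand and contrasts that with dividing by the total of the whole block. The names `store1`, `per_column` and `per_store` are illustrative assumptions, not part of the original post.

```python
import pandas as pd

# Counts for store 1 only, typed in by hand from the question's first output.
store1 = pd.DataFrame(
    {
        'accessories': [0.0, 0.0],
        'bags': [0.0, 0.0],
        'clothes': [1.0, 1.0],
        'shoes': [1.0, 0.0],
    },
    index=pd.MultiIndex.from_tuples([(1, 'A'), (1, 'C')], names=['store', 'branch']),
)

# transform() calls the lambda once per column, so x.sum() is that column's total.
# All-zero columns therefore become 0/0 = NaN.
per_column = store1.transform(lambda x: x * 100 / x.sum())

# Dividing by the grand total of the block gives the per-store share instead.
per_store = store1 * 100 / store1.to_numpy().sum()

print(per_column)
print(per_store.round(1))
```

With one shared denominator for the whole block, the zero cells stay 0 rather than turning into NaN.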
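Answer 1:
@Quang Hoang tried to help with this question the day before I posted my answer, and with that help I managed to find a solution.
To explain the last part of the calculation: I transform each element by dividing it by the total count of its group, so that each element's frequency is computed per level-0 group (per store) rather than per row, per column, or over the whole table.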
grouped_df = (
    df.groupby(['store', 'branch', 'products']).size()
    .unstack('branch')
    .reindex(['B', 'C', 'A'], axis=1, fill_value=0)  # add the missing branch B as a zero column
    .stack('branch')
    .unstack('products')
    .replace({np.nan: 0})
    # divide each count by the total number of records in its store
    .transform(lambda x: x*100/df.groupby(['store']).size())
    .round(1)
    .reindex(['clothes', 'shoes', 'accessories', 'bags'], axis='columns')
)
Running the code above produces the desired output:
products      accessories  bags  clothes  shoes
store branch
1     B               0.0   0.0      0.0    0.0
      C               0.0   0.0     33.3    0.0
      A               0.0   0.0     33.3   33.3
2     B               0.0   0.0      0.0    0.0
      C              33.3   0.0     33.3   33.3
3     B               0.0   0.0      0.0    0.0
      C               0.0   0.0     25.0    0.0
      A               0.0  50.0     25.0    0.0
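The per-store normalisation on its own can also be written without `transform`. As a minimal sketch: `DataFrame.div` accepts a `level` argument that broadcasts a Series of store totals across the matching level of the MultiIndex. `pd.crosstab` is swapped in here for the groupby/size/unstack chain, and the names `counts`, `store_totals` and `pct` are assumptions made for illustration. This covers step (1) only; it does not insert branch B or reorder anything.

```python
import pandas as pd

df = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
    'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                 'clothes', 'bags', 'bags', 'clothes', 'clothes'],
})

# Count table: rows are (store, branch), columns are products, missing pairs are 0.
counts = pd.crosstab([df['store'], df['branch']], df['products'])

# Number of records per store: the denominator for every row of that store.
store_totals = df.groupby('store').size()

# Broadcast the totals over the 'store' level of the row index.
pct = counts.div(store_totals, axis=0, level='store').mul(100).round(1)

print(pct)
```

Because the denominator is looked up by store, every row of store 3 is divided by 4 and every row of stores 1 and 2 by 3.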
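Comments:
I still believe that unstacking and restacking several times is not the most Pythonic way to get the object into the desired format, and I welcome any other answer with more elegant and more sophisticated code. As an example, my idea was to use astype('category') and reindex(level='branch'), but I am not yet at the point where I am comfortable with categorical indexes.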
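One way to avoid the repeated stack/unstack, sketched under assumptions: build the complete (store, branch) index with `pd.MultiIndex.from_product` and reindex against it once. This is a different technique from the `astype('category')` idea mentioned in the comment, and the names `branch_order`, `product_order`, `full_index` and `result` are hypothetical. Note that it adds a zero row for every missing store/branch pair, including (2, A).

```python
import pandas as pd

df = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
    'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                 'clothes', 'bags', 'bags', 'clothes', 'clothes'],
})

# Assumed target orderings, taken from the desired output in the question.
branch_order = ['B', 'A', 'C']
product_order = ['clothes', 'shoes', 'accessories', 'bags']

# Count table with products as columns; combinations never seen become 0.
counts = (
    df.groupby(['store', 'branch', 'products']).size()
    .unstack('products', fill_value=0)
)

# Every (store, branch) pair in the wanted order, whether it occurs in df or not.
full_index = pd.MultiIndex.from_product(
    [sorted(df['store'].unique()), branch_order], names=['store', 'branch']
)

result = (
    counts.reindex(index=full_index, columns=product_order, fill_value=0)
    # divide each row by the number of records in its store
    .div(df.groupby('store').size(), axis=0, level='store')
    .mul(100)
    .round(1)
)

print(result)
```

Doing the row and column reindex in one call keeps the ordering decisions in two plain lists, which are easy to change when more branches or products appear.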