Group by aggregation for missing values in range

Posted 2023-03-11


【Title】: Group by aggregation for missing values in range 【Posted】: 2021-06-20 07:08:25 【Question】:

I have a pandas DataFrame test, and I would like to convert its values into per-id proportions (percentages) for every integer in categories, for example:

import pandas as pd

categories = [0,1,2,3,4,5,6,7,8,9,10]


test  
id  value
foo  0
foo  0
foo  1
foo  1
foo  5
foo  4
foo  4
foo  4
foo  3
foo  3
bar  2
bar  2
bar  2
bar  2
bar  2
bar  6
bar  6
bar  6
bar  6
bar  6

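The problem I am having is mapping a proportion of 0 to every possible integer in categories that does not occur in the data. When I try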

test.groupby('id')['value'].apply(lambda x: x.value_counts(normalize=True)).unstack().fillna(0)

it returns the following DataFrame, but the values 7, 8, 9 and 10 are missing because they do not occur for any id:

    0   1   2   3   4   5   6
id                          
bar 0.0 0.0 0.5 0.0 0.0 0.0 0.5
foo 0.2 0.2 0.0 0.2 0.3 0.1 0.0

Is there an efficient way to include all the values of categories in the value_counts aggregation, so that it returns the following result?

    0   1   2   3   4   5   6   7   8   9  10
foo 0.2 0.2 0.0 0.2 0.3 0.1 0.0 0   0   0   0
bar 0.0 0.0 0.5 0.0 0.0 0.0 0.5 0   0   0   0


【Answer 1】:

Try this. It is not necessarily more efficient, but it is more readable:

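# df is the question's `test` frame. As written this returns counts;
# passing normalize='index' to crosstab would give per-id proportions.
(pd.crosstab(df['id'], df['value'])
   .reindex(categories, axis=1, fill_value=0)
)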
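
The question asks for proportions rather than counts, so here is a minimal sketch of the same crosstab idea with row normalization. The frame is rebuilt from the question's sample data so the snippet runs on its own; the variable name proportions is only illustrative.

```python
import pandas as pd

categories = list(range(11))

# The question's sample data, rebuilt so the snippet is self-contained.
test = pd.DataFrame({
    "id": ["foo"] * 10 + ["bar"] * 10,
    "value": [0, 0, 1, 1, 5, 4, 4, 4, 3, 3] + [2] * 5 + [6] * 5,
})

# normalize="index" divides each row by its total, giving per-id proportions.
# reindex then adds the categories that never occur, filled with 0.0.
proportions = (
    pd.crosstab(test["id"], test["value"], normalize="index")
      .reindex(categories, axis=1, fill_value=0.0)
)
print(proportions)
```

Reindexing after the aggregation keeps the fix independent of how the table was produced, so the same reindex call should also work on the unstacked value_counts result from the question.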


【Answer 2】:

`Categorical`

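# Declare every allowed value up front so unseen categories are kept.
df['value'] = pd.Categorical(df.value, categories)

# observed=False keeps categories with no rows (made explicit here because
# the default has changed between pandas versions).
df.groupby(['id', 'value'], observed=False).size().unstack()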

value  0  1  2  3  4  5  6  7  8  9  10
id                                     
bar    0  0  5  0  0  0  5  0  0  0   0
foo    2  2  0  2  3  1  0  0  0  0   0
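
The table above holds counts, while the question wants proportions. A minimal sketch of one way to get there, assuming the question's sample data and dividing each row by its total (the names counts and proportions are illustrative):

```python
import pandas as pd

categories = list(range(11))

# The question's sample data, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "id": ["foo"] * 10 + ["bar"] * 10,
    "value": [0, 0, 1, 1, 5, 4, 4, 4, 3, 3] + [2] * 5 + [6] * 5,
})
df["value"] = pd.Categorical(df["value"], categories=categories)

# observed=False asks groupby to keep categories that have no rows.
counts = (
    df.groupby(["id", "value"], observed=False)
      .size()
      .unstack(fill_value=0)
)

# Divide each row by its own total to turn counts into per-id proportions.
proportions = counts.div(counts.sum(axis=1), axis=0)
print(proportions)
```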

Over-engineered

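import numpy as np

# i: integer row code per record, r: the unique ids in order of appearance
i, r = pd.factorize(df['id'].to_numpy())
# The values 0..10 double as column positions, so use them directly;
# factorizing them would number the columns by order of appearance instead.
j = df['value'].astype(int).to_numpy()

b = np.zeros((r.size, len(categories)), np.int64)

# Unbuffered add: repeated (row, column) pairs each contribute 1
np.add.at(b, (i, j), 1)

pd.DataFrame(b, r, categories)

     0   1   2   3   4   5   6   7   8   9   10
foo   2   2   0   2   3   1   0   0   0   0   0
bar   0   0   5   0   0   0   5   0   0   0   0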
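
The NumPy approach only gives the right table if each value lands in the column that belongs to its category. A hedged sketch of one way to make that mapping explicit, using Index.get_indexer so it would also hold for categories that are not the contiguous range 0..10. It assumes every value is present in categories, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

categories = list(range(11))

# The question's sample data, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "id": ["foo"] * 10 + ["bar"] * 10,
    "value": [0, 0, 1, 1, 5, 4, 4, 4, 3, 3] + [2] * 5 + [6] * 5,
})

# Row position for each record, plus the unique ids in order of appearance.
row_codes, row_labels = pd.factorize(df["id"])

# Column position of each value within `categories` (-1 would mean "not found").
col_codes = pd.Index(categories).get_indexer(df["value"])
assert (col_codes >= 0).all(), "every value is assumed to be in categories"

counts = np.zeros((len(row_labels), len(categories)), dtype=np.int64)
# Unbuffered add, so repeated (row, column) pairs are each counted.
np.add.at(counts, (row_codes, col_codes), 1)

result = pd.DataFrame(counts, index=row_labels, columns=categories)
print(result)
```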

【Discussion】:

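I knew about Categorical too, but value_counts fails badly with it, whereas size apparently works.

I went through the list to see which one works... I knew it was one of them. I tried crosstab first, but no (-:

So it turns out to be groupby that respects the categorical type. Hmm!

Actually df.value.value_counts() works just fine. That is how categoricals behave.

@QuangHoang With groupby('id')['value'], the missing categories do not show up. groupby('value')['id'].value_counts() does not work for me (1.1.4).

I did not expect that... but yes. That is unfortunate.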
