pandas: get the average and mode, grouped by specified columns [duplicate]

Posted 2023-03-11

[Title]: pandas get the average and mode group by specified columns [duplicate] [Posted]: 2018-06-21 20:54:59 [Question]:

     rev_id  worker_id  toxicity  toxicity_score
0    2232.0        723         0             0.0
1    2232.0       4000         0             0.0
2    2232.0       3989         0             1.0
3    2232.0       3341         0             0.0
4    2232.0       1574         0             1.0
5    2232.0       1508         0             1.0
6    2232.0        772         0             1.0
7    2232.0        680         0             0.0
8    2232.0        405         0             1.0
9    2232.0       4020         1            -1.0
10   4216.0        500         0             0.0
11   4216.0        599         0             0.0
12   4216.0        339         0             2.0
13   4216.0        257         0             0.0
14   4216.0        303         0             1.0
15   4216.0        188         0             0.0
16   4216.0       1549         0             1.0
17   4216.0         64         0             1.0
18   4216.0       1527         0             0.0
19   4216.0       1502         0             0.0
20   8953.0       2596         0             1.0
21   8953.0       2403         0             0.0
22   8953.0       2539         0             0.0
23   8953.0       2542         0             0.0
24   8953.0       2544         0             0.0
25   8953.0       1016         0             0.0
26   8953.0       2550         0             0.0
27   8953.0       2578         0             0.0
28   8953.0       2494         0             0.0
29   8953.0        971         0             0.0

I want to get the mode (1 or 0) of toxicity and the mean of toxicity_score, grouped by rev_id, using pandas. How can I do this? Thanks.

[Answer 1]:

It looks like you need groupby with agg, aggregating toxicity_score with mean and toxicity with mode:

# mode() returns a Series; if a group can have ties, take one value with x.mode().iat[0]
df = (df.groupby('rev_id', as_index=False)
        .agg({'toxicity_score': 'mean', 'toxicity': lambda x: x.mode()}))

An alternative is value_counts, taking the first value of the resulting index:

# value_counts() sorts by frequency (descending), so index[0] is the most common value
df = (df.groupby('rev_id', as_index=False)
        .agg({'toxicity_score': 'mean', 'toxicity': lambda x: x.value_counts().index[0]}))

print (df)
   rev_id  toxicity_score  toxicity
0  2232.0             0.4         0
1  4216.0             0.5         0
2  8953.0             0.1         0
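The approach above can be tried end to end with a minimal sketch like the following. The small DataFrame here is an illustrative stand-in, not the question's full table, and `result` is a name chosen for this example. It also uses `x.mode().iat[0]` instead of a bare `x.mode()`, on the assumption that you want exactly one value per group even when there are ties.

```python
import pandas as pd

# Illustrative stand-in for the question's table (not the full data).
df = pd.DataFrame({
    "rev_id": [2232.0, 2232.0, 2232.0, 4216.0, 4216.0, 4216.0],
    "worker_id": [723, 4000, 4020, 500, 339, 303],
    "toxicity": [0, 0, 1, 0, 0, 0],
    "toxicity_score": [0.0, 1.0, -1.0, 0.0, 2.0, 1.0],
})

# as_index=False keeps rev_id as a regular column in the result.
result = (
    df.groupby("rev_id", as_index=False)
      .agg({
          "toxicity_score": "mean",
          # mode() returns a Series; .iat[0] picks its first entry.
          "toxicity": lambda x: x.mode().iat[0],
      })
)
print(result)
```

The result has three columns (rev_id, toxicity_score, toxicity) and one row per rev_id.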

[Discussion]:

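After the operation, rev_id is no longer a column. How can I turn the result into three columns?

Please check the last edit.

mode() returns two numbers, e.g. [0,1]. I only want the most common number for each rev_id group.

It looks like you need x.mode()[0], or upgrade pandas to the latest version; 0.22.0 works fine.

Thanks. If a rev_id has five 1s and five 0s, what is the order of the values in x.mode()?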
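On the tie case raised in the last comment: the pandas documentation describes Series.mode() as returning the mode(s) in sorted order, so with five 1s and five 0s both values come back with 0 first. A small sketch to check this locally (the names `s`, `modes`, and `first` are just for this example):

```python
import pandas as pd

# Five 1s and five 0s: both values are equally common.
s = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

modes = s.mode()        # Series holding every tied value, sorted ascending
print(modes.tolist())

first = modes.iat[0]    # the smallest of the tied values
print(first)
```

If the larger value should win on a tie instead, `modes.max()` would be one option.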
