How to use the split-apply-combine pattern of pandas groupby() to normalize multiple columns simultaneously

Posted 2023-03-12

Asked: 2017-12-14 07:15:21

Question:

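I am trying to normalize experimental data in a pandas data table that contains multiple columns with numerical observables (features), columns with the date and the experimental condition, and additional non-numerical columns such as file names.

I would like to

- use the split-apply-combine paradigm
- normalize within groups, using aggregate statistics of subgroups
- use different normalizations (e.g. divide by the control mean, Z-score)
- apply this to all numerical columns (observables)
- finally, generate an augmented data table that has the same structure as the original one but with additional columns, e.g. for the column Observable1 a column normalized_Observable1 should be added

A simplified data table with this structure can be generated with this code snippet: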

import numpy as np
import pandas as pd
df = pd.DataFrame({
   'condition': ['ctrl', 'abc', 'ctrl', 'abc', 'def', 'ctlr', 'ctlr', 'asdasd', 'afff', 'afff', 'gr1','gr2', 'gr2', 'ctrl', 'ctrl', 'kjkj','asht','ctrl'],
   'date':  ['20170131', '20170131', '20170131', '20170131','20170131', '20170606', '20170606', '20170606', '20170606', '20170606', '20170404', '20170404', '20170404', '20170404', '20170404', '20161212', '20161212', '20161212'],
   'observation1':  [1.2, 2.2, 1.3, 1.1, 2.3 , 2.3, 4.2, 3.3, 5.1, 3.3, 3.4, 5.5, 9.9, 3.2, 1.1, 3.3, 1.2, 5.4],
   'observation2':  [3.1, 2.2, 2.1, 1.2,  2.4, 1.2, 1.5, 1.33, 1.5, 1.6, 1.4, 1.3, 0.9, 0.78, 1.2, 4.0, 5.0, 6.0],
   'observation3':  [2.0, 1.2, 1.2, 2.01, 2.55, 2.05, 1.66, 3.2, 3.21, 3.04, 8.01, 9.1, 7.06, 8.1, 7.9, 5.12, 5.23, 5.15],
   'rawsource': ["1.tif", "2.tif", "3.tif",  "4.tif", "5.tif","6.tif", "7.tif", "8.tif", "9.tif", "10.tif", "11.tif", "12.tif", "13.tif", "14.tif", "15.tif", "16.tif", "17.tif", "18.tif"]
})
print(df)

which looks like this:

   condition      date  observation1  observation2  observation3 rawsource
0       ctrl  20170131           1.2          3.10          2.00     1.tif
1        abc  20170131           2.2          2.20          1.20     2.tif
2       ctrl  20170131           1.3          2.10          1.20     3.tif
3        abc  20170131           1.1          1.20          2.01     4.tif
4        def  20170131           2.3          2.40          2.55     5.tif
5       ctlr  20170606           2.3          1.20          2.05     6.tif
6       ctlr  20170606           4.2          1.50          1.66     7.tif
7     asdasd  20170606           3.3          1.33          3.20     8.tif
8       afff  20170606           5.1          1.50          3.21     9.tif
9       afff  20170606           3.3          1.60          3.04    10.tif
10       gr1  20170404           3.4          1.40          8.01    11.tif
11       gr2  20170404           5.5          1.30          9.10    12.tif
12       gr2  20170404           9.9          0.90          7.06    13.tif
13      ctrl  20170404           3.2          0.78          8.10    14.tif
14      ctrl  20170404           1.1          1.20          7.90    15.tif
15      kjkj  20161212           3.3          4.00          5.12    16.tif
16      asht  20161212           1.2          5.00          5.23    17.tif
17      ctrl  20161212           5.4          6.00          5.15    18.tif

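Now, for each experiment date I have different experimental conditions, but I always have one condition named ctrl. One of the normalizations I would like to perform is to calculate (for every numerical column) the mean of the control experiment for that date and then divide all observables from that date by their corresponding mean.

I can quickly calculate some summary statistics per date and per condition using:

# Restrict the aggregation to the numerical columns; mean and std skip NaN by default
num_cols = ["observation1", "observation2", "observation3"]
grsummary = df.groupby(["date", "condition"])[num_cols].agg(["min", "max", "mean", "std"])

Then I would like to apply these summary statistics in a normalization for each experiment date: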

grdate = df.groupby("date")

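and apply the normalization along these lines:

def normalize_by_ctrlmean(grp_frame, summarystats):
    # the following is only pseudo-code as I don't know how to do this
    grp_frame / summarystats(nanmean)

grdate.apply(normalize_by_ctrlmean, summarystats=grsummary)

The last step is only pseudo-code, and it is the part I am struggling with. I could do the normalization with nested for loops over the dates, the conditions and the names of the numerical columns, but I am new to the split-apply-combine paradigm and I think there must be a simple solution. Any help is greatly appreciated.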

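Answer 1:

Here is how you can do this using apply on the grouped data frame:

Split

Since you want to perform the operation "per date", you only need to split by date:

# group_keys=False keeps the original row index on the result of apply()
grdate = df.groupby("date", group_keys=False)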

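Apply and combine

Next, define a transformation function that can be applied to each group and takes the group itself as its argument.

In your case, the function should compute the mean of the group's ctrl values and then divide all of the group's observations by that mean: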

def norm_apply(group):

    # Select the 'ctrl' condition
    ctrl_selected = group[group['condition']=='ctrl']

    # Extract its numerical values
    ctrl_numeric = ctrl_selected.select_dtypes(include=[np.number])

    # Compute the means (column-wise)
    ctrl_means = np.nanmean(ctrl_numeric,axis=0) 

    # Extract numerical values for all conditions
    group_numeric = group.select_dtypes(include=[np.number])

    # Divide by the ctrl means
    divided = group_numeric / ctrl_means

    # Return result
    return divided

(Or, if you like, you can do it as a silly one-liner...)

norm_apply = lambda x : x.select_dtypes(include=[np.number]) / np.nanmean(x[x['condition']=='ctrl'].select_dtypes(include=[np.number]),axis=0)

Now you can simply apply this function to your grouped data frame:

normed = grdate.apply(norm_apply)

This should give you the values you need, combined into the same shape and row order as your original df:

normed.head()

>>   observation1  observation2  observation3
0          0.96      1.192308       1.25000
1          1.76      0.846154       0.75000
2          1.04      0.807692       0.75000
3          0.88      0.461538       1.25625
4          1.84      0.923077       1.59375
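To make the three stages visible, here is a minimal sketch of the same normalization written as an explicit loop. The toy frame and its values are hypothetical (they are not the data from the question), and the loop is only meant to illustrate roughly what the groupby/apply call does for you, not how pandas implements it.

```python
import pandas as pd

# Toy data (hypothetical values): two dates, each with at least one 'ctrl' row
toy = pd.DataFrame({
    "condition": ["ctrl", "abc", "ctrl", "ctrl", "xyz"],
    "date": ["d1", "d1", "d1", "d2", "d2"],
    "observation1": [1.0, 2.0, 3.0, 4.0, 8.0],
    "observation2": [2.0, 2.0, 6.0, 5.0, 10.0],
})

pieces = []
for date, group in toy.groupby("date"):                       # split
    numeric = group.select_dtypes(include="number")
    ctrl_means = numeric[group["condition"] == "ctrl"].mean()  # column-wise ctrl means
    pieces.append(numeric / ctrl_means)                        # apply

# combine: stack the per-date pieces and restore the original row order
normed_loop = pd.concat(pieces).loc[toy.index]
print(normed_loop)
```

Because each piece keeps the row labels of its group, the combined result can be aligned with the original frame by index.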

Merging into the original data frame

One way to add these results back to the original df is the following:

# Add prefix to column names
normed = normed.add_prefix('normed_')

# Concatenate with initial data frame
final = pd.concat([df,normed],axis=1)
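print(final.head())

Finally, you can group by date and condition and look at the means:

# numeric_only=True skips the non-numerical rawsource column
final.groupby(['date','condition']).mean(numeric_only=True)

If everything worked, the means for the ctrl condition should all be 1.0. (Note that in the sample data the rows for 20170606 are labelled ctlr rather than ctrl, so that date has no control rows and its normalized values come out as NaN.)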
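The check can also be written as an assertion. The following is a sketch on a hypothetical toy frame (not the data from the question); the date d3 deliberately has only a mislabelled ctlr row to show how missing controls surface as NaN. The normalization here uses a per-date lookup of the ctrl means instead of apply, which is a swapped-in technique chosen to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); 'd3' has no row labelled exactly 'ctrl'
toy = pd.DataFrame({
    "condition": ["ctrl", "abc", "ctrl", "ctrl", "xyz", "ctlr"],
    "date": ["d1", "d1", "d1", "d2", "d2", "d3"],
    "observation1": [1.0, 2.0, 3.0, 4.0, 8.0, 5.0],
    "observation2": [2.0, 2.0, 6.0, 5.0, 10.0, 7.0],
})
num_cols = ["observation1", "observation2"]
is_ctrl = toy["condition"] == "ctrl"

# Per-date ctrl means, looked up for every row (NaN where a date has no ctrl row)
ctrl_means = toy[is_ctrl].groupby("date")[num_cols].mean()
normed = toy[num_cols] / ctrl_means.reindex(toy["date"]).to_numpy()
final = pd.concat([toy, normed.add_prefix("normed_")], axis=1)

# Check 1: within each date, the normalized ctrl rows average to 1.0
normed_cols = ["normed_observation1", "normed_observation2"]
ctrl_check = final[is_ctrl].groupby("date")[normed_cols].mean()
assert np.allclose(ctrl_check, 1.0)

# Check 2: list the dates that have no ctrl row at all
missing = sorted(set(toy["date"]) - set(ctrl_means.index))
print(missing)
```

Running the second check before normalizing is a cheap way to catch misspelled condition labels.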

(Side note: although Ian Thompson's answer also works, I believe this approach is closer to the split-apply-combine ideology.)
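The question also mentions Z-scores as a second normalization. A minimal sketch of that variant, on hypothetical toy data, could use groupby().transform, which returns per-group statistics already aligned with the original rows. The assumptions here are mine: the Z-score is taken relative to all rows of the same date (not only the ctrl rows), and the sample standard deviation (ddof=1, the pandas default) is used.

```python
import pandas as pd

# Toy data (hypothetical values): three rows per date
toy = pd.DataFrame({
    "date": ["d1", "d1", "d1", "d2", "d2", "d2"],
    "observation1": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "observation2": [10.0, 10.0, 40.0, 0.0, 5.0, 10.0],
})
num_cols = ["observation1", "observation2"]

grouped = toy.groupby("date")[num_cols]

# transform() broadcasts each group's statistic back onto that group's rows,
# so the results share the index of toy and can be combined element-wise
zscores = (toy[num_cols] - grouped.transform("mean")) / grouped.transform("std")

# Append the new columns next to the original ones
final = pd.concat([toy, zscores.add_prefix("zscore_")], axis=1)
print(final)
```

A group with a single row has no sample standard deviation, so its Z-scores would be NaN.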

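Comments:

Great, this is exactly what I was looking for. Thanks a lot.

Answer 2:

I am a bit confused about what you want your function to do. I don't have enough reputation to comment, so I will do my best to answer your question.

Seeing that your function is called normalize_by_ctrlmean, I assume you want to always divide each observation by the mean of the ctrl group. To do this, we first have to tidy up your data a bit using the melt function.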

df1 = df.melt(id_vars = ['condition',
                         'date',
                         'rawsource'],
              value_vars = ['observation1',
                            'observation2',
                            'observation3'],
              var_name = 'observations')

df1.head()

Next we compute the mean of the ctrl group:

# Average only the 'value' column; the rename mapping must be a dict
ctrl_mean = df1[df1.condition == 'ctrl'].groupby(['date',
                                                  'observations'])['value'].mean().reset_index().rename(columns = {'value' : 'ctrl_mean'})

ctrl_mean

Merge this data frame with the melted data frame.

df2 = df1.merge(ctrl_mean,
                how = 'inner',
                on = ['date',
                      'observations'])

df2.head()

Finally, divide the value column by the ctrl_mean column and insert the result into the data frame.

df2.insert(df2.shape[1],
           'normalize_by_ctrlmean',
           df2.loc[:, 'value'] / df2.loc[:, 'ctrl_mean'])

df2.head()
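The long-format steps can be tried end to end on a small frame. This is a sketch on hypothetical toy data (not the data from the question). One deliberate difference from the code above: it merges with how="left", so rows whose date has no ctrl entry are kept with NaN, whereas how="inner" drops them.

```python
import pandas as pd

# Toy data (hypothetical values): one ctrl row per date
toy = pd.DataFrame({
    "condition": ["ctrl", "abc", "ctrl", "xyz"],
    "date": ["d1", "d1", "d2", "d2"],
    "observation1": [2.0, 4.0, 5.0, 10.0],
    "observation2": [1.0, 3.0, 4.0, 2.0],
})

# Wide -> long: one row per (original row, observation column)
long_df = toy.melt(id_vars=["condition", "date"], var_name="observations")

# Mean of the ctrl rows per date and observation
ctrl_mean = (
    long_df[long_df["condition"] == "ctrl"]
    .groupby(["date", "observations"])["value"]
    .mean()
    .rename("ctrl_mean")
    .reset_index()
)

# Attach the matching ctrl mean to every row, then normalize
merged = long_df.merge(ctrl_mean, how="left", on=["date", "observations"])
merged["normalize_by_ctrlmean"] = merged["value"] / merged["ctrl_mean"]
print(merged)
```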

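Hopefully this gets you closer to what you need.

Edit

Per your comment, I will show how to get back to a similar data frame with observation columns, first using the pivot_table function and then using the groupby function.

pivot_table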

df2.pivot_table(index = ['date', # columns to use as the index
                   'condition',
                   'rawsource'],
          columns = 'observations', # this will make columns out of the values in this column
          values = ['value', # these will be the values in each column
                    'ctrl_mean', # swaplevel swaps the column levels (axis = 1), sort_index sorts and "smooshes" them together
                    'normalize_by_ctrlmean']).swaplevel(axis = 1).sort_index(axis = 1).reset_index() # reset_index so you can refer to specific columns

groupby

df2.groupby(['date', # groupby these columns to make the index
             'condition',
             'rawsource',
             'observations']).agg({'value' : 'max', # take the max of these as the aggregate (there was only one value for each so the max just returns that value)
                                   'ctrl_mean' : 'max', # unstack('observations') makes columns out of the 'observations'
                                   'normalize_by_ctrlmean' : 'max'}).unstack('observations').swaplevel(axis = 1).sort_index(axis = 1).reset_index() # these do the same thing as on the pivot_table example

Additionally, you can remove the swaplevel and sort_index calls to keep the aggregate columns on the top level instead of the observations.
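Both variants leave a two-level column index. If flat column names are preferred for plotting, one option is to join the two levels into a single name. The sketch below uses hypothetical toy data and a hypothetical naming scheme (the suffix _normalized); it is an illustration, not part of the original answer.

```python
import pandas as pd

# Toy long-format data (hypothetical values), shaped like df2 above
long_df = pd.DataFrame({
    "date": ["d1", "d1", "d1", "d1"],
    "condition": ["ctrl", "abc", "ctrl", "abc"],
    "rawsource": ["1.tif", "2.tif", "1.tif", "2.tif"],
    "observations": ["observation1", "observation1", "observation2", "observation2"],
    "value": [2.0, 4.0, 1.0, 3.0],
    "normalize_by_ctrlmean": [1.0, 2.0, 1.0, 3.0],
})

# Long -> wide; the columns become (kind, observation) pairs
wide = long_df.pivot_table(
    index=["date", "condition", "rawsource"],
    columns="observations",
    values=["value", "normalize_by_ctrlmean"],
)

# Flatten the pairs: raw values keep the observation name,
# normalized values get the hypothetical '_normalized' suffix
wide.columns = [
    obs if kind == "value" else f"{obs}_normalized"
    for kind, obs in wide.columns
]
wide = wide.reset_index()
print(wide)
```

Each (date, condition, rawsource) combination holds a single value per observation here, so the default mean aggregation of pivot_table returns that value unchanged.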

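Comments:

Hi Ian, thanks, this already looks good. At the very least it makes me think about the problem differently. I have to read up on the pd.melt function, which I am not familiar with, and on the difference between wide and long format. I think your suggestion gets me most of the way there, since I can compute the desired normalization this way. For downstream plotting I will need to go back to wide format and create new columns observable1_normalized etc. I am just reading up on pd.pivot to see how to do this, and will update my post once I have found a solution.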
