Looping over multiple variable combinations in pandas with groupby

Posted 2023-03-11

【Title】: Looping over multiple variable combinations in pandas with groupby 【Posted】: 2014-04-26 22:28:58 【Question】:

So I have some data, for example:

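import pandas

# The column data must be passed as a dict, so the braces are required.
df = pandas.DataFrame({"X1": [1, 2, 5] * 4, "X2": [1, 2, 10] * 4,
                       "Y": [2, 4, 6] * 4, "group": ["A", "B"] * 6})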

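I want to build a table of linear regression slope coefficients for each group and each relevant combination of variables, something like: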

group  x   y  coef
A      X1  Y  0.97
A      X2  Y  0.85
B      X1  Y  0.73
B      X2  Y  0.81

This is what I am trying:

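import scipy.stats

def OLS_slope_coef(df, xcol, ycol):
    # The original used df.ix[:, xcol]; .ix was removed in pandas 1.0,
    # so the columns are selected by label here.
    x = df[xcol]
    y = df[ycol]
    slope, intercept, r, p, stderr = scipy.stats.linregress(x, y)
    return slope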


s_df = pandas.DataFrame()
for x in ['X1', 'X2']:
    for y in ['Y']:
        s_df.ix[(x, y), 'coef'] = df.groupby('group').apply(OLS_slope_coef, x, y)

But it raises ValueError: Incompatible indexer with Series.

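Is there a way to do something like this? I don't care whether the group, x and y variables end up as the index or as DataFrame columns (I am going to call .reset_index() anyway).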

【Comments】:

【Answer 1】:

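The problem is that .apply returns a Series with two elements (because there are two groups), indexed by "A" and "B", so it is incompatible with .ix[(x,y), 'coef']. You could do this instead: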

import pandas as pd

# One row per group; each x variable gets its own coefficient column.
s_df = pd.DataFrame(index=['A', 'B'])
for x in ['X1', 'X2']:
    for y in ['Y']:
        s_df.loc[:, x + '-coef'] = df.groupby('group').apply(OLS_slope_coef, x, y)

Which results in:

   X1-coef  X2-coef
A     0.92     0.37
B     0.92     0.37

[2 rows x 2 columns]
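
To see why the original assignment failed, it can help to inspect what the grouped apply returns. The following is a minimal sketch, not the answerer's code: the helper name `slope` is hypothetical, and `np.polyfit` stands in for `scipy.stats.linregress` only to keep the example dependency-light (a degree-1 least-squares fit yields the same slope). The columns are selected before `apply` so that the grouping column is not passed to the helper.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X1": [1, 2, 5] * 4, "X2": [1, 2, 10] * 4,
                   "Y": [2, 4, 6] * 4, "group": ["A", "B"] * 6})

def slope(g, xcol, ycol):
    # Hypothetical helper: degree-1 least-squares fit, first coefficient is the slope.
    return np.polyfit(g[xcol], g[ycol], 1)[0]

# Restrict to the columns the helper needs, then apply it once per group.
res = df.groupby("group")[["X1", "Y"]].apply(slope, "X1", "Y")

# The result is a Series with one entry per group, indexed by the group labels.
print(type(res).__name__, list(res.index))
```

Because the result is indexed by "A" and "B", it lines up with a frame whose index is the group labels, which is what the loop above relies on.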

Alternatively, loop inside the applied function and return a DataFrame:

import pandas as pd
def ols(df, xcols):
    from itertools import chain
    from scipy.stats import linregress
    fitcols = ['slope', 'intercept', 'rval', 'pval', 'stderr']
    cols = pd.MultiIndex.from_tuples([(var, k) for var in xcols for k in fitcols])
    fit = [linregress(df[xcol], df.Y) for xcol in xcols]
    return pd.DataFrame([list(chain(*fit))], columns=cols)

fit = df.groupby('group').apply(ols, xcols=['X1', 'X2'])
fit.reset_index(level=1, drop=True, inplace=True)

This gives:

          X1                                    X2                               
       slope  intercept  rval  pval  stderr  slope  intercept  rval  pval  stderr
group                                                                            
A       0.92       1.54  0.96     0    0.13   0.37        2.4  0.91  0.01    0.08
B       0.92       1.54  0.96     0    0.13   0.37        2.4  0.91  0.01    0.08

[2 rows x 10 columns]
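
If only the slopes are needed from a result with two-level columns like the one above, a cross-section on the inner column level is one option. The sketch below is an illustration under assumptions: `fit_group` is a hypothetical helper, it keeps only the slope and intercept, and it uses `np.polyfit` in place of `linregress`, so the frame it builds is a reduced version of the table shown above rather than a reproduction of it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X1": [1, 2, 5] * 4, "X2": [1, 2, 10] * 4,
                   "Y": [2, 4, 6] * 4, "group": ["A", "B"] * 6})

def fit_group(g, xcols, ycol="Y"):
    # Hypothetical helper: one (slope, intercept) pair per x column,
    # labelled with a two-level index (variable, statistic).
    cols = pd.MultiIndex.from_product([xcols, ["slope", "intercept"]])
    values = [c for x in xcols for c in np.polyfit(g[x], g[ycol], 1)]
    return pd.Series(values, index=cols)

# One column per group, then transpose so that groups become the rows.
fit = pd.DataFrame({name: fit_group(g, ["X1", "X2"])
                    for name, g in df.groupby("group")}).T

# Keep only the "slope" entries from the inner column level.
slopes = fit.xs("slope", axis=1, level=1)
print(slopes.round(2))
```

The cross-section drops the inner level, leaving one slope column per x variable and one row per group.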

【Discussion】:

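The y column in my desired output table matters: I want to be able to run all combinations of multiple independent variables (X1, X2, ..., Xn) and multiple dependent variables (Y1, Y2, ..., Yn). I think I can use your first suggestion with the y variable included, then use pandas.melt() to move the column headers into a column, and then split that column on the hyphen or some other separator. The last part of that solution is here: ***.com/questions/17116814/…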
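
The plan described in this comment (wide table, then melt, then split on the hyphen) could look roughly like the sketch below. It is one possible reading of the comment, not code from the thread: the helper `slope`, the "x-y" column naming, and the variable names are assumptions, and `np.polyfit` is again used in place of `scipy.stats.linregress` for the slope.

```python
from itertools import product

import numpy as np
import pandas as pd

df = pd.DataFrame({"X1": [1, 2, 5] * 4, "X2": [1, 2, 10] * 4,
                   "Y": [2, 4, 6] * 4, "group": ["A", "B"] * 6})

xcols = ["X1", "X2"]   # independent variables
ycols = ["Y"]          # dependent variables; extend with Y1, Y2, ... as needed

def slope(g, xcol, ycol):
    # Hypothetical helper: slope of a degree-1 least-squares fit.
    return np.polyfit(g[xcol], g[ycol], 1)[0]

grouped = df.groupby("group")

# Wide table: one row per group, one "x-y" column per variable combination.
wide = pd.DataFrame({
    f"{x}-{y}": grouped[[x, y]].apply(slope, x, y)
    for x, y in product(xcols, ycols)
})

# Melt the column headers into a single column, then split it on the hyphen.
long_df = (wide.rename_axis("group").reset_index()
               .melt(id_vars="group", var_name="pair", value_name="coef"))
long_df[["x", "y"]] = long_df["pair"].str.split("-", expand=True)
long_df = (long_df[["group", "x", "y", "coef"]]
           .sort_values(["group", "x", "y"])
           .reset_index(drop=True))
print(long_df.round(2))
```

Splitting on the hyphen assumes that no column name contains one; a different separator, or building the long table row by row, avoids that constraint.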