List of Pandas Dataframes: Merging Function Outputs

Posted 2023-03-12

Title: List of Pandas Dataframes: Merging Function Outputs
Published: 2020-06-17 17:11:45
Question:

I have looked through previous similar questions but could not find any applicable leads:

I have a dataframe named "df" with roughly the following structure:

    Income  Income_Quantile  Score_1  Score_2  Score_3
0   100000                5       75       75      100
1    97500                5       80       76       94
2    80000                5       79       99       83
3    79000                5       88       78       91
4    70000                4       55       77       80
5    66348                4       65       63       57
6    67931                4       60       65       57
7    69232                4       65       59       62
8    67948                4       64       64       60
9    50000                3       66       50       60
10   49593                3       58       51       50
11   49588                3       58       54       50
12   48995                3       59       59       60
13   35000                2       61       50       53
14   30000                2       66       35       77
15   12000                1       22       60       30
16   10000                1       15       45       12

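Using the "Income_Quantile" column and the following for-loop, I split the dataframe into a list of 5 sub-dataframes (each containing the observations from the same income quantile):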

dfs = []

for level in df.Income_Quantile.unique():
    df_temp = df.loc[df.Income_Quantile == level]
    dfs.append(df_temp)
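
The same split can also be obtained by iterating over a GroupBy object, which keeps each quantile label together with its sub-dataframe. The following is a minimal sketch, assuming a small stand-in frame built from a few rows of the table above; the name `groups` is illustrative and not part of the original code.

```python
import pandas as pd

# Small stand-in for the question's df (a few rows copied from the table above).
df = pd.DataFrame({
    "Income": [100000, 97500, 70000, 66348, 50000],
    "Income_Quantile": [5, 5, 4, 4, 3],
    "Score_1": [75, 80, 55, 65, 66],
})

# Iterating over a GroupBy yields (key, sub-frame) pairs, sorted by key by default.
groups = {level: sub for level, sub in df.groupby("Income_Quantile")}

# A plain list, equivalent in content to the loop-built dfs (but ordered by quantile).
dfs = list(groups.values())
```

Keeping the keys around is useful later, because the quantile label is otherwise only recoverable from the "Income_Quantile" column inside each piece.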

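Now I would like to apply the following function, which computes the Spearman correlation, p-value, and t-statistic for a dataframe (FYI: the function relies on scipy.stats functions):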

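# Score columns to compare against Income
cols = ['Score_1', 'Score_2', 'Score_3']

def create_list_of_scores(df):
    # One row per statistic, one column per score
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    # Note: spearmanr returns (correlation, p-value), so [1] is the p-value;
    # use [0] to get the correlation coefficient itself.
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]

    return df_result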

The functions used by "create_list_of_scores", namely "ttest_ind" and "spearmanr", can be imported from scipy.stats as follows:

from scipy.stats import ttest_ind
from scipy.stats import spearmanr
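
Both functions return a pair whose first element is the statistic and whose second element is the p-value, which is why the function indexes their results with [0] and [1]. A minimal sketch of that layout, using made-up sample lists `x` and `y` that are not from the question:

```python
from scipy.stats import spearmanr, ttest_ind

# Hypothetical sample data: y increases strictly with x.
x = [1, 2, 3, 4, 5]
y = [10, 20, 30, 40, 55]

# spearmanr: element 0 is the rank correlation coefficient, element 1 its p-value.
res = spearmanr(x, y)
rho, rho_p = res[0], res[1]

# ttest_ind: element 0 is the t-statistic, element 1 its p-value.
t_stat, t_p = ttest_ind(x, y)
```

With this layout, a row labelled 'correlation' would normally take element 0 of the `spearmanr` result.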

I tested the function on one of the sub-dataframes:

data = dfs[1]
result = create_list_of_scores(data)

It works as expected.

However, problems appear when I apply the function to the whole list of dataframes "dfs". If I apply it to the list as follows:

result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)

the output contains the columns "Score_1", "Score_2" and "Score_3" repeated 5 times (15 columns in total).
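
The repeated columns follow from how `pd.concat` treats its `axis` argument: with `axis=1` the frames are aligned on their row labels and placed side by side, while `axis=0` stacks them and keeps the shared columns. A small sketch with two placeholder frames shaped like the function's output (the numbers are dummies, not real statistics):

```python
import pandas as pd

cols = ["Score_1", "Score_2", "Score_3"]
idx = ["t-statistic", "p-value", "correlation"]

# Two placeholder result frames with the same row and column labels.
a = pd.DataFrame(0.0, index=idx, columns=cols)
b = pd.DataFrame(1.0, index=idx, columns=cols)

wide = pd.concat([a, b], axis=1)  # side by side: column labels repeat
tall = pd.concat([a, b], axis=0)  # stacked: the three columns are kept

print(wide.shape)  # (3, 6)
print(tall.shape)  # (6, 3)
```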

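I would like:

1. Only three columns: "Score_1", "Score_2" and "Score_3".
2. The output indexed by t-statistic, p-value and correlation as the first index level, and
3. "Income_Quantile" as the second index level.

This is what I have in mind: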

                  Score_1  Score_2  Score_3
t-statistic 1           
            2           
            3           
            4           
            5           
p-value     1           
            2           
            3           
            4           
            5           
correlation 1           
            2           
            3           
            4           
            5

Any idea how to combine the function's outputs as required?

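Comments:

Can you add the functions ttest_ind and spearmanr to the question?

They are scipy functions: from scipy.stats import ttest_ind, from scipy.stats import spearmanr

Thanks. Does the order of the first level of the MultiIndex matter? Or can it be used as in my answer?

Answer 1: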

I think it is best to use GroupBy.apply:

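import pandas as pd
from scipy.stats import ttest_ind, spearmanr

cols = ['Score_1', 'Score_2', 'Score_3']

def create_list_of_scores(df):
    # One row per statistic, one column per score
    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    # spearmanr returns (correlation, p-value): [1] is the p-value, which is what the
    # 'correlation' rows of the output below contain; use [0] for the coefficient.
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result

# Apply the function per quantile, then move the statistic name to the first index level
df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0, 1).sort_index()
print(df)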
                                  Score_1       Score_2       Score_3
            Income_Quantile                                          
correlation 1                         NaN           NaN           NaN
            2                         NaN           NaN           NaN
            3                6.837722e-01  0.000000e+00  1.000000e+00
            4                4.337662e-01  6.238377e-01  4.818230e-03
            5                2.000000e-01  2.000000e-01  2.000000e-01
p-value     1                8.190692e-03  8.241377e-03  8.194933e-03
            2                5.887943e-03  5.880440e-03  5.888611e-03
            3                3.606128e-13  3.603267e-13  3.604996e-13
            4                5.584822e-14  5.587619e-14  5.586583e-14
            5                3.861801e-06  3.862192e-06  3.864736e-06
t-statistic 1                1.098143e+01  1.094719e+01  1.097856e+01
            2                1.297459e+01  1.298294e+01  1.297385e+01
            3                2.391611e+02  2.391927e+02  2.391736e+02
            4                1.090548e+02  1.090479e+02  1.090505e+02
            5                1.594605e+01  1.594577e+01  1.594399e+01
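
If the list of per-quantile results already exists, a similar layout can be built with `pd.concat` and its `keys` argument instead of GroupBy.apply. This is only a sketch under stated assumptions: the frames below are placeholders filled with dummy numbers, and the level name "statistic" and the labels in `levels` are illustrative choices, not part of the original code.

```python
import pandas as pd

cols = ["Score_1", "Score_2", "Score_3"]
idx = ["t-statistic", "p-value", "correlation"]

# Hypothetical quantile labels, one placeholder result frame per label.
levels = [1, 2]
frames = [pd.DataFrame(float(k), index=idx, columns=cols) for k in levels]

result = (
    pd.concat(frames, keys=levels, names=["Income_Quantile", "statistic"])
    .swaplevel(0, 1)  # statistic name first, quantile second
    .sort_index()     # sorts the first level alphabetically
)
print(result)
```

Because `sort_index` orders the first level alphabetically, the rows come out as correlation, p-value, t-statistic, the same order as in the answer's output rather than the order sketched in the question.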
