How can I sum specific columns in a DataFrame based on index values provided in a different df, and return the sum and a Boolean T/F?

Posted 2023-03-12


【Title】: How can I Sum specific columns in a dataframe based on index value provided in a different df. Return the sum and Boolean T/F 【Published】: 2021-12-18 01:29:37 【Question】:

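I have two pandas DataFrames. df_lst contains a list of column names and expected values, and df contains a series of data.

The column names in df_lst may change, so I use the following script to find the index of each column in df whose name matches a column name in df_lst. I am showing this code in case it is an extra step that might not be needed.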

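# Map each column name in df to its 0-based position
ind_dict = dict((k, i) for i, k in enumerate(df.columns))
# Column names that appear in both df_lst['Col_Name'] and df.columns
inter = set(df_lst['Col_Name']).intersection(df)
# Caution: a set has no guaranteed order, so this list may not line up with the rows of df_lst
df_lst['Index'] = [ind_dict[x] for x in inter]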
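One way to do this lookup without the intermediate dict and set is pandas' `Index.get_indexer`, which returns positions in the same order as the names passed in and -1 for any name that is missing. The following is only a minimal sketch: it assumes the `Col_` prefix seen in the sample `df_lst` has to be stripped before matching, and it uses a small made-up `df` in which only the column labels matter.

```python
import pandas as pd

# Small made-up frame; only the column labels matter here.
df = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'g', 'j', 'r'])
df_lst = pd.DataFrame({'Col_Name': ['Col_r', 'Col_g', 'Col_x']})

# Assumption: names in df_lst carry a 'Col_' prefix that df's columns lack.
names = df_lst['Col_Name'].str.replace('^Col_', '', regex=True)

# get_indexer keeps the order of `names` and returns -1 for missing labels,
# so each position lines up with its own row in df_lst.
df_lst['Index'] = df.columns.get_indexer(pd.Index(names))

print(df_lst)
```

The positions produced this way are 0-based, so they can be passed to `iloc` directly; a -1 marks a name that was not found and should be filtered out before selecting.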

The input for this task looks like this:

import random
import numpy as np
import pandas as pd

a = np.random.randint(12, size=(7, 11))
df = pd.DataFrame(a, ['foo','foo','bar', 'bar', 'bar', 'foo', 'foo'], ['a','b','f','g','h','j' ,'k', 'r', 's', 't', 'z'])

df_lst = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'],
                       'Expected Value': [100, 90, 122, 111],
                       'Index': [4, 6, 8, 9]})

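How can I use the new index values to look up the corresponding columns in df and sum their values, then, for each row in df_lst, return the sum together with "True" if it is greater than the expected value or "False" if it is less?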

df_out = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'],
                       'Expected Value': [100, 90, 122, 111],
                       'Index': [4, 6, 8, 9],
                       'Sum of Col': ['sum of col_g', 'sum of col_j', 'sum of col_r', 'sum of col_s'],
                       'Bool': ['True or False', 'True or False', 'True or False', 'True or False']
                       })

Ultimately, this True/False data will become part of a while loop that checks something like "while 1 or more is false do X".
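A rough sketch of how such a check could drive a loop, assuming "do X" means collecting one more row of data and re-checking. The `add_row` helper, the sample values and the `max_iter` guard are all hypothetical, and columns are looked up by label here only to keep the sketch short; the expression `df.sum() > expected` plays the role of the Bool column.

```python
import pandas as pd

df = pd.DataFrame({'g': [40, 30], 'j': [50, 45]})
expected = pd.Series({'g': 100, 'j': 90})

def add_row(frame):
    # Hypothetical stand-in for "do X": append one more row of readings.
    new = pd.DataFrame({'g': [20], 'j': [10]})
    return pd.concat([frame, new], ignore_index=True)

iterations = 0
max_iter = 10  # guard so the loop cannot run forever

# Keep going while one or more column sums are not above their expected value.
while not (df.sum() > expected).all() and iterations < max_iter:
    df = add_row(df)
    iterations += 1

print(iterations, (df.sum() > expected).to_dict())
```

Recomputing the sums inside the loop condition matters: a Bool column calculated once before the loop would never change, no matter what "X" does to df.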

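【Comments】:

Please don't describe your data in words. Instead, provide all the information needed to understand your problem clearly and to let us reproduce it. That means making an example that shows a sample of df and df_lst together with the expected output (explained). See how to make good pandas reproducible examples. Help us help you.

OK, I'll update the post. Thanks for the advice :)

【Answer 1】: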

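We can use the values in df_lst['Index'] with iloc to select columns from df; we need to subtract 1 to convert the 1-based index into a 0-based one. Then sum the columns and join the result back onto the DataFrame. Finally, we can compute the Bool column from the new Sum of Col values: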

df_out = df_lst.join(
    df.iloc[:, df_lst['Index'] - 1].sum()
        .add_prefix('Col_')
        .rename('Sum of Col'),
    on='Col_Name'
)

df_out['Bool'] = df_out['Sum of Col'] > df_out['Expected Value']

df_out:

  Col_Name  Expected Value  Index  Sum of Col   Bool
0    Col_g             100      4         106   True
1    Col_j              90      6          97   True
2    Col_r             122      8          95  False
3    Col_s             111      9         113   True
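If the positions in Index are only a stepping stone, a label-based variant can skip them altogether. This is a sketch of a different technique from the iloc approach above, not a replacement for it: it assumes every Col_Name is a column label of df with a `Col_` prefix added, and the sample numbers are made up.

```python
import pandas as pd

df = pd.DataFrame({'g': [60, 46], 'h': [1, 1], 'j': [50, 30]})
df_lst = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j'],
                       'Expected Value': [100, 90]})

# Assumption: stripping the 'Col_' prefix yields the label used in df.
labels = df_lst['Col_Name'].str.replace('^Col_', '', regex=True)

# df.sum() is a Series indexed by column label; map looks up each label in it.
df_out = df_lst.copy()
df_out['Sum of Col'] = labels.map(df.sum())
df_out['Bool'] = df_out['Sum of Col'] > df_out['Expected Value']

print(df_out)
```

A label that does not exist in df would come back as NaN from `map`, and the comparison for that row would then be False, so missing names are worth checking for separately.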

Steps:

Select the columns with iloc. Note that integer positions start at 0, so column g is at position 3 rather than 4:

df.iloc[:, df_lst['Index'] - 1]

      g   j   r   s
foo   0   7  14  16
foo  23  13  12  12
bar   5  13   3  16
bar  17  13  24  16
bar  24  14  11  23
foo  17  19  24  17
foo  20  18   7  13

Sum the columns with sum:

df.iloc[:, df_lst['Index'] - 1].sum()
g    106
j     97
r     95
s    113
dtype: int64

Use add_prefix so that the labels match the Col_Name column, and rename the Series so that the new column gets the correct name:

df.iloc[:, df_lst['Index'] - 1].sum().add_prefix('Col_').rename('Sum of Col')

Col_g    106
Col_j     97
Col_r     95
Col_s    113
Name: Sum of Col, dtype: int64

join with df_lst:

df_lst.join(
    df.iloc[:, df_lst['Index'] - 1].sum()
        .add_prefix('Col_')
        .rename('Sum of Col'),
    on='Col_Name'
)

  Col_Name  Expected Value  Index  Sum of Col
0    Col_g             100      4         106
1    Col_j              90      6          97
2    Col_r             122      8          95
3    Col_s             111      9         113

Do whatever comparison is needed and add any other columns:

df_out['Bool'] = df_out['Sum of Col'] > df_out['Expected Value']

  Col_Name  Expected Value  Index  Sum of Col   Bool
0    Col_g             100      4         106   True
1    Col_j              90      6          97   True
2    Col_r             122      8          95  False
3    Col_s             111      9         113   True
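The steps above can be collected into one helper, which makes it easy to re-run the check whenever df changes. This is only a sketch: the function name `sum_and_compare` is hypothetical, and the sample frame reuses the g, j, r and s values printed earlier while filling the remaining columns with zeros purely as placeholders.

```python
import numpy as np
import pandas as pd

def sum_and_compare(df, df_lst):
    """Sum the columns of df named by df_lst['Index'] (1-based positions)."""
    positions = df_lst['Index'].to_numpy() - 1  # convert to 0-based for iloc
    sums = (df.iloc[:, positions].sum()
              .add_prefix('Col_')      # labels now match df_lst['Col_Name']
              .rename('Sum of Col'))   # becomes the new column's name
    out = df_lst.join(sums, on='Col_Name')
    out['Bool'] = out['Sum of Col'] > out['Expected Value']
    return out

cols = ['a', 'b', 'f', 'g', 'h', 'j', 'k', 'r', 's', 't', 'z']
df = pd.DataFrame(np.zeros((7, 11), dtype=int), columns=cols)  # zero filler
df['g'] = [0, 23, 5, 17, 24, 17, 20]
df['j'] = [7, 13, 13, 13, 14, 19, 18]
df['r'] = [14, 12, 3, 24, 11, 24, 7]
df['s'] = [16, 12, 16, 16, 23, 17, 13]

df_lst = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'],
                       'Expected Value': [100, 90, 122, 111],
                       'Index': [4, 6, 8, 9]})

df_out = sum_and_compare(df, df_lst)
print(df_out)
```

Because the helper returns a fresh frame each time, `df_out['Bool'].all()` can be evaluated after every call to decide whether more work is needed.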

Reproducible setup:

import pandas as pd
from numpy.random import Generator, MT19937

rng = Generator(MT19937(25))
a = rng.integers(25, size=(7, 11))
df = pd.DataFrame(a, ['foo', 'foo', 'bar', 'bar', 'bar', 'foo', 'foo'],
                  ['a', 'b', 'f', 'g', 'h', 'j', 'k', 'r', 's', 't', 'z'])

df_lst = pd.DataFrame({'Col_Name': ['Col_g', 'Col_j', 'Col_r', 'Col_s'],
                       'Expected Value': [100, 90, 122, 111],
                       'Index': [4, 6, 8, 9]})

【Comments】:

Thanks for breaking down each step. That really helps while I'm still learning! I'll build this into my script and then try wrapping it in a while loop.
