Assign value to pandas DataFrame with hierarchical index based on stacked condition

Posted 2023-03-12

Asked: 2022-01-19 14:43:53

Question:

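I have a pandas DataFrame with a two-level hierarchical index. I want to set a value on one subset of rows based on a condition computed from another subset.

I think this is best explained with a small example: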

import numpy as np
import pandas as pd

example = pd.DataFrame({'ind_1': 5*[0] + 5*[1], 'ind_2': np.concatenate([np.arange(5), np.arange(5)]),
                        'col_1': np.random.random(size=10), 'col_2': np.random.random(size=10)})
example = example.set_index(['ind_1', 'ind_2'])
example_0 = example.loc[0]
example_1 = example.loc[1]
example['condition'] = False

condition = example_1['col_1'] > 0.5

This gives the following DataFrames:

$ example
                col_1     col_2  condition
ind_1 ind_2                               
0     0      0.430966  0.064335      False
      1      0.631710  0.313696      False
      2      0.354766  0.479626      False
      3      0.548612  0.793249      False
      4      0.144033  0.352583      False
1     0      0.586365  0.578001      False
      1      0.306403  0.399591      False
      2      0.312621  0.439042      False
      3      0.010637  0.232054      False
      4      0.762034  0.293433      False

$ example_0
          col_1     col_2
ind_2                    
0      0.430966  0.064335
1      0.631710  0.313696
2      0.354766  0.479626
3      0.548612  0.793249
4      0.144033  0.352583

$ example_1
          col_1     col_2
ind_2                    
0      0.586365  0.578001
1      0.306403  0.399591
2      0.312621  0.439042
3      0.010637  0.232054
4      0.762034  0.293433

$ condition
ind_2
0     True
1    False
2    False
3    False
4     True

Now I want to assign the value as follows:

example.loc[0].loc[condition] = True

This (rightly) raises a SettingWithCopyWarning, and in more complex cases it does not work at all.

The expected output is:

$ example
                col_1     col_2  condition
ind_1 ind_2                               
0     0      0.430966  0.064335       True
      1      0.631710  0.313696      False
      2      0.354766  0.479626      False
      3      0.548612  0.793249      False
      4      0.144033  0.352583       True
1     0      0.586365  0.578001      False
      1      0.306403  0.399591      False
      2      0.312621  0.439042      False
      3      0.010637  0.232054      False
      4      0.762034  0.293433      False

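So the condition column is set for ind_1 == 0. Note, however, that the condition itself is computed from the rows with ind_1 == 1.

What is the cleanest way to do this?

Comments: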

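Can you show your expected output? You could assign the condition array directly to the column (df["condition"] = df["col_1"] > 0.5)

Just edited the question to clarify. Your suggestion doesn't work, because I don't want to change the df for ind_1 == 1.

Ah, I see. If you want to use an index as data, I'd suggest not putting those columns in the index. In this case, df["condition"] = (df.index.get_level_values('ind_1') == 0) & (df["col_1"] > 0.5) should do it (and it should be clear how much easier this would be if ind_1 were a normal column).

Thanks for the suggestion, but I still don't think it will work. Maybe to clarify: I want to set condition for ind_1 == 0, but the condition itself is computed for ind_1 == 1. So I am not looking at col_1 itself, but at col_1 where ind_1 == 1.

Yes, which is what the & operator takes care of.

Answer 1: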

You can reindex condition and then pass the underlying numpy array:

example.loc[0, 'condition'] = condition.reindex(example.loc[0].index).values

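Note that you should not use chained indexing for assignment, i.e. .loc[].loc[]; use .loc[ind, column] instead.

Output (from a fresh run, so the random values differ from those in the question):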

                col_1     col_2  condition
ind_1 ind_2                               
0     0      0.295983  0.241758      False
      1      0.707799  0.765772       True
      2      0.822369  0.062530       True
      3      0.816543  0.621883      False
      4      0.048521  0.738549       True
1     0      0.433304  0.527344      False
      1      0.727886  0.557176      False
      2      0.653163  0.686719      False
      3      0.020094  0.887114      False
      4      0.777072  0.506128      False
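To make the result reproducible, here is a minimal sketch that combines the question's setup with the assignment from this answer. It hardcodes the col_1 values printed in the question instead of drawing random numbers, and it leaves out col_2; both choices are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# col_1 values copied from the question's printed DataFrame (not random here)
col_1 = [0.430966, 0.631710, 0.354766, 0.548612, 0.144033,
         0.586365, 0.306403, 0.312621, 0.010637, 0.762034]

example = pd.DataFrame({
    'ind_1': 5 * [0] + 5 * [1],
    'ind_2': np.concatenate([np.arange(5), np.arange(5)]),
    'col_1': col_1,
}).set_index(['ind_1', 'ind_2'])
example['condition'] = False

# The condition is computed on the ind_1 == 1 block, indexed by ind_2 only
condition = example.loc[1]['col_1'] > 0.5

# Single .loc call with (row key, column): writes into `example` itself.
# reindex lines the condition up with the ind_2 labels of the ind_1 == 0 block,
# and .values drops the index so the assignment is purely positional.
example.loc[0, 'condition'] = condition.reindex(example.loc[0].index).values

print(example.loc[0]['condition'].tolist())  # [True, False, False, False, True]
print(example.loc[1]['condition'].tolist())  # [False, False, False, False, False]
```

With these values the ind_1 == 0 rows are flagged at ind_2 == 0 and ind_2 == 4, which matches the expected output in the question, and the ind_1 == 1 rows stay False.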

Comments:

Oh, that makes sense. In this example, a simple example.loc[0, 'condition'] = condition.values also works. Thanks :)
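The shortcut with condition.values relies on both blocks listing their ind_2 labels in the same order. The sketch below uses a small made-up frame (the names df and shuffled and all values are hypothetical) to show how the positional shortcut and the reindex version can differ once the order no longer matches.

```python
import pandas as pd

# Hypothetical 2 x 3 frame, only for illustration
idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=['ind_1', 'ind_2'])
df = pd.DataFrame({'col_1': [0.1, 0.2, 0.3, 0.9, 0.1, 0.2]}, index=idx)
df['condition'] = False

# Condition on the ind_1 == 1 block: True only at ind_2 == 0
condition = df.loc[1]['col_1'] > 0.5

# Simulate a condition whose rows arrive in a different order (ind_2 = 2, 1, 0)
shuffled = condition.iloc[::-1]

target_index = df.loc[0].index                     # ind_2 = 0, 1, 2
positional = shuffled.values                       # ignores labels entirely
aligned = shuffled.reindex(target_index).values    # restores label order first

print(positional.tolist())  # [False, False, True]
print(aligned.tolist())     # [True, False, False]

# Only the aligned version flags the intended row (ind_2 == 0)
df.loc[0, 'condition'] = aligned
```

When the two blocks are known to share the same label order, as in the question, both forms give the same result; reindex is the more defensive choice otherwise.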
