How to merge or combine one pandas DataFrame into another based on multiple conditions

Posted 2023-03-11

【Original title】: How to merge or combine 1 pandas dataframe to another one based on multiple conditions
【Posted】: 2021-10-18 16:40:27
【Question】:

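I have two DataFrames:

df1 and df2. df1 serves as a reference (lookup) table for df2. Every row of df2 has to be checked against the rows of df1, the matching value from df1 is merged into df2, and the updated df2 is the output.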

df1:

    RB  BeginDate   EndDate    Valindex0
0   00  19000100    19811231    45
1   00  19820100    19841299    47
2   00  19850100    20010699    50
3   00  20010700    99999999    39

df2:

    RB  IssueDate   gs
0   L3  19990201    8
1   00  19820101    G
2   48  19820101    G
3   50  19820101    G
4   50  19820101    G
5   00  19860101    G
6   52  19820101    G
7   53  19820101    G
8   00  19500201    G

How can I merge these two DataFrames based on the following condition:

if df1['BeginDate'] <= df2['IssueDate'] <= df1['EndDate'] and df1['RB']==df2['RB']:
    merge the value of df1['Valindex0'] to df2

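Note that the final output merges df1 into df2, because df1 acts as a reference or lookup table for df2. Each row of df2 is matched against the rows of df1, and the result is the new df2.

The output should look like this: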

df2:

    RB  IssueDate   gs  Valindex0
0   L3  19990201    8   None
1   00  19820101    G   47    # df2['RB']==df1['RB'] and df2['IssueDate'] between df1['BeginDate'] and df1['EndDate'] of this row
2   48  19820101    G   None
3   50  19820101    G   None
4   50  19820101    G   None
5   00  19860101    G   50
6   52  19820101    G   None
7   53  19820101    G   None
8   00  19500201    G   45

I know one way to do this, but it is very slow, especially when df1 is long:

import numpy as np

conditions = []

for index, row in df1.iterrows():
    # one boolean mask over df2 for each row of df1
    conditions.append((df2['IssueDate'] >= row['BeginDate']) &
                      (df2['IssueDate'] <= row['EndDate']) &
                      (df2['RB'] == row['RB']))

df2['Valindex0'] = np.select(conditions, df1['Valindex0'], default=None)

Is there a faster solution?

【Answer 1】:

Use an IntervalIndex:

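import pandas as pd

# closed intervals [BeginDate, EndDate], one per row of df1
idx = pd.IntervalIndex.from_arrays(df1['BeginDate'],df1['EndDate'],closed='both')
for x in df1['RB'].unique():
    mask = df2['RB']==x
    # get_indexer returns the position of the interval containing each IssueDate
    df2.loc[mask, 'Valindex0'] = df1.loc[idx.get_indexer(df2.loc[mask, 'IssueDate']), 'Valindex0'].values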

Output:

   RB  IssueDate gs  Valindex0
0  L3   19990201  8        NaN
1  00   19820101  G       47.0
2  48   19820101  G        NaN
3  50   19820101  G        NaN
4  50   19820101  G        NaN
5  00   19860101  G       50.0
6  52   19820101  G        NaN
7  53   19820101  G        NaN
8  00   19500201  G       45.0
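The loop above builds one IntervalIndex from all of df1 and assumes every IssueDate with a matching RB falls inside some interval. A minimal sketch of a variant that builds the intervals per RB group and leaves unmatched dates as NaN might look like the following. The helper name lookup_by_interval is hypothetical, and the sketch assumes the intervals within one RB group do not overlap (get_indexer raises on overlapping intervals).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "RB": ["00", "00", "00", "00"],
    "BeginDate": [19000100, 19820100, 19850100, 20010700],
    "EndDate": [19811231, 19841299, 20010699, 99999999],
    "Valindex0": [45, 47, 50, 39],
})
df2 = pd.DataFrame({
    "RB": ["L3", "00", "48", "50", "50", "00", "52", "53", "00"],
    "IssueDate": [19990201, 19820101, 19820101, 19820101, 19820101,
                  19860101, 19820101, 19820101, 19500201],
    "gs": ["8", "G", "G", "G", "G", "G", "G", "G", "G"],
})


def lookup_by_interval(df1, df2):
    """Hypothetical helper: Valindex0 for each df2 row, NaN where nothing matches."""
    result = pd.Series(np.nan, index=df2.index, dtype="float64")
    for rb, ref in df1.groupby("RB"):
        mask = df2["RB"] == rb
        if not mask.any():
            continue
        # Closed intervals built only from the df1 rows of this RB value.
        idx = pd.IntervalIndex.from_arrays(ref["BeginDate"], ref["EndDate"], closed="both")
        # Position of the containing interval, or -1 when no interval contains the date.
        pos = idx.get_indexer(df2.loc[mask, "IssueDate"].to_numpy())
        values = ref["Valindex0"].to_numpy(dtype="float64")
        out = np.full(len(pos), np.nan)
        found = pos >= 0
        out[found] = values[pos[found]]
        result.loc[mask] = out
    return result


df2["Valindex0"] = lookup_by_interval(df1, df2)
print(df2)
```

The explicit check for -1 is the main design choice here: a date outside every interval is reported as NaN instead of being used as a row label.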

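【Comments】:

Thanks for your reply. What if I have more columns to check, for example a column named "type" that exists in both DataFrames and must match just like RB (df1['type'] == df2['type'])? Can I use for x in df1[['RB','type']].unique(): ...?

【Answer 2】: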

If you don't care about the order of the records or the index, you can do this:

import pandas as pd

df1 = pd.read_clipboard()
df2 = pd.read_clipboard()

# read_clipboard parsed this column as int ('00' became 0), so restore the string
df1['RB'] = '00'

# full outer join on RB
new_df = df2.merge(df1, how='outer', on='RB')

# keep rows whose IssueDate is in range, plus rows that found no df1 match
mask = ((new_df['BeginDate'] <= new_df['IssueDate']) & (new_df['IssueDate'] <= new_df['EndDate'])
       ) | new_df['Valindex0'].isnull()

new_df.loc[mask, ['RB', 'IssueDate', 'gs', 'Valindex0']]

The idea is to do a full outer join first and then filter the resulting dataset.

Result:

    RB  IssueDate   gs  Valindex0
0   L3  19990201    8   NaN
2   00  19820101    G   47.0
7   00  19860101    G   50.0
9   00  19500201    G   45.0
13  48  19820101    G   NaN
14  50  19820101    G   NaN
15  50  19820101    G   NaN
16  52  19820101    G   NaN
17  53  19820101    G   NaN
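If the original row order and index of df2 do matter, the same join-then-filter idea can be adapted. The following is only a sketch under stated assumptions: the helper column _row is a hypothetical name introduced here to track row positions, and when several df1 rows match one df2 row, the first match is kept.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "RB": ["00", "00", "00", "00"],
    "BeginDate": [19000100, 19820100, 19850100, 20010700],
    "EndDate": [19811231, 19841299, 20010699, 99999999],
    "Valindex0": [45, 47, 50, 39],
})
df2 = pd.DataFrame({
    "RB": ["L3", "00", "48", "50", "50", "00", "52", "53", "00"],
    "IssueDate": [19990201, 19820101, 19820101, 19820101, 19820101,
                  19860101, 19820101, 19820101, 19500201],
    "gs": ["8", "G", "G", "G", "G", "G", "G", "G", "G"],
})

# Tag each df2 row with its position so matches can be mapped back later.
left = df2.assign(_row=np.arange(len(df2)))

# Pair every df2 row with every df1 row that shares the same RB.
candidates = left.merge(df1, on="RB", how="inner")

# Keep only the pairs whose IssueDate lies inside [BeginDate, EndDate].
in_range = candidates["IssueDate"].between(candidates["BeginDate"], candidates["EndDate"])
matched = (candidates.loc[in_range]
           .drop_duplicates("_row")          # first match wins if several apply
           .set_index("_row")["Valindex0"])

# Realign to df2's original row order; rows without a match become NaN.
result = df2.copy()
result["Valindex0"] = matched.reindex(np.arange(len(df2))).to_numpy()
print(result)
```

Because merge accepts a list for on, additional equality keys such as a shared "type" column could be handled with on=["RB", "type"]. The intermediate table grows with the number of df1 rows per key, so this trades memory for the absence of a Python loop.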

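【Comments】:

Thanks for your reply. I have a question: can I avoid defining df1['RB'] = '00'? RB is a dynamic value that I cannot hard-code. It is not always 00 and can also be 48, 52, or any other number.

Hi @William, you don't need to define df1['RB'] = '00'. When I copied your dataset, pandas cast that column to int, which changed every value from '00' to 0. That line was only there to reproduce your dataset.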
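The type-inference issue described in this reply can be avoided when loading the data. As a small illustration (the inline text below is a shortened copy of df1, and the variable names are only for this sketch), passing dtype for the RB column keeps the leading zeros. read_clipboard forwards keyword arguments to read_csv, so the same dtype argument should work there as well.

```python
import io

import pandas as pd

raw = """RB BeginDate EndDate Valindex0
00 19000100 19811231 45
00 19820100 19841299 47
"""

# Default inference: the RB column is parsed as integers, so '00' becomes 0.
inferred = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Forcing RB to be read as text keeps the values exactly as written.
as_text = pd.read_csv(io.StringIO(raw), sep=r"\s+", dtype={"RB": str})

print(inferred["RB"].tolist())
print(as_text["RB"].tolist())
```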
