pandas left join and update existing column

Posted 2023-03-12

【Title】: pandas left join and update existing column 【Posted】: 2021-01-03 07:27:56 【Question】:

I am new to pandas and can't seem to get the merge function working:

>>> left       >>> right
   a  b   c       a  c   d 
0  1  4   9    0  1  7  13
1  2  5  10    1  2  8  14
2  3  6  11    2  3  9  15
3  4  7  12

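Using a left join on column a, I would like to update the common columns by the joining key. Note that the last value in column c comes from the LEFT table, since there is no match for it.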

>>> final       
   a  b   c   d 
0  1  4   7   13
1  2  5   8   14
2  3  6   9   15
3  4  7   12  NAN

How should I do this with the pandas merge function? Thank you.

【Comments on the question】:

【Answer 1】:

Here is another way to do it, using combine_first()

right.set_index('a').combine_first(left.set_index('a')).reset_index()
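
To see what this one-liner does, here is a minimal, self-contained sketch. The code that builds left and right is an assumption; the question only shows their printed form. combine_first keeps the caller's values and fills the gaps from its argument, so right takes priority and left supplies whatever right lacks.

```python
import pandas as pd

# Frames from the question, rebuilt by hand (assumed construction).
left = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 5, 6, 7], 'c': [9, 10, 11, 12]})
right = pd.DataFrame({'a': [1, 2, 3], 'c': [7, 8, 9], 'd': [13, 14, 15]})

# Values from right win; left fills in the rows and columns that right lacks.
final = right.set_index('a').combine_first(left.set_index('a')).reset_index()
print(final)
```

Note that combine_first returns the union of both indexes, so a key that exists only in right would also show up in the result. That is not strictly a left join, although it makes no difference for the data in the question.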

【Comments】:

【Answer 2】:

Another way is to use pd.merge like this:

 >>> import pandas as pd

 >>> final = pd.merge(left=right, right=left, 
                      how='outer',
                      left_index=True,
                      right_index=True,
                      on=('a', 'c')
                     ).sort_index(axis=1)

 >>> final       
    a  b   c   d 
 0  1  4   7   13.0
 1  2  5   8   14.0
 2  3  6   9   15.0
 3  4  7   12  NaN

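Pass the intersection of the two dataframes' columns to the function's "on=" argument.

Unlike Zero's solution, this does not create unwanted columns that have to be dropped afterwards.

The NaN values may change integers to floats in the same column.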

Edit: this works for pandas version […]
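
The integer-to-float change mentioned above can be reproduced in isolation. The following is a small sketch, not part of the original answer: it shows a column picking up a NaN and becoming float64, and then one possible way to get whole numbers back by casting to pandas' nullable Int64 dtype. The variable names are made up for illustration.

```python
import pandas as pd

d = pd.Series([13, 14, 15], name='d')    # plain integer column
d_with_gap = d.reindex(range(4))          # row 3 has no value, so it becomes NaN
print(d_with_gap.dtype)                   # float64, because NaN is a float

# Nullable integer dtype: keeps whole numbers and shows the gap as <NA>.
d_nullable = d_with_gap.astype('Int64')
print(d_nullable.dtype)                   # Int64
```

Whether the nullable dtype is appropriate depends on what the downstream code expects, since <NA> does not behave exactly like NaN.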

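【Comments】:

Hey, your solution was the only one that worked for me; I have thousands of columns. It saved my day, and maybe my life too. Thanks. One question: why on=('a', 'c')?

@MarceloGazzola 'a' because the column values are the same in the left and right frames. Merging on it drops one of the two copies. 'c' because the OP wants to update the rows of the left frame with the values of the right frame's 'c' column. There is no need to add 'b' and 'd', since neither of them is present in both frames.

【Answer 3】: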

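DataFrame.update() is nice, but it doesn't let you specify which columns to join on and, more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-NaN values in the original dataframe. To me, this is undesirable behavior.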
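
The NaN behavior described here is easy to check on a toy frame. This is a minimal sketch with made-up data, not taken from the answer:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({'x': [1.0, 2.0, 3.0]})
other = pd.DataFrame({'x': [10.0, np.nan, 30.0]})

# update() works in place and aligns on the index, not on a chosen column.
original.update(other)
print(original)
```

Row 1 keeps its old value 2.0, because the NaN in other is skipped rather than written.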

Here is a custom method I rolled to fix these issues. It is freshly written, so users beware..

join_insertion()

import numpy as np
import pandas as pd


def join_insertion(into_df, from_df, cols, on, by=None, direction=None, mult='error'):
    """
    Suppose A and B are dataframes. A has columns foo, bar, baz and B has columns foo, baz, buz
    This function allows you to do an operation like:
    "where A and B match via the column foo, insert the values of baz and buz from B into A"
    Note that this'll update A's values for baz and it'll insert buz as a new column.
    This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!

    Optionally, direction can be given as 'backward', 'forward', or 'nearest' to implement a rolling join
    insertion. forward means 'roll into_df values forward to match from_df values', etc. Additionally,
    when doing a rolling join, 'on' should be the roll column and 'by' should be the exact-match columns.
    See pandas.merge_asof() for details.

    Note that 'mult' gets ignored when doing a rolling join. In the case of a rolling join, the first
    appearing record is kept, even if two records match a key from the same distance. Perhaps this
    can be improved...

    :param into_df: dataframe you want to modify
    :param from_df: dataframe with the values you want to insert
    :param cols: list of column names (values to insert)
    :param on: list of column names (values to join on), or a dict of into:from column name pairs
    :param by: same format as on; when doing a rolling join insertion, what columns to exact-match on
    :param direction: 'forward', 'backward', or 'nearest'. forward means roll into_df values to match from_df
    :param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
    an error can be raised, or the first matching value can be inserted, or the last matching value
    can be inserted
    :return: a modified copy of into_df, with updated values using from_df
    """

    # Infer left_on, right_on
    if (isinstance(on, dict)):
        left_on = list(on.keys())
        right_on = list(on.values())
    elif(isinstance(on, list)):
        left_on = on
        right_on = on
    elif(isinstance(on, str)):
        left_on = [on]
        right_on = [on]
    else:
        raise Exception("on should be a string, list, or dictionary")

    # Infer left_by, right_by
    if(by is not None):
        if (isinstance(by, dict)):
            left_by = list(by.keys())
            right_by = list(by.values())
        elif (isinstance(by, list)):
            left_by = by
            right_by = by
        elif (isinstance(by, str)):
            left_by = [by]
            right_by = [by]
        else:
            raise Exception("by should be a string, list, or dictionary")
    else:
        left_by = None
        right_by = None

    # Make cols a list if it isn't already
    if(isinstance(cols, str)):
        cols = [cols]

    # Setup
    A = into_df.copy()
    B = from_df[right_on + cols + ([] if right_by is None else right_by)].copy()

    # Insert row ids
    A['_A_RowId_'] = np.arange(A.shape[0])
    B['_B_RowId_'] = np.arange(B.shape[0])

    # Merge
    if(direction is None):
        A = pd.merge(
            left=A,
            right=B,
            how='left',
            left_on=left_on,
            right_on=right_on,
            suffixes=(None, '_y'),
            indicator=True
        ).sort_values(['_A_RowId_', '_B_RowId_'])

        # Check for rows of A which got duplicated by the merge, and then handle appropriately
        if (mult == 'error'):
            if (A.groupby('_A_RowId_').size().max() > 1):
                raise Exception("At least one key of into_df matched multiple rows of from_df.")
        elif (mult == 'first'):
            A = A.groupby('_A_RowId_').first().reset_index()
        elif (mult == 'last'):
            A = A.groupby('_A_RowId_').last().reset_index()

    else:
        A.sort_values(left_on, inplace=True)
        B.sort_values(right_on, inplace=True)
        A = pd.merge_asof(
            left=A,
            right=B,
            direction=direction,
            left_on=left_on,
            right_on=right_on,
            left_by=left_by,
            right_by=right_by,
            suffixes=(None, '_y')
        ).sort_values(['_A_RowId_', '_B_RowId_'])

    # Insert values from new column(s) into pre-existing column(s)
    mask = A._merge == 'both' if direction is None else np.repeat(True, A.shape[0])
    cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
    for col in cols_in_both:
        A.loc[mask, col] = A.loc[mask, col + '_y']

    # Drop unwanted columns
    A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)

    return A

Usage example

into_df = pd.DataFrame({
    'foo': [1, 2, 3],
    'bar': [4, 5, 6],
    'baz': [7, 8, 9]
})
   foo  bar  baz
0    1    4    7
1    2    5    8
2    3    6    9

from_df = pd.DataFrame({
    'foo': [1, 3, 5, 7, 3],
    'baz': [70, 80, 90, 30, 40],
    'buz': [0, 1, 2, 3, 4]
})
   foo  baz  buz
0    1   70    0
1    3   80    1
2    5   90    2
3    7   30    3
4    3   40    4

# Use it!

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
  Exception: At least one key of into_df matched multiple rows of from_df.

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  80.0  1.0

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  40.0  4.0
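
For readers who only need the exact-match case with unique keys, the core idea of the function can be condensed into a few lines. This is a simplified sketch under those assumptions, not a replacement for join_insertion(), and the data is made up so that from_df contains a NaN: merge with indicator=True, then overwrite the shared column only on rows that matched, NaN included.

```python
import numpy as np
import pandas as pd

into_df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6], 'baz': [7.0, 8.0, 9.0]})
from_df = pd.DataFrame({'foo': [1, 3], 'baz': [70.0, np.nan], 'buz': [0, 1]})

# Left join; the right-hand copy of the shared column gets the '_y' suffix.
merged = into_df.merge(from_df, on='foo', how='left',
                       suffixes=(None, '_y'), indicator=True)

# Overwrite only where the key matched, so a NaN coming from from_df is kept.
matched = merged['_merge'] == 'both'
merged.loc[matched, 'baz'] = merged.loc[matched, 'baz_y']

result = merged.drop(columns=['baz_y', '_merge'])
print(result)
```

Here the row with foo == 3 ends up with NaN in baz, which is exactly the case DataFrame.update() would have skipped, while the unmatched row with foo == 2 keeps its original 8.0.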

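By the way, this is one of those things I sorely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]

【Comments】:

【Answer 4】:

One way to do this is to set the a column as the index and use update: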

In [11]: left_a = left.set_index('a')

In [12]: right_a = right.set_index('a')

Note: update only does a left join (not a merge), so in addition to set_index you also need to add the columns that are not present in left_a.

In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))

In [14]: res.update(right_a)

In [15]: res.reset_index(inplace=True)

In [16]: res
Out[16]:
   a   b   c   d
0  1   4   7  13
1  2   5   8  14
2  3   6   9  15
3  4   7  12 NaN
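
The session above can be wrapped into a small function so that it runs from a plain script. This is a sketch only: the helper name left_update is hypothetical, and the frames are rebuilt by hand from the question.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 5, 6, 7], 'c': [9, 10, 11, 12]})
right = pd.DataFrame({'a': [1, 2, 3], 'c': [7, 8, 9], 'd': [13, 14, 15]})


def left_update(left, right, key):
    """Hypothetical helper: update left with right's values, matching rows on key."""
    left_k = left.set_index(key)
    right_k = right.set_index(key)
    # Add the columns that exist only in right, so update() has somewhere to write.
    res = left_k.reindex(columns=left_k.columns.union(right_k.columns))
    res.update(right_k)          # in place, aligned on the index
    return res.reset_index()


res = left_update(left, right, 'a')
print(res)
```

Because update() skips NaN values in right, this helper can never blank out an existing value in left.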

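【Comments】:

A warning to anyone implementing this solution: in some cases the integer dtype gets changed to float! ***.com/questions/17398216/…

Warning: adding new columns with loc is deprecated. Use reindex instead, like this: left_a.reindex(columns=left_a.columns.union(right_a.columns)).

You're welcome. You gave me the answer I needed, and I built on it ;)

Warning: according to the docs, DataFrame.update() does not overwrite non-NaN values with NaN values.

【Answer 5】:

You can use merge() between left and right with how='left' on the 'a' column.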

In [74]: final = left.merge(right, on='a', how='left')

In [75]: final
Out[75]:
   a  b  c_x  c_y   d
0  1  4    9    7  13
1  2  5   10    8  14
2  3  6   11    9  15
3  4  7   12  NaN NaN

Replace the NaN values in c_y with the c_x values

In [76]: final['c'] = final['c_y'].fillna(final['c_x'])

In [77]: final
Out[77]:
   a  b  c_x  c_y   d   c
0  1  4    9    7  13   7
1  2  5   10    8  14   8
2  3  6   11    9  15   9
3  4  7   12  NaN NaN  12

Drop the unwanted columns, and you have the result

In [79]: final.drop(['c_x', 'c_y'], axis=1)
Out[79]:
   a  b   d   c
0  1  4  13   7
1  2  5  14   8
2  3  6  15   9
3  4  7 NaN  12
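
The three steps above can be run as one script. The sketch below is a variation, not the answer's exact code: it uses explicit suffixes instead of the default _x and _y, restores the column order of the question, and casts c back to the dtype it had in left. The frames are rebuilt by hand from the question.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 5, 6, 7], 'c': [9, 10, 11, 12]})
right = pd.DataFrame({'a': [1, 2, 3], 'c': [7, 8, 9], 'd': [13, 14, 15]})

final = left.merge(right, on='a', how='left', suffixes=('_left', '_right'))

# Prefer the value from right; fall back to left where there was no match.
# After fillna there are no gaps left in c, so the cast back to integer is safe here.
final['c'] = final['c_right'].fillna(final['c_left']).astype(left['c'].dtype)

final = final.drop(columns=['c_left', 'c_right'])[['a', 'b', 'c', 'd']]
print(final)
```

The cast only works when left has no missing values in c; otherwise the column has to stay float or use a nullable dtype.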

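【Comments】:

That fillna (with a different column) is really neat!

I prefer this approach over the accepted answer because it does not rely on the two DataFrames having a common join key variable ('a' in this example).

I consistently get this error when using this code: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. My only thought is that my dfs might not share the same columns? Isn't that what the original answer was supposed to correct?

This answer is the most pythonic approach.

【Answer 6】:

Here is a way to do it using join: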

In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')

In [633]: t
Out[633]: 
   b   c  c_right   d
a                    
1  4   9        7  13
2  5  10        8  14
3  6  11        9  15
4  7  12      NaN NaN

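Now we want to fill the null values of c_right (which comes from the right dataframe) with the values of the c column from the left dataframe. The process below was updated using the approach taken from @John Galt's answer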

In [657]: t['c_right'] = t['c_right'].fillna(t['c'])

In [658]: t
Out[658]: 
   b   c  c_right   d
a                    
1  4   9        7  13
2  5  10        8  14
3  6  11        9  15
4  7  12       12 NaN

In [659]: t.drop('c_right', axis=1)
Out[659]: 
   b   c   d
a           
1  4   9  13
2  5  10  14
3  6  11  15
4  7  12 NaN

【Comments】:

This is a bit odd, because you are updating t['c_right'] but then dropping it. I think you want t['c'] = t['c_right'].fillna(t['c']), and then drop t['c_right'].
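
Applying the correction suggested in this comment gives the result the question asks for. The following is a minimal sketch of that corrected version, with the frames rebuilt by hand from the question:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 5, 6, 7], 'c': [9, 10, 11, 12]})
right = pd.DataFrame({'a': [1, 2, 3], 'c': [7, 8, 9], 'd': [13, 14, 15]})

# join() aligns on the index and is a left join by default.
t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')

# Write the merged values into c (not c_right), then drop the helper column.
t['c'] = t['c_right'].fillna(t['c'])
t = t.drop(columns='c_right').reset_index()
print(t)
```

With this change, column c holds 7, 8, 9 from right and keeps 12 from left for the unmatched key.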
