pandas left join and update existing column
[Title]: pandas left join and update existing column [Posted]: 2021-01-03 07:27:56 [Question]: I am new to pandas and can't seem to get the merge function to work:
>>> left
   a  b   c
0  1  4   9
1  2  5  10
2  3  6  11
3  4  7  12
>>> right
   a  c   d
0  1  7  13
1  2  8  14
2  3  9  15
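Using a left join on column a, I want to update the columns the two frames have in common, matched by the join key. Note that the last value in column c comes from the LEFT table, because there is no match for it.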
>>> final
   a  b   c    d
0  1  4   7   13
1  2  5   8   14
2  3  6   9   15
3  4  7  12  NaN
How should I do this with the pandas merge function? Thanks.
[Comments on the question]:
[Answer 1]: Here is another approach, using combine_first():
right.set_index('a').combine_first(left.set_index('a')).reset_index()
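To make the one-liner above easier to try, here is a minimal runnable sketch. It assumes the two frames look exactly like the tables in the question; the constructor calls are a reconstruction, not code from the original post.

```python
import pandas as pd

# Frames reconstructed from the tables in the question (an assumption).
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Values from `right` take priority; gaps are filled from `left`,
# with rows aligned on the shared index 'a'.
final = right.set_index("a").combine_first(left.set_index("a")).reset_index()
print(final)
```

Because alignment introduces missing values, some integer columns may come back as floats.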
[Comments]:
[Answer 2]: Another way is to use pd.merge like this:
>>> import pandas as pd
>>> final = pd.merge(left=right, right=left,
...                  how='outer',
...                  left_index=True,
...                  right_index=True,
...                  on=('a', 'c')
...                  ).sort_index(axis=1)
>>> final
   a  b   c     d
0  1  4   7  13.0
1  2  5   8  14.0
2  3  6   9  15.0
3  4  7  12   NaN
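Pass the intersection of the two dataframes' columns to the function's "on=" argument.
Unlike Zero's solution, this does not create unwanted columns that then have to be dropped.
The NaN values may change integers to floats within the same column.
Edit: this worked with pandas version
[Comments]: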
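Hey, your solution is the only one that worked for me, and I have thousands of columns. It saved my day, and maybe my life too. Thanks.
Question: why on=('a', 'c')?
@MarceloGazzola 'a' because its column values are the same in the left and right frames, so merging on it drops one of the two copies. 'c' because the OP wants to update the rows of the left frame with the values from the right frame's 'c' column. There is no need to add 'b' and 'd', since neither exists in both frames.
[Answer 3]: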
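DataFrame.update() is nice, but it doesn't let you specify which columns to join on and, more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-NaN values in the original dataframe. To me, this is undesirable behavior.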
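A small sketch of the NaN behavior described above. The frames `original` and `other` are made up for illustration and do not come from the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, indexed by the key that update() aligns on.
original = pd.DataFrame({"key": [1, 2, 3], "val": [10.0, 20.0, 30.0]}).set_index("key")
other = pd.DataFrame({"key": [1, 2], "val": [np.nan, 99.0]}).set_index("key")

updated = original.copy()
# update() modifies in place and aligns on the index only;
# the NaN for key 1 in `other` is skipped, so 10.0 is kept.
updated.update(other)
print(updated)
```

Key 2 is overwritten with 99.0, while key 1 keeps its original value even though `other` holds NaN there.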
Here is a custom method I put together to fix these issues. It is freshly written, so users beware.
join_insertion()
import numpy as np
import pandas as pd


def join_insertion(into_df, from_df, cols, on, by=None, direction=None, mult='error'):
    """
    Suppose A and B are dataframes. A has columns foo, bar, baz and B has columns foo, baz, buz
    This function allows you to do an operation like:
    "where A and B match via the column foo, insert the values of baz and buz from B into A"
    Note that this'll update A's values for baz and it'll insert buz as a new column.
    This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!

    Optionally, direction can be given as 'backward', 'forward', or 'nearest' to implement a rolling join
    insertion. forward means 'roll into_df values forward to match from_df values', etc. Additionally,
    when doing a rolling join, 'on' should be the roll column and 'by' should be the exact-match columns.
    See pandas.merge_asof() for details.

    Note that 'mult' gets ignored when doing a rolling join. In the case of a rolling join, the first
    appearing record is kept, even if two records match a key from the same distance. Perhaps this
    can be improved...

    :param into_df: dataframe you want to modify
    :param from_df: dataframe with the values you want to insert
    :param cols: list of column names (values to insert)
    :param on: list of column names (values to join on), or a dict of into:from column name pairs
    :param by: same format as on; when doing a rolling join insertion, what columns to exact-match on
    :param direction: 'forward', 'backward', or 'nearest'. forward means roll into_df values to match from_df
    :param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
        an error can be raised, or the first matching value can be inserted, or the last matching value
        can be inserted
    :return: a modified copy of into_df, with updated values using from_df
    """

    # Infer left_on, right_on
    if isinstance(on, dict):
        left_on = list(on.keys())
        right_on = list(on.values())
    elif isinstance(on, list):
        left_on = on
        right_on = on
    elif isinstance(on, str):
        left_on = [on]
        right_on = [on]
    else:
        raise Exception("on should be a list or dictionary")

    # Infer left_by, right_by
    if by is not None:
        if isinstance(by, dict):
            left_by = list(by.keys())
            right_by = list(by.values())
        elif isinstance(by, list):
            left_by = by
            right_by = by
        elif isinstance(by, str):
            left_by = [by]
            right_by = [by]
        else:
            raise Exception("by should be a list or dictionary")
    else:
        left_by = None
        right_by = None

    # Make cols a list if it isn't already
    if isinstance(cols, str):
        cols = [cols]

    # Setup
    A = into_df.copy()
    B = from_df[right_on + cols + ([] if right_by is None else right_by)].copy()

    # Insert row ids
    A['_A_RowId_'] = np.arange(A.shape[0])
    B['_B_RowId_'] = np.arange(B.shape[0])

    # Merge
    if direction is None:
        A = pd.merge(
            left=A,
            right=B,
            how='left',
            left_on=left_on,
            right_on=right_on,
            suffixes=(None, '_y'),
            indicator=True
        ).sort_values(['_A_RowId_', '_B_RowId_'])

        # Check for rows of A which got duplicated by the merge, and then handle appropriately
        if mult == 'error':
            if A.groupby('_A_RowId_').size().max() > 1:
                raise Exception("At least one key of into_df matched multiple rows of from_df.")
        elif mult == 'first':
            A = A.groupby('_A_RowId_').first().reset_index()
        elif mult == 'last':
            A = A.groupby('_A_RowId_').last().reset_index()
    else:
        A.sort_values(left_on, inplace=True)
        B.sort_values(right_on, inplace=True)
        A = pd.merge_asof(
            left=A,
            right=B,
            direction=direction,
            left_on=left_on,
            right_on=right_on,
            left_by=left_by,
            right_by=right_by,
            suffixes=(None, '_y')
        ).sort_values(['_A_RowId_', '_B_RowId_'])

    # Insert values from new column(s) into pre-existing column(s)
    mask = A._merge == 'both' if direction is None else np.repeat(True, A.shape[0])
    cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
    for col in cols_in_both:
        A.loc[mask, col] = A.loc[mask, col + '_y']

    # Drop unwanted columns
    A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)

    return A
Usage example
into_df = pd.DataFrame({
    'foo': [1, 2, 3],
    'bar': [4, 5, 6],
    'baz': [7, 8, 9]
})
   foo  bar  baz
0    1    4    7
1    2    5    8
2    3    6    9
from_df = pd.DataFrame({
    'foo': [1, 3, 5, 7, 3],
    'baz': [70, 80, 90, 30, 40],
    'buz': [0, 1, 2, 3, 4]
})
   foo  baz  buz
0    1   70    0
1    3   80    1
2    5   90    2
3    7   30    3
4    3   40    4
# Use it!
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
Exception: At least one key of into_df matched multiple rows of from_df.
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  80.0  1.0
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  40.0  4.0
As an aside, this is one of the things I sorely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]
[Comments]:
[Answer 4]: One way is to set column a as the index and use update:
In [11]: left_a = left.set_index('a')
In [12]: right_a = right.set_index('a')
Note: update only does a left join (it does not merge), so as well as calling set_index you also need to add the columns that are not present in left_a.
In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))
In [14]: res.update(right_a)
In [15]: res.reset_index(inplace=True)
In [16]: res
Out[16]:
   a  b   c    d
0  1  4   7   13
1  2  5   8   14
2  3  6   9   15
3  4  7  12  NaN
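The session above can be collected into one self-contained script. This is a sketch that assumes the frames match the tables in the question; the resulting dtypes (int versus float) may differ between pandas versions, so the script makes no claim about them.

```python
import pandas as pd

# Frames reconstructed from the tables in the question (an assumption).
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

left_a = left.set_index("a")
right_a = right.set_index("a")

# Add the columns that only exist in `right` (here 'd') as all-NaN columns,
# because update() never creates new columns on its own.
res = left_a.reindex(columns=left_a.columns.union(right_a.columns))

# Overwrite in place wherever `right_a` has a non-NaN value for the same index label.
res.update(right_a)
res = res.reset_index()
print(res)
```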
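[Comments]:
A warning to anyone implementing this solution: in some cases integer dtypes get changed to float! ***.com/questions/17398216/…
Warning: adding new columns with loc is deprecated. Use reindex instead, like this: left_a.reindex(columns=left_a.columns.union(right_a.columns)).
You're welcome. You gave me the answer I needed and built on ;)
Warning: according to the docs, DataFrame.update() will not overwrite non-NaN values with NaN values.
[Answer 5]: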
You can use merge() between left and right with how='left' on the 'a' column.
In [74]: final = left.merge(right, on='a', how='left')
In [75]: final
Out[75]:
   a  b  c_x  c_y    d
0  1  4    9    7   13
1  2  5   10    8   14
2  3  6   11    9   15
3  4  7   12  NaN  NaN
Replace the NaN values in c_y with the c_x values
In [76]: final['c'] = final['c_y'].fillna(final['c_x'])
In [77]: final
Out[77]:
   a  b  c_x  c_y    d   c
0  1  4    9    7   13   7
1  2  5   10    8   14   8
2  3  6   11    9   15   9
3  4  7   12  NaN  NaN  12
Drop the unwanted columns and you have the result
In [79]: final.drop(['c_x', 'c_y'], axis=1)
Out[79]:
   a  b    d   c
0  1  4   13   7
1  2  5   14   8
2  3  6   15   9
3  4  7  NaN  12
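The three steps above (merge, fill, drop) can be run end to end as follows. This is a sketch that assumes the frames match the tables in the question; the final cast back to the original integer dtype is an optional extra that the answer does not include.

```python
import pandas as pd

# Frames reconstructed from the tables in the question (an assumption).
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Step 1: left join; the shared column 'c' is split into c_x (left) and c_y (right).
final = left.merge(right, on="a", how="left")

# Step 2: prefer the right-hand value, fall back to the left-hand one where it is NaN.
final["c"] = final["c_y"].fillna(final["c_x"])

# Step 3: drop the helper columns.
final = final.drop(columns=["c_x", "c_y"])

# Optional: 'c' has no NaN left, so it can be cast back to left's dtype.
final["c"] = final["c"].astype(left["c"].dtype)
print(final)
```

Column d stays float because its unmatched row holds NaN.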
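[Comments]:
That fillna (with a different column) is really neat!
I prefer this approach over the accepted answer because it doesn't depend on the two DataFrames sharing a common join key variable ('a' in this example).
I consistently get this error when using this code: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. My only thought is that my dfs may not share the same columns? Isn't that what the original answer was supposed to correct?
This answer is the most pythonic approach.
[Answer 6]: Here is one way, using join: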
In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')
In [633]: t
Out[633]:
   b   c  c_right    d
a
1  4   9        7   13
2  5  10        8   14
3  6  11        9   15
4  7  12      NaN  NaN
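Now we want to fill the null values of c_right (which comes from the right dataframe) with the values of column c from the left dataframe. The steps below were updated to use a method taken from @John Galt's answer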
In [657]: t['c_right'] = t['c_right'].fillna(t['c'])
In [658]: t
Out[658]:
   b   c  c_right    d
a
1  4   9        7   13
2  5  10        8   14
3  6  11        9   15
4  7  12       12  NaN
In [659]: t.drop('c_right', axis=1)
Out[659]:
   b   c    d
a
1  4   9   13
2  5  10   14
3  6  11   15
4  7  12  NaN
[Comments]:
This is a bit odd, because you update t['c_right'] and then drop it. I think you want t['c'] = t['c_right'].fillna(t['c']), and then drop t['c_right'].
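The correction suggested in the last comment can be sketched like this. It assumes the frames match the tables in the question, and it follows the commenter's proposal rather than the code posted in the answer.

```python
import pandas as pd

# Frames reconstructed from the tables in the question (an assumption).
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Index-based left join; the clashing right-hand column becomes 'c_right'.
t = left.set_index("a").join(right.set_index("a"), rsuffix="_right")

# Write the combined values into 'c' (not 'c_right'), then drop the helper column.
t["c"] = t["c_right"].fillna(t["c"])
t = t.drop(columns="c_right").reset_index()
print(t)
```

With this change, column c holds the values from right wherever a match exists, and the value from left (12) for the unmatched row.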