Pandas using apply lambda with two different operators

Posted 2023-04-18


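【Title】: Pandas using apply lambda with two different operators 【Posted】: 2021-11-11 16:15:27 【Question】:

This question is very similar to one I posted earlier, with just one change. Instead of only taking the absolute difference for all columns, I also want a magnitude difference for the 'z' column: if the current z is more than 1.1 times the previous z, keep the row.

(More background on the problem)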

Pandas using the previous rank values to filter out current row

import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.25, 3, 6],
})
print(df)
#    rank  x  y     z
# 0     1  0  0  1.00
# 1     1  3  4  3.00
# 2     2  0  0  1.20
# 3     2  3  4  3.25
# 4     3  4  5  3.00
# 5     3  2  5  6.00

This is the output I want:

output = pd.DataFrame({
    'rank': [1, 1, 2, 3],
    'x': [0, 3, 0, 2],
    'y': [0, 4, 0, 5],
    'z': [1, 3, 1.2, 6],
}, index=[0, 1, 2, 5])  # original row labels of the rows that are kept
print(output)
#    rank  x  y    z
# 0     1  0  0  1.0
# 1     1  3  4  3.0
# 2     2  0  0  1.2
# 5     3  2  5  6.0

Basically, what I want to happen is: if the previous rank has any rows with x, y (+- 1 both ways) and z (

So, for the rank 1 rows, any rows in rank 2 that have x = (-1-1), y = (-1-1), z= (

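【Comments】:

Could you state the filter condition more formally? Does every rank always have the same number of rows?

@onepan No, different ranks can have different numbers of rows.

【Answer 1】:

Here is a solution using numpy broadcasting: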

# Initially, no row is dropped
df['drop'] = False

for r in range(df['rank'].min(), df['rank'].max()):
    # Find the x_min, x_max, y_min, y_max, z_max of the current rank
    cond = df['rank'] == r
    x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T
    x_min, x_max = x + [[-1], [1]] # use numpy broadcasting to ±1 in one command
    y_min, y_max = y + [[-1], [1]]
    z_max        = z * 1.1

    # Find the x, y, z of the next rank. Raise them one dimension
    # so that we can make a comparison matrix against x_min, x_max, ...
    cond = df['rank'] == r + 1
    if not cond.any():
        continue
    x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T[:, :, None]

    # Condition to drop a row
    drop = (
        (x_min <= x) & (x <= x_max) &
        (y_min <= y) & (y <= y_max) &
        (z <= z_max)
    ).any(axis=1)
    df.loc[cond, 'drop'] = drop

# Result
df[~df['drop']]

Condensed

A more condensed version (probably faster). It is also a great way to confuse your future teammates when they read the code:

r, x, y, z = df[['rank', 'x', 'y', 'z']].T.to_numpy()
rr, xx, yy, zz = [col[:,None] for col in [r, x, y, z]]

drop = (
    (rr == r + 1) &
    (x-1 <= xx) & (xx <= x+1) &
    (y-1 <= yy) & (yy <= y+1) &
    (zz <= z*1.1)
).any(axis=1)

# Result
df[~drop]

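What it does is compare every row in df with every row (including itself) and return True (i.e. drop the row) when both of the following hold:

the current row's rank == the other row's rank + 1; and
the current row's x, y, z are within the specified range of the other row's x, y, z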
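
To make the pairwise comparison easier to picture, here is a minimal sketch on toy values. The arrays `rank` and `x` are hypothetical and not taken from the question's data, and only the rank and x conditions are shown. Turning one operand into a column vector makes NumPy build an N×N boolean matrix, and `any(axis=1)` collapses it to one flag per row.

```python
import numpy as np

# Hypothetical toy values, not the question's data
rank = np.array([1, 2, 2])
x = np.array([0, 0, 5])

# Column vectors: one entry per "current" row (axis 0)
rank_col = rank[:, None]
x_col = x[:, None]

# Entry [i, j] is True when row i is exactly one rank after row j
# and row i's x is within +-1 of row j's x
pairwise = (rank_col == rank + 1) & (np.abs(x_col - x) <= 1)

# A row is dropped when it matches at least one row of the previous rank
drop = pairwise.any(axis=1)
print(pairwise.astype(int))
print(drop)
```

Here only the second row is flagged: it has rank 2 and its x matches the rank 1 row, while the third row has rank 2 but an x that is too far away.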

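【Discussion】:

Thanks for your input! You showed me a new feature I had never seen before. I went with another answer because it is the most similar to what I already have.

【Answer 2】:

You need to slightly modify my previous code: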

def check_previous_group(rank, d, groups):
    if rank - 1 not in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset 
        # of the previous group: abs(d_prev-s)
        # if all differences are within 1/1/0.1*z for x/y/z
        # for at least one row of the previous group
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*s['z']]).all(1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]

Output:

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0

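【Discussion】:

Thanks for replying again! But this doesn't quite work the way I intended. If I change the z at index 3 to 3.31, the row doesn't show up in the output, even though 3.31 > 3.00*1.1.

【Answer 3】:

I have modified mozway's function so that it works the way you asked.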

# comparing 'equal' float values, may go wrong, that's why I am using this constant
DELTA=0.1**12

def check_previous_group(rank, d, groups):
    if rank - 1 not in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset 
        # of the previous group: abs(d_prev-s)
        # if differences in x and y are within 1 and z < 1.1*x
        # for at least one row of the previous group
        # then flag the row to be dropped (True)
        
        return d.apply(lambda s: (abs(d_prev-s)[['x', 'y']].le([1,1]).all(1)&
                                  (s['z']<1.1*d_prev['x']-DELTA)).any(), axis=1)
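
The DELTA constant above guards against floating-point rounding at the exact 1.1× boundary. A small sketch of the effect, using the boundary value 3.3 from the test data; `math.isclose` is shown only as a standard-library alternative and is not part of the answer:

```python
import math

DELTA = 0.1 ** 12  # same tolerance as in the answer above

threshold = 1.1 * 3  # stored as 3.3000000000000003, not exactly 3.3
z = 3.3

strict = z < threshold              # True: rounding noise makes 3.3 look smaller
guarded = z < threshold - DELTA     # False: the margin absorbs the noise
close = math.isclose(z, threshold)  # True: the two values agree within tolerance

print(threshold, strict, guarded, close)
```

Without the margin, a row whose z is exactly 1.1 times the previous z would be treated as "smaller" and dropped.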

Test:

>>> df = pd.DataFrame({
...     'rank': [1, 1, 2, 2, 3, 3],
...     'x': [0, 3, 0, 3, 4, 2],
...     'y': [0, 4, 0, 4, 5, 5],
...     'z': [1, 3, 1.2, 3.25, 3, 6],
... })

>>> df

   rank  x  y     z
0     1  0  0  1.00
1     1  3  4  3.00
2     2  0  0  1.20
3     2  3  4  3.25
4     3  4  5  3.00
5     3  2  5  6.00

>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
>>> df[~mask]

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0

>>> df = pd.DataFrame({
...     'rank': [1, 1, 2, 2, 3, 3],
...     'x': [0, 3, 0, 3, 4, 2],
...     'y': [0, 4, 0, 4, 5, 5],
...     'z': [1, 3, 1.2, 3.3, 3, 6],
... })

>>> df

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     2  3  4  3.3
4     3  4  5  3.0
5     3  2  5  6.0


>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
>>> df[~mask]

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     2  3  4  3.3
5     3  2  5  6.0
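
Note that the function above compares `s['z']` with `1.1*d_prev['x']`. On both test frames this happens to give the same result as comparing with the previous z, because the relevant rank 1 row has x = 3 and z = 3. Below is a sketch of a variant that compares z with the previous group's z, which is what the question's wording seems to ask for (that reading is an assumption). The name `check_previous_group_z` is hypothetical.

```python
import pandas as pd

DELTA = 0.1 ** 12  # tolerance for float rounding, as in the answer

def check_previous_group_z(rank, d, groups):
    # Hypothetical variant: z is compared with the previous group's z
    if rank - 1 not in groups.groups:
        return pd.Series(False, index=d.index)
    d_prev = groups.get_group(rank - 1)

    def is_near(s):
        # x and y within 1 of a previous-rank row
        near_xy = (d_prev[['x', 'y']] - s[['x', 'y']]).abs().le(1).all(axis=1)
        # z strictly below 1.1 times that row's z
        small_z = s['z'] < 1.1 * d_prev['z'] - DELTA
        return (near_xy & small_z).any()

    return d.apply(is_near, axis=1)

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.3, 3, 6],
})
groups = df.groupby('rank')
mask = pd.concat([check_previous_group_z(rank, d, groups) for rank, d in groups])
result = df[~mask]
print(result)
```

On this frame the rows with index 0, 1, 2, 3 and 5 are kept, matching the second test output above.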

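【Discussion】:

This is perfect! Could you also explain more about why you use the DELTA variable? (I also think you meant to do ` (s['z'](s['z']<1.1*d_prev['x'])

@mike_gundy123, thanks for the appreciation! I used DELTA because floating-point values are always approximate, so 1.1*3 ≠ 3.3 exactly; it might be 1.1*3=3.299999999999999, which causes confusion. Anyway, have a look at this answer, where you will find a better explanation of the phenomenon: ***.com/questions/5595425/… (s['z']

Oh no, not that. I meant that you are comparing 'Z' with 'X' when you should be comparing 'Z' with 'Z'.

@mike_gundy123 But don't you need that? I quote from your question: -------- Basically what I want to happen is if the previous rank has any rows with x, y (+- 1 both ways) and z (

【Answer 4】:

Just adjust the z term of the lambda expression from the linked post: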

return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*d_prev['z']]).all(1).any(), axis=1)

Here is the complete code that works for me:

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 3, 4, 2],
    'y': [0, 4, 0, 4, 4, 5, 5],
    'z': [1, 3, 1.2, 3.3, 3.31, 3, 6],
})


def check_previous_group(rank, d, groups):
    if rank - 1 not in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset 
        # of the previous group: abs(d_prev-s)
        # if all differences are within 1/1/0.1*z for x/y/z
        # for at least one row of the previous group
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*d_prev['z']]).all(1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]
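
The comments on this answer report that putting `.1*d_prev['z']` inside the list passed to `.le()` raises a ValueError. One possible workaround, offered as an assumption and not as the answerer's method, is to test x/y and z in separate steps, so that each comparison is Series against Series. The helper name `within_previous` and the toy rows are hypothetical.

```python
import pandas as pd

def within_previous(s, d_prev):
    # x and y: absolute difference of at most 1 from a previous-rank row
    near_xy = (d_prev[['x', 'y']] - s[['x', 'y']]).abs().le(1).all(axis=1)
    # z: absolute difference of at most 10% of that previous row's z
    near_z = (d_prev['z'] - s['z']).abs().le(0.1 * d_prev['z'])
    return bool((near_xy & near_z).any())

# Toy previous-rank group and two candidate rows (hypothetical values)
d_prev = pd.DataFrame({'x': [0, 3], 'y': [0, 4], 'z': [1.0, 3.0]})
row_a = pd.Series({'x': 3, 'y': 4, 'z': 3.25})
row_b = pd.Series({'x': 3, 'y': 4, 'z': 3.31})

flag_a = within_previous(row_a, d_prev)  # z differs by 0.25, within 10% of 3.0
flag_b = within_previous(row_b, d_prev)  # z differs by 0.31, more than 10% of 3.0
print(flag_a, flag_b)
```

Inside `check_previous_group`, the same helper could be used as `d.apply(within_previous, axis=1, d_prev=d_prev)`.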

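【Discussion】:

This doesn't work for me; it says ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

OK, I think I fixed it: instead of doing .1*d_prev['z'] I had to do .1*s['z']

You may want to check that. I believe that takes 10% of the second z value (represented by 's') rather than of the previous one (represented by d_prev). I believe just using the code from the other answer with d_prev, as I did, should give 10% of the previous row.

Hmm, you are right, but d_prev is a whole df, whereas s (as far as I understand) is each row in d, so I can't quite do d_prev[['z']]. Hope that makes sense.

Maybe I'm just being dense, but this doesn't work for me (I copied and pasted your code). I get the error ValueError: Unable to coerce list of <class 'int'> to Series/DataFrame. I think it comes from here: le([1,1,.1*d_prev['z']]), so I changed it to le([1,1,.1*d_prev[['z']]]). That gave me the error I mentioned in my first reply to you.

【Answer 5】:

This works for me on Python 3.8.6: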

import pandas as pd

dfg = df.groupby("rank")

def filter_func(dfg):
    for g in dfg.groups.keys():
        if g-1 in dfg.groups.keys():
            yield (
                pd.merge(
                    dfg.get_group(g).assign(id = lambda df: df.index), 
                    dfg.get_group(g-1),
                    how="cross", suffixes=("", "_prev")
                ).assign(
                    cond = lambda df: ~(
                        (df.x - df.x_prev).abs().le(1) & (df.y - df.y_prev).abs().le(1) & df.z.divide(df.z_prev).lt(1.1)
                    )
                )
            ).groupby("id").agg(
                {
                    **{"cond": "all"},
                    **{k: "first" for k in df.columns}
                }
                ).loc[lambda df: df.cond].drop(columns = ["cond"])
        else:
            yield dfg.get_group(g)

pd.concat(
    filter_func(dfg), ignore_index=True
)

The output seems to match what you expected:

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     3  2  5  6.0

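Small edit: in your question you seem to care about the row index. The solution I posted simply ignores it, but if you want to keep it, just save it as an additional column in the dataframe.

【Discussion】: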
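
As an illustration of keeping the row index, here is a minimal sketch that stores it in an ordinary column before merging. It swaps the cross merge for a keyed merge on a shifted rank, which is a different technique from the code above, and the column name `orig_index` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.25, 3, 6],
})

# Keep the original row labels in a regular column
cur = df.reset_index().rename(columns={'index': 'orig_index'})

# Shift ranks by one so each row pairs with every row of the previous rank
prev = df.assign(rank=df['rank'] + 1)
pairs = cur.merge(prev, on='rank', suffixes=('', '_prev'))

# Same test as in the answer: x and y within 1, z below 1.1 times the previous z
hit = (
    (pairs['x'] - pairs['x_prev']).abs().le(1)
    & (pairs['y'] - pairs['y_prev']).abs().le(1)
    & pairs['z'].div(pairs['z_prev']).lt(1.1)
)

# Drop the flagged rows by their original labels
dropped = pairs.loc[hit, 'orig_index'].unique()
result = df.drop(index=dropped)
print(result)
```

With the question's data this keeps the rows labelled 0, 1, 2 and 5, so the original index survives the filtering.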
