How to apply a threshold to a pandas DataFrame column and output a row outside of the threshold?

Posted 2023-02-23

[Asked]: 2018-06-08 14:11:10

[Question]:

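I have a huge dataset of product families. I am trying to catch any odd data entries whose price is much higher or lower than the other members of the same family. For example, I have this pandas.DataFrame: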

df =
Prices    Product Family
0    1.99        Yoplait
1    1.89        Yoplait
2    1.59        Yoplait
3    1.99        Yoplait
4    7.99        Yoplait
5    12.99       Hunts 
6    12.99       Hunts 
7    2.99        Hunts 
8    12.49       Hunts

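I would like to write a for loop that goes through each product family, sets some kind of threshold to identify which products are questionable (rows 4 and 7), and then outputs those rows. How can I do this?

So far I have this: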

families = df['Product Family'].unique()
for i in families:
    # if df['Prices'] ..... (set threshold)
    # then ..... (output the row that is questionable)
    pass

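Ideally, I would then complete that if statement inside the for loop for each product family. Does anyone have ideas (or better ideas) on how to set this threshold and complete the code?

[Question comments]:

[Answer 1]:

When working with pandas, it is best to avoid loops where possible. In your case, we can use groupby() to perform per-family operations. Here is one way to find outliers, by flagging values that deviate from the group median by more than 50%:

Code: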

df['median'] = df.groupby('Product_Family').transform('median')
df['outlier'] = ((df.Prices - df['median']) / df['median']).abs() > 0.5

Test code:

from io import StringIO

import pandas as pd

df = pd.read_fwf(StringIO(u"""
    Prices      Product_Family
    1.99        Yoplait
    1.89        Yoplait
    1.59        Yoplait
    1.99        Yoplait
    7.99        Yoplait
    12.99       Hunts 
    12.99       Hunts 
    2.99        Hunts 
    12.49       Hunts"""),
                 skiprows=1)

df['median'] = df.groupby('Product_Family').transform('median')
df['outlier'] = ((df.Prices - df['median']) / df['median']).abs() > 0.5

print(df[df.outlier])    
print(df)

Results:

   Prices Product_Family  median  outlier
4    7.99        Yoplait    1.99     True
7    2.99          Hunts   12.74     True

   Prices Product_Family  median  outlier
0    1.99        Yoplait    1.99    False
1    1.89        Yoplait    1.99    False
2    1.59        Yoplait    1.99    False
3    1.99        Yoplait    1.99    False
4    7.99        Yoplait    1.99     True
5   12.99          Hunts   12.74    False
6   12.99          Hunts   12.74    False
7    2.99          Hunts   12.74     True
8   12.49          Hunts   12.74    False
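
If the DataFrame has more columns than Prices and the grouping key, or the snippet is re-run after the median column already exists, transform('median') is applied to every remaining column and assigning the result to a single column can fail. The following is a minimal sketch, not the answer author's code, that selects Prices before transforming. The sample frame is rebuilt inline for illustration, and the names group_median, relative_dev and outliers are hypothetical.

```python
import pandas as pd

# Sample data from the question, rebuilt inline for illustration.
df = pd.DataFrame({
    "Prices": [1.99, 1.89, 1.59, 1.99, 7.99, 12.99, 12.99, 2.99, 12.49],
    "Product_Family": ["Yoplait"] * 5 + ["Hunts"] * 4,
})

# Select the Prices column first so that transform returns a Series,
# regardless of how many other columns the frame has.
group_median = df.groupby("Product_Family")["Prices"].transform("median")

# Relative deviation from the family median; 0.5 is the cutoff used above.
relative_dev = (df["Prices"] - group_median).abs() / group_median
outliers = df[relative_dev > 0.5]

print(outliers)
```

Keeping the median in a separate Series instead of a new column means the snippet can be re-run without changing the shape of df.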

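[Comments]:

This is great and makes sense! But I get an error on the line: df['median'] = df.groupby('Product_Family').transform('median') The error says: ValueError: Wrong number of items passed 2, placement implies 1

Under Test Code I gave the exact values I ran. Start from there, then make changes to see what goes wrong.

[Answer 2]: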

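You can also use quantiles for outlier detection, with groupby and transform as in the other answer. The following uses the 0.05 and 0.95 quantiles as limits: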

# FIND LOWER AND UPPER LIMITS: 
df["lower"] = df.groupby("ProductFamily").transform(lambda x: x.quantile(0.05))
df["upper"] = df.iloc[:,0:2].groupby("ProductFamily").transform(lambda x: x.quantile(0.95))
print(df) 

# SELECT ROWS THAT MEET CRITERIA: 
df = df[(df.Prices > df.lower) & (df.Prices < df.upper)]
print(df)

# TO KEEP ORIGINAL 2 COLUMNS:
df = df.iloc[:,0:2]
print(df)

Output:

   Prices ProductFamily  lower  upper
0    1.99       Yoplait  1.650   6.79
1    1.89       Yoplait  1.650   6.79
2    1.59       Yoplait  1.650   6.79
3    1.99       Yoplait  1.650   6.79
4    7.99       Yoplait  1.650   6.79
5   12.99         Hunts  4.415  12.99
6   12.99         Hunts  4.415  12.99
7    2.99         Hunts  4.415  12.99
8   12.49         Hunts  4.415  12.99

   Prices ProductFamily  lower  upper
0    1.99       Yoplait  1.650   6.79
1    1.89       Yoplait  1.650   6.79
3    1.99       Yoplait  1.650   6.79
8   12.49         Hunts  4.415  12.99

   Prices ProductFamily
0    1.99       Yoplait
1    1.89       Yoplait
3    1.99       Yoplait
8   12.49         Hunts
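
Note that the strict comparisons above also drop rows 2, 5 and 6, because in groups this small the 0.05 and 0.95 quantiles fall at or inside the extreme values. The following is a sketch of a variant, assuming inclusive limits are acceptable, that selects Prices explicitly and returns the rows outside the limits rather than the rows inside them. The names grouped_prices, inside and outliers are hypothetical.

```python
import pandas as pd

# Sample data from the question, rebuilt inline for illustration.
df = pd.DataFrame({
    "Prices": [1.99, 1.89, 1.59, 1.99, 7.99, 12.99, 12.99, 2.99, 12.49],
    "ProductFamily": ["Yoplait"] * 5 + ["Hunts"] * 4,
})

# Per-family quantile limits, broadcast back to the original row index.
grouped_prices = df.groupby("ProductFamily")["Prices"]
lower = grouped_prices.transform(lambda s: s.quantile(0.05))
upper = grouped_prices.transform(lambda s: s.quantile(0.95))

# Inclusive bounds: a price equal to a limit counts as inside.
inside = (df["Prices"] >= lower) & (df["Prices"] <= upper)
outliers = df[~inside]

print(outliers)
```

On the sample data this still flags row 2 (1.59), which lies just under its group's 0.05 quantile, so quantile limits are probably better suited to larger groups.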

[Comments]:

[Answer 3]:

Well, I think my approach is similar to Stephen Rauch's. The only difference is that I standardize/normalize the prices within each group.

import numpy as np

# Standardize or normalize the `Prices` per `ProductFamily` (absolute value)
df_std = df.groupby('ProductFamily').transform(lambda x: np.abs((x - x.mean()) / x.std()))

# We assume that any Price beyond one standard deviation is an outlier
outlier_mask = df_std['Prices'] > 1.0

# Split clean and outlier dataframes
df_clean = df[~outlier_mask]
df_outlier = df[outlier_mask]
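
For reference, here is a self-contained sketch of the same standardization idea on the sample data from the question. It selects the Prices column explicitly and uses the pandas .abs() method instead of numpy. The helper abs_zscore is a hypothetical name, and the 1.0 cutoff is the assumption stated above, not a general rule.

```python
import pandas as pd

# Sample data from the question, rebuilt inline for illustration.
df = pd.DataFrame({
    "Prices": [1.99, 1.89, 1.59, 1.99, 7.99, 12.99, 12.99, 2.99, 12.49],
    "ProductFamily": ["Yoplait"] * 5 + ["Hunts"] * 4,
})


def abs_zscore(s):
    """Absolute z-score within one group (pandas std defaults to ddof=1)."""
    return ((s - s.mean()) / s.std()).abs()


# Standardize Prices per family, then flag anything beyond one standard deviation.
z = df.groupby("ProductFamily")["Prices"].transform(abs_zscore)
outlier_mask = z > 1.0

df_clean = df[~outlier_mask]
df_outlier = df[outlier_mask]

print(df_outlier)
```

Because the mean and standard deviation are themselves pulled by the outlier, this cutoff may need tuning on real data; the median-based approach in Answer 1 is less sensitive to that.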

[Comments]:
