如何一次性检测和删除熊猫数据帧每一列的异常值? [复制]
Posted
技术标签:
【中文标题】如何一次性检测和删除熊猫数据帧每一列的异常值? [复制]【英文标题】:How to detect and remove outliers from each column of pandas dataframe at one go? [duplicate] 【发布时间】:2019-01-23 13:50:49 【问题描述】:我有一个包含六列的 pandas 数据框,我知道每列中都有一些异常值。所以我有这两行代码,它们几乎可以做我想做的事情。但它仅从数据框的一列中删除异常值。那么如果我想一起从每一列中删除异常值呢?
df = pd.DataFrame('stlines':np.random.normal(size=533))
df = df[np.abs(df.stlines-df.stlines.mean()) <= (2*df.stlines.std())]
这样做的优雅方法是什么?
【问题讨论】:
【参考方案1】:我会做如下的事情:
def filter_outliers(df, columns):
masks = (make_mask(df, column) for column in columns)
full_mask = np.logical_or.reduce(masks)
return df[full_mask]
def make_mask(df, column):
standardized = (df[column] - df[column].mean())/df[column].std()
return standardized.abs() >= 2
【讨论】:
【参考方案2】:问题是每列中的异常值可能发生在不同的行(记录)中。我建议您替换 np.nan
设置
np.random.seed([3, 1415])
df = pd.DataFrame(
np.random.normal(size=(20, 8)),
columns=list('ABCDEFGH')
)
df
A B C D E F G H
0 -2.129724 -1.268466 -1.970500 -2.259055 -0.349286 -0.026955 0.316236 0.348782
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2 0.746839 -1.504157 0.611362 0.400219 -0.959443 1.494226 -0.346508 -1.471558
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
7 -0.728821 -2.137846 1.402702 -0.017209 -0.071309 -0.533061 1.273899 0.348510
8 -0.920391 0.348579 -0.835074 -0.225377 0.206295 -0.582825 -1.511850 1.633570
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13 2.222507 0.668158 0.463331 -0.302869 0.226355 -0.966131 1.015160 -0.329008
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15 0.058956 0.242391 -0.660237 -0.081101 1.690625 0.296406 -0.938197 0.225710
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
pandas.DataFrame.mask
df.mask((df - df.mean()).abs() > 2 * df.std())
A B C D E F G H
0 NaN -1.268466 NaN NaN -0.349286 -0.026955 0.316236 0.348782
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2 0.746839 -1.504157 0.611362 0.400219 -0.959443 NaN -0.346508 -1.471558
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
7 -0.728821 NaN 1.402702 -0.017209 -0.071309 -0.533061 1.273899 0.348510
8 -0.920391 0.348579 -0.835074 -0.225377 0.206295 -0.582825 -1.511850 NaN
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13 NaN 0.668158 0.463331 -0.302869 0.226355 -0.966131 1.015160 -0.329008
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15 0.058956 0.242391 -0.660237 -0.081101 NaN 0.296406 -0.938197 0.225710
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
+ dropna
如果您只想要任何列不存在异常值的行,您可以使用dropna
跟进上述内容
df.mask((df - df.mean()).abs() > 2 * df.std()).dropna()
A B C D E F G H
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
【讨论】:
【参考方案3】:假设您有多个使用all
的列
df[df.apply(lambda x :(x-x.mean()).abs()<(2*x.std()) ).all(1)]
【讨论】:
以上是关于如何一次性检测和删除熊猫数据帧每一列的异常值? [复制]的主要内容,如果未能解决你的问题,请参考以下文章