Forward fill column on condition [closed]

【Title】: Forward fill column on condition [closed] 【Posted】: 2020-07-03 04:57:57 【Question】:

My dataframe looks like this:

import pandas as pd

df = pd.DataFrame({'Col1': [0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0],
                   'Col2': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})

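Wherever Col1 contains the value 1, I want to forward fill 1 into Col2 n times, counting the row that holds the 1. For example, if n = 4, I need this result: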

df = pd.DataFrame({'Col1': [0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0],
                   'Col2': [0,1,1,1,1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1]})

I think I could do this with a for loop and a counter that resets every time the condition occurs, but is there a faster way to produce the same result?
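For reference, the loop described above might look like the following. This is only a minimal sketch of the baseline, not code from the original post; the function name `forward_fill_loop` and the variable `remaining` are made up for illustration.

```python
import pandas as pd

def forward_fill_loop(col1, n):
    """Return a list with n ones starting at every 1 in col1 (trigger row included)."""
    out = []
    remaining = 0  # how many more rows still need a 1
    for value in col1:
        if value == 1:
            remaining = n  # reset the counter on every trigger
        if remaining > 0:
            out.append(1)
            remaining -= 1
        else:
            out.append(0)
    return out

df = pd.DataFrame({'Col1': [0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]})
df['Col2'] = forward_fill_loop(df['Col1'], 4)
print(df['Col2'].tolist())
```

It is handy as a correctness check for the vectorized answers, even if it is too slow for large frames.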

Thanks!

【Comments】:

【Answer 1】:

Approach #1: A NumPy-based one with 1D convolution -

N = 4 # window size
K = np.ones(N,dtype=bool)
df['Col2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')

A more compact one-liner -

df['Col2'] = (np.convolve(df.Col1,[1]*N)[:-N+1]>0).view('i1')
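To see why the convolution works: in a full convolution with a kernel of N ones, output position k is the sum of the input over positions k-N+1 through k, so it is positive exactly when a 1 occurred in that window. Below is a small self-contained sketch of the same idea. The helper name `forward_fill_convolve` is hypothetical, and slicing with `[:len(col1)]` instead of `[:-N+1]` is my own variation so that N = 1 does not produce an empty slice.

```python
import numpy as np

def forward_fill_convolve(col1, n):
    """Mark each 1 in col1 and the n-1 positions after it."""
    col1 = np.asarray(col1)
    # The full convolution has len(col1) + n - 1 entries; keep the first len(col1).
    counts = np.convolve(col1, np.ones(n, dtype=int))[:len(col1)]
    return (counts > 0).astype('i1')

col1 = [0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
print(forward_fill_convolve(col1, 4).tolist())
```

Overlapping runs are handled for free, since the test is only whether the window sum is positive.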

Approach #2: Here's one with SciPy's binary_dilation -

# The scipy.ndimage.morphology namespace is deprecated; import from scipy.ndimage.
from scipy.ndimage import binary_dilation

N = 4 # window size
K = np.ones(N,dtype=bool)
df['Col2'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')

Approach #3: Squeezing the most out of NumPy with a strided-view-based tool -

from skimage.util.shape import view_as_windows

N = 4 # window size
mask = df.Col1.values==1
w = view_as_windows(mask,N)
idx = len(df)-(N-mask[-N:].argmax())
if mask[-N:].any():
    mask[idx:idx+N-1] = 1
w[mask[:-N+1]] = 1
df['Col2'] = mask.view('i1')

Benchmarking

Setup, scaling up the given sample by 10,000x -

In [67]: df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
    ...:                    ,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})
    ...: 
    ...: df = pd.concat([df]*10000)
    ...: df.index = range(len(df.index))

Timings

# @jezrael's soln
In [68]: %%timeit
    ...: n = 3
    ...: df['Col2_1'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)
5.15 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# App-1 from this post
In [72]: %%timeit
    ...: N = 4 # window size
    ...: K = np.ones(N,dtype=bool)
    ...: df['Col2_2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')
1.41 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# App-2 from this post
In [70]: %%timeit
    ...: N = 4 # window size
    ...: K = np.ones(N,dtype=bool)
    ...: df['Col2_3'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')
2.92 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# App-3 from this post
In [35]: %%timeit
    ...: N = 4 # window size
    ...: mask = df.Col1.values==1
    ...: w = view_as_windows(mask,N)
    ...: idx = len(df)-(N-mask[-N:].argmax())
    ...: if mask[-N:].any():
    ...:     mask[idx:idx+N-1] = 1
    ...: w[mask[:-N+1]] = 1
    ...: df['Col2_4'] = mask.view('i1')
1.22 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# @yatu's soln
In [71]: %%timeit
    ...: n = 4
    ...: ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
    ...: df.loc[ix, 'Col2_5'] = 1
7.55 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

【Discussion】:

【Answer 2】:

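For a general solution, replace the non-1 values with missing values using Series.where, forward fill the 1 values with the limit parameter, and finally replace the remaining missing values with the original values. Note that limit counts only the rows filled after each 1, so n = 3 yields the four 1s per run shown in the question: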

n = 3
df['Col2'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)

print (df)
    Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1
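The chained expression is easier to follow when split into its three steps. Here is a sketch on a shorter, made-up column; the intermediate names `ones_only` and `filled` are mine, not from the answer.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [0, 1, 0, 0, 0, 0, 1, 0]})
n = 3  # number of rows to fill *after* each 1

ones_only = df['Col1'].where(df['Col1'].eq(1))   # non-1 values become NaN
filled = ones_only.ffill(limit=n)                # each 1 spreads over at most n following NaNs
result = filled.fillna(df['Col1']).astype(int)   # leftover NaNs take the original value

print(ones_only.tolist())
print(filled.tolist())
print(result.tolist())
```

Rows before the first 1 stay NaN after the forward fill, which is why the final fillna step is needed.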

【Discussion】:

【Answer 3】:

Here's a NumPy-based approach: use np.flatnonzero to get the indices where Col1 is 1, then add a range of length n to them via a broadcast sum:

n = 4
ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
df.loc[ix, 'Col2'] = 1

print(df)

     Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1
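A closer look at the broadcast step, on a shorter, made-up frame where the last 1 sits near the end. In that case the computed positions run past the last row. The bounds filter and the positional `iloc` assignment below are my own additions, offered as one possible way to handle that case, and the names `starts` and `grid` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [0, 1, 0, 0, 0, 0, 0, 1, 0],
                   'Col2': [0] * 9})
n = 4

starts = np.flatnonzero(df['Col1'].to_numpy() == 1)   # positions of the 1s: [1, 7]
# A column vector of shape (n, 1) plus a row of k starts broadcasts to an (n, k) grid.
grid = starts + np.arange(n)[:, None]
ix = grid.ravel('F')                                   # column-major: one run per start
ix = ix[ix < len(df)]                                  # drop positions past the last row

# Assign by position so the result does not depend on the index labels.
df.iloc[ix, df.columns.get_loc('Col2')] = 1
print(df['Col2'].tolist())
```

The original `df.loc[ix, 'Col2'] = 1` treats `ix` as labels, so it assumes a default 0..len-1 index.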

【Discussion】:

【Answer 4】:

Something using reindex:

N=4
s=df.loc[df.Col1==1,'Col1']
idx=s.index
s=s.reindex(idx.repeat(N))
s.index=(idx.values+np.arange(N)[:,None]).ravel('F')

df.Col2.update(s)
df
    Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1

【Discussion】:
