Pandas: check the last N rows for values, new column based on results

Posted 2023-02-23

Original title: Pandas Check last N Rows for values, new column based on results
Asked: 2018-04-17 05:23:15

Question:

I have a DataFrame, Df2. I am trying to check each of the last 10 rows of the Lead_Lag column below. If any of those rows holds a value other than null, I want a new column Position to equal 'Y':

def run_HG_AUDUSD_15M_Aggregate():
    Df1 = pd.read_csv(max(glob.iglob(r"C:\Users\cost9\OneDrive\Documents\PYTHON\Daily Tasks\Pairs Trading\HG_AUDUSD\CSV\15M\Lead_Lag\*.csv"), key=os.path.getctime))    
    Df2 = Df1[['Date', 'Close_HG', 'Close_AUDUSD', 'Lead_Lag']]

    Df2['Position'] = ''

    for index,row in Df2.iterrows():
        if Df2.loc[Df2.index.shift(-10):index,"Lead_Lag"].isnull(): 
            continue
        else:
            Df2.loc[index, 'Position'] = "Y"

A sample of the data:

Date	Close_HG	Close_AUDUSD	Lead_Lag
7/19/2017 12:59	2.7	0.7956	
7/19/2017 13:59	2.7	0.7955	
7/19/2017 14:14	2.7	0.7954	
7/20/2017 3:14	2.7	0.791	
7/20/2017 5:44	2.7	0.791	
7/20/2017 7:44	2.71	0.7925	
7/20/2017 7:59	2.7	0.7924	
7/20/2017 8:44	2.7	0.7953	Short_Both
7/20/2017 10:44	2.71	0.7964	Short_Both
7/20/2017 11:14	2.71	0.7963	Short_Both
7/20/2017 11:29	2.71	0.7967	Short_Both
7/20/2017 13:14	2.71	0.796	Short_Both
7/20/2017 13:29	2.71	0.7956	Short_Both
7/20/2017 14:29	2.71	0.7957	Short_Both

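So in this case I would want the last two values of the new column Position to be 'Y', because at least one of the last 10 rows has a value in the Lead_Lag column. I want to apply this on a rolling basis: for instance, the Position value for row 13 would look at rows 12-3, the Position value for row 12 would look at rows 11-2, and so on.

Instead, I get the error:

NotImplementedError: Not supported for type RangeIndex

I have tried several variations of the shift method (defining it before the loop, etc.) but can't get it to work.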
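A minimal sketch of why the loop above fails, assuming the default RangeIndex that read_csv produces. The frame `demo` and the values of `index` and `n` are made up for illustration. `Index.shift` is intended for datetime-like indexes, so calling it on a RangeIndex raises NotImplementedError. With integer labels, a lookback window can be written as a plain `.loc` slice instead, and the resulting boolean Series has to be reduced with `.any()` or `.all()` before it is used in an `if`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Df2 (default RangeIndex, as read_csv produces).
demo = pd.DataFrame({"Lead_Lag": [np.nan, np.nan, "Short_Both", np.nan]})

# Index.shift is only implemented for datetime-like indexes.
try:
    demo.index.shift(-10)
    shift_failed = False
except NotImplementedError:
    shift_failed = True

# With integer labels, a lookback window is a plain .loc slice.
# Reduce the boolean Series with .any() / .all() before using it in `if`.
index, n = 3, 2
window = demo.loc[max(index - n, 0):index, "Lead_Lag"]
has_value = window.notnull().any()
```

Here `window` covers labels 1 through 3, because `.loc` slices include both endpoints.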

EDIT: here is the solution:

N = 10
Df2['Position'] = ''
for index,row in Df2.iterrows():
    if (Df2.loc[index-N:index,"Lead_Lag"] != "N").any():
        Df2.loc[index, 'Position'] = "Y"
    else:
        Df2.loc[index, 'Position'] = "N"

Comments on the question:

Please add your solution as an answer to the question rather than as this edit; to learn more, see the tour.

Answer 1:

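EDIT:

After posting a solution to the question, I found that the OP needs something else (testing over a window of N rows), so I added another answer.

Old solution:

Use numpy.where with a boolean mask built by chaining conditions: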

m = df["Lead_Lag"].notnull() & df.index.isin(df.index[-10:])

Or select the last rows of the column by position with iloc and add False values with reindex:

m = df["Lead_Lag"].iloc[-10:].notnull().reindex(df.index, fill_value=False)

df['new'] = np.where(m, 'Y', '')

print (df)
               Date  Close_HG  Close_AUDUSD    Lead_Lag new
0   7/19/2017 12:59      2.70        0.7956         NaN    
1   7/19/2017 13:59      2.70        0.7955         NaN    
2   7/19/2017 14:14      2.70        0.7954         NaN    
3    7/20/2017 3:14      2.70        0.7910         NaN    
4    7/20/2017 5:44      2.70        0.7910         NaN    
5    7/20/2017 7:44      2.71        0.7925         NaN    
6    7/20/2017 7:59      2.70        0.7924         NaN    
7    7/20/2017 8:44      2.70        0.7953  Short_Both   Y
8   7/20/2017 10:44      2.71        0.7964  Short_Both   Y
9   7/20/2017 11:14      2.71        0.7963  Short_Both   Y
10  7/20/2017 11:29      2.71        0.7967  Short_Both   Y
11  7/20/2017 13:14      2.71        0.7960  Short_Both   Y
12  7/20/2017 13:29      2.71        0.7956  Short_Both   Y
13  7/20/2017 14:29      2.71        0.7957  Short_Both   Y

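Comments:

"If there are values other than null, put Y", so you have to either change isnull() to notnull() or swap the arguments in np.where. Nit: also add a space after the , to comply with PEP 8.

@jezrael - so row 13 above would look back to row 3 to see whether there is a non-null value in rows 12-3. Row 12 would look back to row 2, etc., so the 'new' column you have would be rolling.

Thanks very much, jezrael. I changed your code slightly to get it to work - it's now in the original post.

@jezrael I have a similar problem, except for this: "df.index.isin(df.index[-10:]". I need to check all rows before the current row instead of 10. If it is simple, do you know the modification for that case?

Answer 2: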

This is what I ended up doing:

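def run_HG_AUDUSD_15M_Aggregate():
    # ... (Df2 is loaded as in the question)

    N = 10
    Df2['Position'] = ''

    for index,row in Df2.iterrows():
        # .loc slices include both endpoints, so this window holds up to N + 1 rows
        if (Df2.loc[index-N:index,"Lead_Lag"] != "N").any():
            Df2.loc[index, 'Position'] = "Y"
        else:
            Df2.loc[index, 'Position'] = "N"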

Answer 3:

Sample:

import numpy as np
import pandas as pd

np.random.seed(123)
M = 20
Df2 = pd.DataFrame({'Lead_Lag': np.random.choice([np.nan, 'N'], p=[.3,.7], size=M)})

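Solution 1 - pandas:

Explanation: first compare the column for inequality with Series.ne to get a boolean Series, then use Series.rolling with any to test the values in each window, and finally set N and Y with numpy.where: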

N = 3

a = (Df2['Lead_Lag'].ne('N')
                    .rolling(N, min_periods=1)
                    .apply(lambda x: x.any(), raw=False))      
Df2['Pos1'] = np.where(a, 'Y','N')
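A hedged sketch of a variation on the rolling test above: since the flags are 0/1, a rolling maximum gives the same answer as `apply(lambda x: x.any())` without calling a Python function per window. The Series `s` is made-up data, not the answer's random sample, and `dtype=object` is an assumption so that NaN compares as "not equal to 'N'":

```python
import numpy as np
import pandas as pd

# Small made-up column; 'N' marks "no signal", NaN marks the rows to detect.
s = pd.Series(["N", np.nan, "N", "N", "N", np.nan], name="Lead_Lag", dtype=object)
N = 3

flag = s.ne("N").astype(int)  # 1 where the value differs from 'N'

# Rolling max over 0/1 flags is 1 exactly when the window contains a 1.
via_max = flag.rolling(N, min_periods=1).max().astype(bool)

# Same test written with apply, as in the answer, for comparison.
via_apply = (flag.rolling(N, min_periods=1)
                 .apply(lambda x: x.any(), raw=False)
                 .astype(bool))

pos = np.where(via_max, "Y", "N")
print(pos)
```

Both variants should flag the same rows; the rolling maximum is usually the faster of the two on long columns.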

Another numpy solution with strides, which prepends N - 1 False values so that the first rows are tested correctly:

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

x = np.concatenate([[False] * (N - 1), Df2['Lead_Lag'].ne('N').values])
arr = np.any(rolling_window(x, N), axis=1)

Df2['Pos2'] = np.where(arr, 'Y','N')
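As a possible alternative to the hand-written `rolling_window` helper, NumPy 1.20 and later ship `np.lib.stride_tricks.sliding_window_view`, which builds the same kind of windowed view with bounds checking. The sketch below assumes that NumPy version and uses a made-up `flags` array in place of `Df2['Lead_Lag'].ne('N').values`:

```python
import numpy as np

# Made-up boolean flags standing in for Df2['Lead_Lag'].ne('N').values.
flags = np.array([False, True, False, False, False, True])
N = 3

# Pad the front so every original row has a full window of N values.
padded = np.concatenate([np.zeros(N - 1, dtype=bool), flags])

# Read-only view of shape (len(flags), N); row i ends at flags[i].
windows = np.lib.stride_tricks.sliding_window_view(padded, N)
pos = np.where(windows.any(axis=1), "Y", "N")
print(pos)
```

The view is read-only and shares memory with `padded`, so no window data is copied.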

Compare the output:

print (Df2)
   Lead_Lag Pos1 Pos2
0         N    N    N
1       nan    Y    Y
2       nan    Y    Y
3         N    Y    Y
4         N    Y    Y
5         N    N    N
6         N    N    N
7         N    N    N
8         N    N    N
9         N    N    N
10        N    N    N
11        N    N    N
12        N    N    N
13      nan    Y    Y
14        N    Y    Y
15        N    Y    Y
16      nan    Y    Y
17      nan    Y    Y
18        N    Y    Y
19        N    Y    Y

Details of the numpy solution:

Prepend False values so that the first N - 1 rows can be tested:

print (np.concatenate([[False] * (N - 1), Df2['Lead_Lag'].ne('N').values]))
[False False False  True  True False False False False False False False
 False False False  True False False  True  True False False]

Strides return a 2D boolean array:

print (rolling_window(x, N))
[[False False False]
 [False False  True]
 [False  True  True]
 [ True  True False]
 [ True False False]
 [False False False]
 [False False False]
 [False False False]
 [False False False]
 [False False False]
 [False False False]
 [False False False]
 [False False False]
 [False False  True]
 [False  True False]
 [ True False False]
 [False False  True]
 [False  True  True]
 [ True  True False]
 [ True False False]]

numpy.any tests each row for at least one True:

print (np.any(rolling_window(x, N), axis=1))
[False  True  True  True  True False False False False False False False
 False  True  True  True  True  True  True  True]

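EDIT:

If you test against the iterrows solution, the output is different. The reason is that the iterrows solution tests a window of N + 1 rows, because .loc slices include both endpoints, so to get the same output you need to add 1 to N: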

N = 3
Df2['Position'] = ''

for index,row in Df2.iterrows():
    #for check windows
    #print (Df2.loc[index-N:index,"Lead_Lag"])
    if (Df2.loc[index-N:index,"Lead_Lag"] != "N").any():
        Df2.loc[index, 'Position'] = "Y"
    else:
        Df2.loc[index, 'Position'] = "N"

a = (Df2['Lead_Lag'].ne('N')
                    .rolling(N + 1, min_periods=1)
                    .apply(lambda x: x.any(), raw=False)  )      
Df2['Pos1'] = np.where(a, 'Y','N')

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

x = np.concatenate([[False] * (N), Df2['Lead_Lag'].ne('N').values])
arr = np.any(rolling_window(x, N + 1), axis=1)

Df2['Pos2'] = np.where(arr, 'Y','N')

print (Df2)

   Lead_Lag Position Pos1 Pos2
0         N        N    N    N
1       nan        Y    Y    Y
2       nan        Y    Y    Y
3         N        Y    Y    Y
4         N        Y    Y    Y
5         N        Y    Y    Y
6         N        N    N    N
7         N        N    N    N
8         N        N    N    N
9         N        N    N    N
10        N        N    N    N
11        N        N    N    N
12        N        N    N    N
13      nan        Y    Y    Y
14        N        Y    Y    Y
15        N        Y    Y    Y
16      nan        Y    Y    Y
17      nan        Y    Y    Y
18        N        Y    Y    Y
19        N        Y    Y    Y
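The off-by-one described in this edit comes from label-based slicing. A small sketch, using a made-up frame with a default RangeIndex, shows that `.loc[index - N:index]` returns N + 1 rows while the positional `.iloc[index - N:index]` returns N:

```python
import pandas as pd

# Made-up frame with a default RangeIndex.
df = pd.DataFrame({"Lead_Lag": list("abcdefgh")})
N, index = 3, 5

by_label = df.loc[index - N:index, "Lead_Lag"]      # labels 2, 3, 4, 5
by_position = df["Lead_Lag"].iloc[index - N:index]  # positions 2, 3, 4

print(len(by_label), len(by_position))
```

Note that the positional slice also excludes the current row, which is closer to the "row 13 looks at rows 12-3" wording in the question.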
