Python pandas calculate time until value in a column is greater than it is in current period

Posted 2023-03-11

[Title]: Python pandas calculate time until value in a column is greater than it is in current period
[Posted]: 2016-12-11 01:50:59
[Question]:

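I have a pandas DataFrame in Python with several columns and a datetime stamp. I want to create a new column that calculates the time until output is less than it is in the current period.

My current table looks like this: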

 datetime               output
 2014-05-01 01:00:00    3
 2014-05-01 01:00:01    2
 2014-05-01 01:00:02    3
 2014-05-01 01:00:03    2
 2014-05-01 01:00:04    1
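
A minimal sketch for rebuilding this table so that the snippets in the answers can be run. It assumes the column names datetime and output shown above and parses the timestamps up front; the variable name df matches what the answers use.

```python
import pandas as pd

# Rebuild the sample table from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2014-05-01 01:00:00",
        "2014-05-01 01:00:01",
        "2014-05-01 01:00:02",
        "2014-05-01 01:00:03",
        "2014-05-01 01:00:04",
    ]),
    "output": [3, 2, 3, 2, 1],
})
print(df)
```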

I am trying to get my table to have an additional column that looks like this:

 datetime               output     secondsuntildecrease
 2014-05-01 01:00:00    3         1
 2014-05-01 01:00:01    2         3
 2014-05-01 01:00:02    3         1
 2014-05-01 01:00:03    2         1
 2014-05-01 01:00:04    1

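Thanks in advance!

[Comments]:

How do you calculate the values in the third column? Is that the point of the question? Or do you just want to add a column to an existing csv file?

[Answer 1]: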

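import numpy as np
import pandas as pd
# True at (i, j) when row j is not before row i and output[j] < output[i]
upper_triangle     = np.triu(df.output.values < df.output.values[:, None])
df['datetime']     = pd.to_datetime(df['datetime'])
# argmax returns positions, so select by position with iloc
df['s_until_dec']  = df['datetime'].iloc[upper_triangle.argmax(axis=1)].values - df['datetime']
df.loc[~upper_triangle.any(axis=1), 's_until_dec'] = np.nan
df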
             datetime  output           s_until_dec
0 2014-05-01 01:00:00       3              00:00:01
1 2014-05-01 01:00:01       2              00:00:03
2 2014-05-01 01:00:02       3              00:00:01
3 2014-05-01 01:00:03       2              00:00:01
4 2014-05-01 01:00:04       1                   NaT

Here is how it works:

df.output.values < df.output.values[:, None] creates a pairwise comparison matrix with broadcasting ([:, None] adds a new axis):

df.output.values < df.output.values[:, None]
Out: 
array([[False,  True, False,  True,  True],
       [False, False, False, False,  True],
       [False,  True, False,  True,  True],
       [False, False, False, False,  True],
       [False, False, False, False, False]], dtype=bool)

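For example, output[1] is smaller than output[0] here, so the matrix element at (0, 1) is True. We only need the upper triangle, so I use np.triu to get the upper triangle of this matrix. argmax() gives me the index of the first True value in each row. If I pass that to iloc, I get the corresponding date. The last row is the exception, of course: it is all False, so I need to replace its result with np.nan. The .loc part checks the matrix for that case and replaces the value with np.nan.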
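
The steps above can be traced with NumPy alone. This is a sketch on the question's five output values, not the answer's exact code: the names smaller, later_and_smaller, first_hit and has_hit are made up for illustration, and k=1 is passed to np.triu to drop the diagonal explicitly (the answer uses the default, which gives the same result because no value is smaller than itself).

```python
import numpy as np

output = np.array([3, 2, 3, 2, 1])

# Entry (i, j) is True when output[j] < output[i].
smaller = output < output[:, None]

# Keep only pairs where j comes after i.
later_and_smaller = np.triu(smaller, k=1)

# argmax returns the position of the first True in each row,
# and 0 for a row with no True at all, so such rows need a separate mask.
first_hit = later_and_smaller.argmax(axis=1)
has_hit = later_and_smaller.any(axis=1)

print(first_hit.tolist())  # [1, 4, 3, 4, 0]
print(has_hit.tolist())    # [True, True, True, True, False]
```

Row 1 points at position 4, which is three seconds later, matching the 00:00:03 in the output table. The last row reports position 0 only because it has no match, which is why the mask is needed.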

[Answer 2]:

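import numpy as np
import pandas as pd

# Newer pandas no longer accepts pd.DatetimeIndex(start=...); pd.date_range builds the same index.
df = pd.DataFrame([3, 2, 3, 2, 1], index=pd.date_range(start='2014-05-01 01:00:00', periods=5, freq='s'), columns=['output'])

def f(s):
    # s is one row of the comparison matrix: s.name is the current timestamp,
    # s.index holds every timestamp, and s is True where output is smaller.
    s = s[s & (s.index > s.name)]
    if s.empty:
        return np.nan
    else:
        return (s.index[0] - s.name).total_seconds()

# The first apply builds one boolean row per timestamp; f then finds the first later hit in each row.
df['secondsuntildecrease'] = df['output'].apply(lambda x: df['output'] < x).apply(f, axis=1)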

df

Output

                     output  secondsuntildecrease
2014-05-01 01:00:00       3                   1.0
2014-05-01 01:00:01       2                   3.0
2014-05-01 01:00:02       3                   1.0
2014-05-01 01:00:03       2                   1.0
2014-05-01 01:00:04       1                   NaN

[Answer 3]:

Here is a one-liner

df['seconds_until'] = df.apply(lambda x: pd.to_datetime(df.loc[(df['output'] < x['output']) & (df['datetime'] > x['datetime']), 'datetime'].min()) - pd.to_datetime(x['datetime']), axis=1)

Output

              datetime  output  seconds_until
0  2014/05/01 01:00:00       3       00:00:01
1  2014/05/01 01:00:01       2       00:00:03
2  2014/05/01 01:00:02       3       00:00:01
3  2014/05/01 01:00:03       2       00:00:01
4  2014/05/01 01:00:04       1            NaT

[Answer 4]:

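Use numpy's outer subtract to get a matrix of differences.

Then filter with numpy's triangle function (np.triu) so that we only take differences against future times and stay away from the past.

Use numpy's where so that rows that are all False come out as NaN.

Finally, take the difference in time.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    dict(output=[3, 2, 3, 2, 1],
         datetime=pd.date_range(start='2014-05-01 01:00:00', periods=5, freq='s'))
)

# .values keeps the outer subtraction in plain numpy; newer pandas rejects ufunc.outer on a Series
gt0 = np.triu(np.subtract.outer(df.output.values, df.output.values), 1) > 0
# position of the first later row with a smaller output (0 when there is none)
idx = gt0.argmax(1)
secs = -(df.datetime - df.datetime.values[idx]).dt.total_seconds()
# rows with no later decrease would otherwise point at row 0, so blank them out
secs[:] = np.where(gt0.any(1), secs, np.nan)
secs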

0    1.0
1    3.0
2    1.0
3    1.0
4    NaN
Name: datetime, dtype: float64

Timing

Mine and ayhan's seem to perform best on small samples.

ayhan's is best over 10,000 rows.
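
The measurements behind these remarks are not included in the post. A rough way to run a comparison of your own is sketched below; it is not the answerer's benchmark. make_frame, broadcast_version and apply_version are hypothetical names, the two functions are re-implementations in the style of Answers 1 and 3, and the random data and row counts are arbitrary assumptions.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n, seed=0):
    """Hypothetical helper: n rows at one-second spacing with random integer output."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "datetime": pd.date_range("2014-05-01 01:00:00", periods=n,
                                  freq=pd.Timedelta(seconds=1)),
        "output": rng.integers(0, 100, size=n),
    })


def broadcast_version(df):
    """Broadcast comparison in the style of Answer 1, returning seconds as floats."""
    out = df["output"].to_numpy()
    hits = np.triu(out < out[:, None], k=1)
    nxt = df["datetime"].to_numpy()[hits.argmax(axis=1)]
    delta = pd.Series(nxt, index=df.index) - df["datetime"]
    # Rows without a later, smaller value become NaN.
    return delta.dt.total_seconds().where(hits.any(axis=1))


def apply_version(df):
    """Row-wise apply in the style of Answer 3, returning seconds as floats."""
    def one_row(row):
        later = df.loc[(df["output"] < row["output"])
                       & (df["datetime"] > row["datetime"]), "datetime"]
        # min() of an empty selection is NaT, which yields NaN seconds.
        return (later.min() - row["datetime"]).total_seconds()
    return df.apply(one_row, axis=1)


for n in (100, 500):
    df = make_frame(n)
    # Both versions should agree before their timings are compared.
    assert broadcast_version(df).equals(apply_version(df))
    t_b = timeit.timeit(lambda: broadcast_version(df), number=3)
    t_a = timeit.timeit(lambda: apply_version(df), number=3)
    print(f"n={n}: broadcast {t_b:.4f}s, apply {t_a:.4f}s")
```

The broadcast approach builds an n-by-n matrix, so memory grows quadratically with the row count; that is worth keeping in mind when raising n well beyond the sizes used here.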
