How to find gaps in a Python pandas DataFrame column (date format)?

【Posted】: 2018-11-29 05:52:20 【Question】:

I have a pandas DataFrame that looks like this:

name,year
AAA,2015-11-02 22:00:00
AAA,2015-11-02 23:00:00
AAA,2015-11-03 00:00:00
AAA,2015-11-03 01:00:00
AAA,2015-11-03 02:00:00
AAA,2015-11-03 05:00:00
ZZZ,2015-09-01 00:00:00
ZZZ,2015-11-01 01:00:00
ZZZ,2015-11-01 07:00:00
ZZZ,2015-11-01 08:00:00
ZZZ,2015-11-01 09:00:00
ZZZ,2015-11-01 12:00:00

I want to find the gaps in the year column of the DataFrame for each name. For example:

    AAA has a 2-hour gap after "2015-11-03 02:00:00".
    ZZZ has a 5-hour gap after "2015-11-01 01:00:00".
    ZZZ has a 2-hour gap after "2015-11-01 09:00:00".

I want to generate two CSV files with the following contents:

CSV-1:

name,year,gap
AAA,2015-11-02 22:00:00,0
AAA,2015-11-02 23:00:00,0
AAA,2015-11-03 00:00:00,0
AAA,2015-11-03 01:00:00,0
AAA,2015-11-03 02:00:00,2
AAA,2015-11-03 05:00:00,0
ZZZ,2015-09-01 00:00:00,0
ZZZ,2015-11-01 01:00:00,5
ZZZ,2015-11-01 07:00:00,0
ZZZ,2015-11-01 08:00:00,0
ZZZ,2015-11-01 09:00:00,2
ZZZ,2015-11-01 12:00:00,0

CSV-2:

name,prev_year,next_year,gaps
AAA,2015-11-03 02:00:00,2015-11-03 05:00:00,2015-11-03 03:00:00
AAA,2015-11-03 02:00:00,2015-11-03 05:00:00,2015-11-03 04:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 02:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 03:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 04:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 05:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 06:00:00
ZZZ,2015-11-01 09:00:00,2015-11-01 12:00:00,2015-11-01 10:00:00
ZZZ,2015-11-01 09:00:00,2015-11-01 12:00:00,2015-11-01 11:00:00

This is what I have tried so far:

df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
mask = df.groupby("name").year.diff() > pd.Timedelta('0 days 01:00:00')

【Answer 1】:

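To get the gaps into your DataFrame, assign the result of the grouped diff() to a column instead of only building a boolean mask from it. To express each difference as a number of hours, simply divide it by a one-hour Timedelta: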

df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
df['Gap'] = (df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).fillna(0)

This gives us the following DataFrame:

   name                year     Gap
0   AAA 2015-11-02 22:00:00     0.0
1   AAA 2015-11-02 23:00:00     1.0
2   AAA 2015-11-03 00:00:00     1.0
3   AAA 2015-11-03 01:00:00     1.0
4   AAA 2015-11-03 02:00:00     1.0
5   AAA 2015-11-03 05:00:00     3.0
6   ZZZ 2015-11-01 01:00:00     0.0
7   ZZZ 2015-11-01 07:00:00     6.0
8   ZZZ 2015-11-01 08:00:00     1.0
9   ZZZ 2015-11-01 09:00:00     1.0
10  ZZZ 2015-11-01 12:00:00     3.0
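
As a small aside on why dividing by a one-hour Timedelta produces plain numbers: dividing one Timedelta by another yields a float ratio. The sketch below uses two made-up timestamps (not taken from the full data set) and also shows an equivalent route through `.dt.total_seconds()`.

```python
import pandas as pd

# Two illustrative timestamps three hours apart (values chosen for this sketch).
ts = pd.Series(pd.to_datetime(["2015-11-03 02:00:00", "2015-11-03 05:00:00"]))

delta = ts.diff()                           # Timedelta series: NaT, then 3 hours
hours_a = delta / pd.Timedelta(hours=1)     # Timedelta / Timedelta -> float hours
hours_b = delta.dt.total_seconds() / 3600   # same result, computed from seconds

print(hours_a.tolist())
print(hours_b.tolist())
```

The first element is NaN in both cases because the first row has no predecessor, which is why the answer finishes with `fillna(0)`.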

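To put each gap next to its start time, in line with the layout you want for "CSV-1", we just shift the column up one row and subtract one before filling the NA values: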

df['Gap'] = ((df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).shift(-1) - 1).fillna(0)

This gives:

   name                year  Gap
0   AAA 2015-11-02 22:00:00  0.0
1   AAA 2015-11-02 23:00:00  0.0
2   AAA 2015-11-03 00:00:00  0.0
3   AAA 2015-11-03 01:00:00  0.0
4   AAA 2015-11-03 02:00:00  2.0
5   AAA 2015-11-03 05:00:00  0.0
6   ZZZ 2015-11-01 01:00:00  5.0
7   ZZZ 2015-11-01 07:00:00  0.0
8   ZZZ 2015-11-01 08:00:00  0.0
9   ZZZ 2015-11-01 09:00:00  2.0
10  ZZZ 2015-11-01 12:00:00  0.0
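
For anyone who wants to reproduce this, here is a self-contained sketch. It assumes the 11 rows that appear in the output just shown, and it is a slight variant of the one-liner rather than the exact same code: it takes the next timestamp with a grouped `shift(-1)` so that the last row of one name can never pick up a value from the next name. The unshifted version works here too, because the first `diff()` of every group is NaN.

```python
import io
import pandas as pd

# Sample rows, assumed to match the answer's output above.
raw = """name,year
AAA,2015-11-02 22:00:00
AAA,2015-11-02 23:00:00
AAA,2015-11-03 00:00:00
AAA,2015-11-03 01:00:00
AAA,2015-11-03 02:00:00
AAA,2015-11-03 05:00:00
ZZZ,2015-11-01 01:00:00
ZZZ,2015-11-01 07:00:00
ZZZ,2015-11-01 08:00:00
ZZZ,2015-11-01 09:00:00
ZZZ,2015-11-01 12:00:00
"""
df = pd.read_csv(io.StringIO(raw), parse_dates=["year"])

# Next timestamp within the same name; NaT on the last row of each name.
next_year = df.groupby("name")["year"].shift(-1)

# Hours to the next record, minus the one expected hour, gives the missing hours.
gap = (next_year - df["year"]) / pd.Timedelta(hours=1) - 1
df["Gap"] = gap.fillna(0).astype(int)

print(df)
```

Casting to `int` is optional; it only makes the CSV show `2` rather than `2.0`.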

To get your second CSV, we can do the following:

df['prev_year'] = df['year']
df['next_year'] = df.groupby('name')['year'].shift(-1)

df.set_index('year', inplace=True)
df = df.groupby('name', as_index=False)\
       .resample(rule='1H')\
       .ffill()\
       .reset_index()

gaps = df[df['year'] != df['prev_year']][['name', 'prev_year', 'next_year', 'year']]

# Rename 'year' to 'gaps'; reassigning avoids modifying a filtered slice in place.
gaps = gaps.rename(columns={'year': 'gaps'})

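First we set up the "previous" and "next" columns. Then, by making 'year' the index, we can use the .resample() method to fill in all of the missing hours. Using ffill() when resampling copies the last available record into every new row we add. When 'prev_year' != 'year', we know we are on a row that did not exist in the original frame and is therefore one of the gap hours, so we filter down to those rows, select the columns we need, and rename 'year' to 'gaps'. This gives: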

   name           prev_year           next_year                gaps
5   AAA 2015-11-03 02:00:00 2015-11-03 05:00:00 2015-11-03 03:00:00
6   AAA 2015-11-03 02:00:00 2015-11-03 05:00:00 2015-11-03 04:00:00
9   ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 02:00:00
10  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 03:00:00
11  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 04:00:00
12  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 05:00:00
13  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 06:00:00
17  ZZZ 2015-11-01 09:00:00 2015-11-01 12:00:00 2015-11-01 10:00:00
18  ZZZ 2015-11-01 09:00:00 2015-11-01 12:00:00 2015-11-01 11:00:00
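
If the grouped resample behaves differently on your pandas version, one possible alternative is to list the missing hours directly with `pd.date_range` and then `explode` them into rows. This is only a sketch under the same 11-row assumption as the output above, not the method used in this answer; the names `pairs` and `step` are introduced here for illustration.

```python
import io
import pandas as pd

# Sample rows, assumed to match the answer's output above.
raw = """name,year
AAA,2015-11-02 22:00:00
AAA,2015-11-02 23:00:00
AAA,2015-11-03 00:00:00
AAA,2015-11-03 01:00:00
AAA,2015-11-03 02:00:00
AAA,2015-11-03 05:00:00
ZZZ,2015-11-01 01:00:00
ZZZ,2015-11-01 07:00:00
ZZZ,2015-11-01 08:00:00
ZZZ,2015-11-01 09:00:00
ZZZ,2015-11-01 12:00:00
"""
df = pd.read_csv(io.StringIO(raw), parse_dates=["year"])
step = pd.Timedelta(hours=1)  # expected spacing between consecutive records

# Pair every record with the next record of the same name.
pairs = pd.DataFrame({
    "name": df["name"],
    "prev_year": df["year"],
    "next_year": df.groupby("name")["year"].shift(-1),
})

# Keep only pairs that are more than one step apart (NaT comparisons are False).
pairs = pairs[(pairs["next_year"] - pairs["prev_year"]) > step].copy()

# List the hours strictly between the two timestamps, then one row per hour.
pairs["gaps"] = [
    list(pd.date_range(prev + step, nxt - step, freq=step))
    for prev, nxt in zip(pairs["prev_year"], pairs["next_year"])
]
gaps = pairs.explode("gaps").reset_index(drop=True)
gaps["gaps"] = pd.to_datetime(gaps["gaps"])  # explode leaves an object column

print(gaps)
```

Passing a `Timedelta` as `freq` sidesteps the frequency-alias strings, whose spelling has changed between pandas releases.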

Putting it all together, your script might look like this:

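import pandas as pd

# df is assumed to already hold the 'name' and 'year' columns shown in the question.
df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')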
df['Gap'] = ((df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).shift(-1) - 1).fillna(0)

df.to_csv('csv-1.csv', index=False)

df['prev_year'] = df['year']
df['next_year'] = df.groupby('name')['year'].shift(-1)

df.set_index('year', inplace=True)
df = df.groupby('name', as_index=False)\
       .resample(rule='1H')\
       .ffill()\
       .reset_index()

gaps = df[df['year'] != df['prev_year']][['name', 'prev_year', 'next_year', 'year']]

# Rename 'year' to 'gaps'; reassigning avoids modifying a filtered slice in place.
gaps = gaps.rename(columns={'year': 'gaps'})

gaps.to_csv('csv-2.csv', index=False)

【Discussion】:

@asongtoruin - This is what I expected, but could you also tell me about CSV-2?
