Changing time components of a pandas datetime64 column
[Posted] 2016-01-26 13:16:03
[Question] I have a dataframe that can be simplified to:
date id
0 02/04/2015 02:34 1
1 06/04/2015 12:34 2
2 09/04/2015 23:03 3
3 12/04/2015 01:00 4
4 15/04/2015 07:12 5
5 21/04/2015 12:59 6
6 29/04/2015 17:33 7
7 04/05/2015 10:44 8
8 06/05/2015 11:12 9
9 10/05/2015 08:52 10
10 12/05/2015 14:19 11
11 19/05/2015 19:22 12
12 27/05/2015 22:31 13
13 01/06/2015 11:09 14
14 04/06/2015 12:57 15
15 10/06/2015 04:00 16
16 15/06/2015 03:23 17
17 19/06/2015 05:37 18
18 23/06/2015 13:41 19
19 27/06/2015 15:43 20
It can be created with:
import pandas as pd
# The dict literal needs braces; each key becomes a column.
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"]})
The columns have the following types:
tempDF.dtypes
date object
id int64
dtype: object
I have converted the 'date' variable to pandas datetime64 format (if that is the right way to describe it), using:
import numpy as np
import pandas as pd
tempDF['date'] = pd.to_datetime(tempDF['date'])
So now the dtypes look like:
tempDF.dtypes
date datetime64[ns]
id int64
dtype: object
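The sample strings look day-first (for example `15/04/2015`), so it may be safer to state the format explicitly than to rely on inference. A minimal sketch, assuming the intended format is `%d/%m/%Y %H:%M`; the question does not say so, and the demo output further down appears to have been parsed month-first where that was possible:

```python
import pandas as pd

# Three of the sample strings from the question.
raw = pd.Series(["02/04/2015 02:34", "15/04/2015 07:12", "27/06/2015 15:43"])

# Assumption: the day comes first. An explicit format avoids ambiguous inference.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print(parsed.dt.month.tolist())  # [4, 4, 6]
```

With an explicit format, a string that does not match raises an error instead of being silently read with day and month swapped.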
I want to change the hour of the original date data. I can use .normalize() through the .dt accessor to set the time to midnight:
tempDF['date'] = tempDF['date'].dt.normalize()
I can also access individual datetime components (for example, the year) using:
tempDF['date'].dt.year
This produces:
0 2015
1 2015
2 2015
3 2015
4 2015
5 2015
6 2015
7 2015
8 2015
9 2015
10 2015
11 2015
12 2015
13 2015
14 2015
15 2015
16 2015
17 2015
18 2015
19 2015
Name: date, dtype: int64
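The question is: how can I change specific date and time components? For example, how can I set all of the dates to noon (12:00)? I found that datetime.datetime has a .replace() function. However, having converted the dates to pandas format, it makes sense to keep them in that format. Is there a way to do this without changing the format again?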
[Answer 1] Here is the solution I use to replace the time component of datetime values in a pandas DataFrame. I am not sure how efficient it is, but it meets my needs.
import pandas as pd
# Create a list of end-of-calendar-year (EOCY) dates for a specified period.
sDate = pd.Timestamp('2022-01-31 23:59:00')
eDate = pd.Timestamp('2060-01-31 23:59:00')
# 'Y' is the year-end alias; newer pandas versions spell it 'YE'.
dtList = pd.date_range(sDate, eDate, freq='Y').to_pydatetime()
# Create a DataFrame with a single column called 'Date' holding the EOCY dates.
df = pd.DataFrame({'Date': dtList})
# Loop over the rows and use replace() to reset the hour and minute of each value.
for i in range(df.shape[0]):
    df.iloc[i, 0] = df.iloc[i, 0].replace(hour=0, minute=0)
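The same effect can probably be had without a Python-level loop by subtracting each row's own hour and minute. A sketch on a small hand-written `Date` column (the sample values are made up for illustration, with non-zero seconds to show they are preserved):

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2022-12-31 23:59:30", "2023-12-31 23:59:45"])})

# Subtract each row's hour and minute; seconds are left untouched,
# which mirrors Timestamp.replace(hour=0, minute=0).
hours = pd.to_timedelta(df["Date"].dt.hour, unit="h")
minutes = pd.to_timedelta(df["Date"].dt.minute, unit="min")
vectorized = df["Date"] - hours - minutes

# Row-wise reference result for comparison.
looped = df["Date"].apply(lambda ts: ts.replace(hour=0, minute=0))

print((vectorized == looped).all())  # True
```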
[Answer 2] Edit:
The vectorized way to do this is to normalize the series and then add 12 hours to it using timedelta. Example -
import datetime
tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Demo -
In [59]: tempDF
Out[59]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
In [60]: tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Out[60]:
0 2015-02-04 12:00:00
1 2015-06-04 12:00:00
2 2015-09-04 12:00:00
3 2015-12-04 12:00:00
4 2015-04-15 12:00:00
5 2015-04-21 12:00:00
6 2015-04-29 12:00:00
7 2015-04-05 12:00:00
8 2015-06-05 12:00:00
9 2015-10-05 12:00:00
10 2015-12-05 12:00:00
11 2015-05-19 12:00:00
12 2015-05-27 12:00:00
13 2015-01-06 12:00:00
14 2015-04-06 12:00:00
15 2015-10-06 12:00:00
16 2015-06-15 12:00:00
17 2015-06-19 12:00:00
18 2015-06-23 12:00:00
19 2015-06-27 12:00:00
dtype: datetime64[ns]
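The normalize-then-add pattern generalizes to any time of day. A minimal self-contained sketch; `set_time_of_day` is a hypothetical helper name introduced here for illustration, and `pd.Timedelta` is used in place of `datetime.timedelta` only to avoid the extra import:

```python
import pandas as pd

def set_time_of_day(dates: pd.Series, hour: int, minute: int = 0) -> pd.Series:
    """Hypothetical helper: move every row of a naive datetime64 Series to hour:minute."""
    # normalize() resets each timestamp to midnight; the offset is then added column-wise.
    return dates.dt.normalize() + pd.Timedelta(hours=hour, minutes=minute)

dates = pd.Series(pd.to_datetime(["2015-04-02 02:34", "2015-04-09 23:03"]))
noon = set_time_of_day(dates, 12)

print(noon.dt.hour.tolist())  # [12, 12]
```

The result is a new Series; assign it back (for example `tempDF['date'] = ...`) to change the column.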
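Timing information for both methods is at the bottom.
One way is to use Series.apply together with the .replace() method that the OP mentioned in the post. Example -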
tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
Demo -
In [12]: tempDF
Out[12]:
date id
0 2015-02-04 02:34:00 1
1 2015-06-04 12:34:00 2
2 2015-09-04 23:03:00 3
3 2015-12-04 01:00:00 4
4 2015-04-15 07:12:00 5
5 2015-04-21 12:59:00 6
6 2015-04-29 17:33:00 7
7 2015-04-05 10:44:00 8
8 2015-06-05 11:12:00 9
9 2015-10-05 08:52:00 10
10 2015-12-05 14:19:00 11
11 2015-05-19 19:22:00 12
12 2015-05-27 22:31:00 13
13 2015-01-06 11:09:00 14
14 2015-04-06 12:57:00 15
15 2015-10-06 04:00:00 16
16 2015-06-15 03:23:00 17
17 2015-06-19 05:37:00 18
18 2015-06-23 13:41:00 19
19 2015-06-27 15:43:00 20
In [13]: tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
In [14]: tempDF
Out[14]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
Timing information
In [52]: df = pd.DataFrame([[datetime.datetime.now()] for _ in range(100000)],columns=['date'])
In [54]: %%timeit
....: df['date'].dt.normalize() + datetime.timedelta(hours=12)
....:
The slowest run took 12.53 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 32.3 ms per loop
In [57]: %%timeit
....: df['date'].apply(lambda x:x.replace(hour=12,minute=0))
....:
1 loops, best of 3: 1.09 s per loop
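Outside IPython, a comparable measurement can be sketched with the standard `timeit` module. The row count and repeat count below are arbitrary choices, and the function names are illustrative:

```python
import timeit
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2015-04-02 02:34"] * 10_000)})

def vectorized():
    # Column-wise: reset to midnight, then add 12 hours.
    return df["date"].dt.normalize() + pd.Timedelta(hours=12)

def row_wise():
    # Row-wise: call Timestamp.replace() once per element.
    return df["date"].apply(lambda ts: ts.replace(hour=12, minute=0))

# Timings vary by machine and pandas version, so no numbers are quoted here.
t_vec = timeit.timeit(vectorized, number=5)
t_row = timeit.timeit(row_wise, number=5)
print(f"vectorized: {t_vec:.4f}s  row-wise: {t_row:.4f}s")
```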
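[Comments]
Great answer, thank you. I have been avoiding lambda functions because my dataframes usually contain more than a million rows and I assumed lambda functions would be slow. Perhaps I need to revisit them. Is there a way to do the same thing with a column-based approach rather than row by row?
I found a vectorized method; take a look, I have updated the answer.
Adding a timedelta to timestamps with a time zone and daylight saving time may give unexpected results: (pd.Timestamp('2022-03-27 00:00', tz='CET') + pd.Timedelta(12, unit='h')).hour == 13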
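The daylight saving caveat from the last comment can be reproduced on a Series, together with one possible workaround: drop the zone, set the wall-clock time, then re-attach the zone. This is a sketch rather than a general solution, and it assumes the target wall-clock time exists and is unambiguous in that zone:

```python
import pandas as pd

# One timestamp on the day CET switches to summer time (2022-03-27).
s = pd.Series(pd.to_datetime(["2022-03-27 00:00"])).dt.tz_localize("CET")

# Adding 12 elapsed hours crosses the clock change and lands on 13:00.
shifted = s + pd.Timedelta(hours=12)

# Workaround: compute on naive wall-clock times, then localize again.
wall = s.dt.tz_localize(None).dt.normalize() + pd.Timedelta(hours=12)
noon = wall.dt.tz_localize("CET")

print(shifted.dt.hour.tolist(), noon.dt.hour.tolist())  # [13] [12]
```

If the target time can fall inside a clock change, `tz_localize` raises by default; its `nonexistent` and `ambiguous` parameters control how such cases are handled.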