Finding the time difference between two columns in a DataFrame [duplicate]
【Title】: Finding time difference between two columns in DataFrame [duplicate] 【Posted】: 2016-06-12 13:49:25 【Question】: I am trying to find the time difference between two columns of the following frame:
Test Date | Test Type | First Use Date
I use the following function to get the difference:
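from datetime import datetime

def days_between(d1, d2):
    # Parse both "YYYY-MM-DD" strings, then return the absolute gap in days
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)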
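It works fine, but it does not accept a Series as input, so I had to write a for loop over the index:
age_veh = []
for i in range(0, len(data_manufacturer)-1):
    age_veh[i].append(days_between(data_manufacturer.iloc[i,0], data_manufacturer.iloc[i,4]))
However, it returns an error: IndexError: list index out of range
I don't know whether this is the right approach or what I am doing wrong; alternative solutions would be much appreciated. Please also keep in mind that I have about 2 million rows.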
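A note on the error itself: `age_veh` starts out empty, so `age_veh[i]` fails before `append` is ever called, and `range(0, len(data_manufacturer)-1)` would also stop one row short. The sketch below reproduces the failure and shows a loop-free list build. The `rows` list is a hypothetical stand-in for the asker's `data_manufacturer` frame, not data from the post.

```python
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)

# Hypothetical stand-in rows: (test date, first use date)
rows = [("2011-02-05", "2010-01-05"), ("2012-02-05", "2010-03-05")]

age_veh = []
try:
    # Indexing an empty list raises before .append() can run
    age_veh[0].append(days_between(*rows[0]))
except IndexError as exc:
    error_message = str(exc)

# Appending to the list itself (here via a comprehension) covers every row
age_veh = [days_between(d1, d2) for d1, d2 in rows]
print(error_message, age_veh)
```

This only explains the exception. For roughly 2 million rows a Python-level loop is likely to be slow, which is why the vectorized approach in the answers is preferable.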
【Comments】:
Why not just convert the columns to datetime and then subtract them? df['Test Date'] = pd.to_datetime(df['Test Date']) and so on, then df['Test Date'] - df['First Use Date'] will return a timedelta.
That should work, thanks!
【Answer 1】:
IIUC, you can first convert the columns with to_datetime, subtract them, take abs, and then convert the timedelta to days:
print(df)
  id  value       date1       date2  sum
0  A    150  2014-04-08  2014-03-08  NaN
1  B    100  2014-05-08  2014-02-08  NaN
2  B    200  2014-01-08  2014-07-08  100
3  A    200  2014-04-08  2014-03-08  NaN
4  A    300  2014-06-08  2014-04-08  350
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
df['diff'] = (df['date1'] - df['date2']).abs() / np.timedelta64(1, 'D')
print(df)
  id  value       date1       date2  sum  diff
0  A    150  2014-04-08  2014-03-08  NaN    31
1  B    100  2014-05-08  2014-02-08  NaN    89
2  B    200  2014-01-08  2014-07-08  100   181
3  A    200  2014-04-08  2014-03-08  NaN    31
4  A    300  2014-06-08  2014-04-08  350    61
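The answer prints an existing `df` without showing how it was built. A self-contained sketch of the same steps, assuming the two date columns start out as strings and copying only those columns from the frame shown above:

```python
import numpy as np
import pandas as pd

# Date strings copied from the sample frame; the other columns are not needed here
df = pd.DataFrame({
    "date1": ["2014-04-08", "2014-05-08", "2014-01-08", "2014-04-08", "2014-06-08"],
    "date2": ["2014-03-08", "2014-02-08", "2014-07-08", "2014-03-08", "2014-04-08"],
})

# Convert both columns to datetime64, then subtract to get timedeltas
df["date1"] = pd.to_datetime(df["date1"])
df["date2"] = pd.to_datetime(df["date2"])

# abs() drops the sign; dividing by one day turns the timedeltas into float day counts
df["diff"] = (df["date1"] - df["date2"]).abs() / np.timedelta64(1, "D")
print(df)
```

Because the result of the division is a float column, recent pandas versions will typically display the values with a decimal point.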
Edit:
I think dividing by np.timedelta64(1, 'D') is a better way to convert to days in larger DataFrames than dt.days, because it is faster.
I use EdChum's sample, only with len(df) = 4k:
import io
import pandas as pd
import numpy as np
t=u"""Test Date,Test Type,First Use Date
2011-02-05,A,2010-01-05
2012-02-05,A,2010-03-05
2013-02-05,A,2010-06-05
2014-02-05,A,2010-08-05"""
df = pd.read_csv(io.StringIO(t))
df = pd.concat([df]*1000).reset_index(drop=True)
df['Test Date'] = pd.to_datetime(df['Test Date'])
df['First Use Date'] = pd.to_datetime(df['First Use Date'])
print((df['Test Date'] - df['First Use Date']).abs().dt.days)
print((df['Test Date'] - df['First Use Date']).abs() / np.timedelta64(1, 'D'))
Timings:
In [174]: %timeit (df['Test Date'] - df['First Use Date']).abs().dt.days
10 loops, best of 3: 38.8 ms per loop
In [175]: %timeit (df['Test Date'] - df['First Use Date']).abs() / np.timedelta64(1, 'D')
1000 loops, best of 3: 1.62 ms per loop
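Apart from speed, the two expressions differ in what they return: dt.days gives the whole-day component as integers, while dividing by np.timedelta64(1, 'D') gives floats that keep any fractional part. For date-only columns like the ones in this question the values agree. A small sketch with made-up timestamps (not from the post) to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps; the second pair is 30 days and 12 hours apart
start = pd.to_datetime(pd.Series(["2014-03-08 00:00", "2014-03-08 12:00"]))
end = pd.to_datetime(pd.Series(["2014-04-08 00:00", "2014-04-08 00:00"]))

delta = (end - start).abs()

whole_days = delta.dt.days                         # integer day component
fractional_days = delta / np.timedelta64(1, "D")   # float, keeps the half day

print(whole_days.tolist())
print(fractional_days.tolist())
```

If the real data carries times of day, the choice between the two affects the result as well as the timing.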
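【Answer 2】: Use to_datetime to convert the columns; you can then subtract the columns to produce a timedelta, take its abs value, and call dt.days to get the total number of days, e.g.: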
In [119]:
import io
import pandas as pd
t="""Test Date,Test Type,First Use Date
2011-02-05,A,2010-01-05
2012-02-05,A,2010-03-05
2013-02-05,A,2010-06-05
2014-02-05,A,2010-08-05"""
df = pd.read_csv(io.StringIO(t))
df
Out[119]:
    Test Date Test Type First Use Date
0  2011-02-05         A     2010-01-05
1  2012-02-05         A     2010-03-05
2  2013-02-05         A     2010-06-05
3  2014-02-05         A     2010-08-05
In [121]:
df['Test Date'] = pd.to_datetime(df['Test Date'])
df['First Use Date'] = pd.to_datetime(df['First Use Date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
Test Date         4 non-null datetime64[ns]
Test Type         4 non-null object
First Use Date    4 non-null datetime64[ns]
dtypes: datetime64[ns](2), object(1)
memory usage: 128.0+ bytes
In [122]:
df['days'] = (df['Test Date'] - df['First Use Date']).abs().dt.days
df
Out[122]:
   Test Date Test Type First Use Date  days
0 2011-02-05         A     2010-01-05   396
1 2012-02-05         A     2010-03-05   702
2 2013-02-05         A     2010-06-05   976
3 2014-02-05         A     2010-08-05  1280
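For the original use case, the same steps can be wrapped in a small helper. This is a sketch only: `days_between_columns` and its `fmt` parameter are hypothetical names, and passing an explicit `format` to `pd.to_datetime` is an assumption that the columns are uniformly "YYYY-MM-DD" strings (it may also speed up parsing on a frame with millions of rows, though that is not measured here).

```python
import io
import pandas as pd

def days_between_columns(df, start_col, end_col, fmt="%Y-%m-%d"):
    """Return the absolute whole-day difference between two date columns."""
    start = pd.to_datetime(df[start_col], format=fmt)
    end = pd.to_datetime(df[end_col], format=fmt)
    return (end - start).abs().dt.days

# Same sample data as in the answer
csv_text = """Test Date,Test Type,First Use Date
2011-02-05,A,2010-01-05
2012-02-05,A,2010-03-05
2013-02-05,A,2010-06-05
2014-02-05,A,2010-08-05"""
df = pd.read_csv(io.StringIO(csv_text))

df["days"] = days_between_columns(df, "First Use Date", "Test Date")
print(df)
```

The helper leaves the original string columns untouched and only adds the computed column, which keeps it safe to call on a frame that other code still expects to contain strings.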