Pandas: how to merge two dataframes on offset dates?
【Title】: Pandas: how to merge two dataframes on offset dates? 【Posted】: 2015-08-09 23:14:06 【Question】: I want to merge two dataframes, df1 and df2, based on whether a df2 row falls within a 3-6 month date range after a df1 row. For example:
df1 (for each company I have quarterly data):
company DATADATE
0 012345 2005-06-30
1 012345 2005-09-30
2 012345 2005-12-31
3 012345 2006-03-31
4 123456 2005-01-31
5 123456 2005-03-31
6 123456 2005-06-30
7 123456 2005-09-30
df2 (for each company I have event dates that can fall on any day):
company EventDate
0 012345 2005-07-28 <-- won't get merged b/c not within date range
1 012345 2005-10-12
2 123456 2005-05-15
3 123456 2005-05-17
4 123456 2005-05-25
5 123456 2005-05-30
6 123456 2005-08-08
7 123456 2005-11-29
8 abcxyz 2005-12-31 <-- won't be merged because company not in df1
Ideal merged df -- rows of df2 whose EventDate falls 3-6 months (i.e. 1 quarter) after the DATADATE of a df1 row get merged:
company DATADATE EventDate
0 012345 2005-06-30 2005-10-12
1 012345 2005-09-30 NaN <-- nan because no EventDates fell in this range
2 012345 2005-12-31 NaN
3 012345 2006-03-31 NaN
4 123456 2005-01-31 2005-05-15
5 123456 2005-01-31 2005-05-17
5 123456 2005-01-31 2005-05-25
5 123456 2005-01-31 2005-05-30
6 123456 2005-03-31 2005-08-08
7 123456 2005-06-30 2005-11-29
8 123456 2005-09-30 NaN
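I am trying to apply this related thread [Merge pandas DataFrames based on irregular time intervals] by adding start_time and end_time columns to df1, denoting 3 months (start_time) to 6 months (end_time) after DATADATE, and then using np.searchsorted(). This case is a bit trickier, though, because I want to merge company by company.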
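【Answer 1】: This is actually one of those rare questions where the algorithmic complexity can differ significantly between solutions. You may want to weigh that more heavily than the slickness of 1-liner snippets.
Algorithm:
Sort the larger dataframe by date.
For each date in the smaller dataframe, use the bisect module to find the relevant rows in the larger dataframe.
For dataframes of lengths m and n (m < n), the complexity of the lookups should be O(m log(n)), on top of the one-time O(n log(n)) sort.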
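The lookup step above can be sketched with plain Python lists. This is a minimal illustration, not the answerer's code: the helper name events_in_window and the sample dates are assumptions, and the window is treated as exclusive at both ends.

```python
import bisect
from datetime import date

# Hypothetical event dates standing in for the larger frame, sorted once up front.
event_dates = sorted([
    date(2005, 5, 15), date(2005, 5, 17), date(2005, 5, 25),
    date(2005, 5, 30), date(2005, 8, 8), date(2005, 11, 29),
])

def events_in_window(sorted_dates, start, end):
    """Return the dates d with start < d < end using two binary searches."""
    lo = bisect.bisect_right(sorted_dates, start)  # first index with date > start
    hi = bisect.bisect_left(sorted_dates, end)     # first index with date >= end
    return sorted_dates[lo:hi]

# Window for a DATADATE of 2005-01-31: 3 to 6 month-ends later.
window = events_in_window(event_dates, date(2005, 4, 30), date(2005, 7, 31))
print(window)
```

Each call costs two O(log n) searches plus the size of the returned slice, which is where the O(m log(n)) figure for m lookups comes from.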
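【Comments】:
I implemented the steps you provided and posted my code above. It works, although it takes a long time on my large dataset. I originally hoped I could use pandas groupby to group df1 by ['company','DATADATE'] and then groupby.apply() a function that, for each row in df1, fetches the df2 rows whose EventDate falls between start_time and end_time (i.e. 3-6 months after DATADATE).
That's interesting. When I have time, I'd actually be happy to take a closer look at your answer.
【Answer 2】: Here is my solution using the algorithm suggested by Ami Tavory: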
import bisect
import pandas as pd

# assumes DATADATE and EventDate are already datetime64 columns
# find the date offsets that define each date range and store them as extra columns
df1['start_time'] = df1.DATADATE + pd.offsets.MonthEnd(3)
df1['end_time'] = df1.DATADATE + pd.offsets.MonthEnd(6)
# find unique company names in both dfs
unique_companies_df1 = df1.company.unique()
unique_companies_df2 = set(df2.company.unique())
# sort df1 by company and DATADATE, so we can iterate in a sensible order
sorted_df1 = df1.sort_values(['company', 'DATADATE']).reset_index(drop=True)
# collect the per-quarter pieces here and concatenate them once at the end
pieces = []
# iterate through each company in df1, find
# that company in sorted df2, then for each
# DATADATE quarter of df1, bisect df2 in the
# correct locations (i.e. start_time to end_time)
for cmpny in unique_companies_df1:
    # companies that appear only in df1 are skipped, as in the original loop
    if cmpny in unique_companies_df2:
        selected_df2 = df2[df2.company == cmpny].sort_values('EventDate').reset_index(drop=True)
        selected_df1 = sorted_df1[sorted_df1.company == cmpny].reset_index(drop=True)
        event_dates = selected_df2.EventDate.tolist()  # plain sorted list for bisect
        for quarter in range(len(selected_df1)):  # iterate through each DATADATE quarter in df1
            # bisect_right so that dates up to and including start_time are excluded
            lo = bisect.bisect_right(event_dates, selected_df1.start_time[quarter])
            # bisect_left so that dates from end_time onward are excluded
            hi = bisect.bisect_left(event_dates, selected_df1.end_time[quarter])
            # positional slice lo..hi-1; label-based .loc[lo:hi] would also include row hi
            df_right = selected_df2.iloc[lo:hi]
            df_left = selected_df1.iloc[[quarter]]
            if len(df_right) == 0:
                # no EventDates in range: merge against a single row holding NaT instead
                df_right = pd.DataFrame({'company': [cmpny], 'EventDate': [pd.NaT]})
            # merge the df1 company quarter with all df2 rows that fell within the date range
            pieces.append(pd.merge(df_left, df_right, how='inner', on='company'))
df3 = pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame()
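For comparison, here is a sketch of a different technique from the bisect loop above: merge on company first, then filter on the date window. It is not from either answer. The sample frames are a trimmed copy of the question's data, and the strict inequalities are an assumption chosen to match the bisect_right/bisect_left bounds.

```python
import pandas as pd

# Trimmed sample data based on the question (assumed to be datetime64 columns).
df1 = pd.DataFrame({
    "company": ["012345", "012345", "123456", "123456"],
    "DATADATE": pd.to_datetime(
        ["2005-06-30", "2005-09-30", "2005-01-31", "2005-03-31"]),
})
df2 = pd.DataFrame({
    "company": ["012345", "012345", "123456", "123456", "123456", "abcxyz"],
    "EventDate": pd.to_datetime(
        ["2005-07-28", "2005-10-12", "2005-05-15",
         "2005-05-30", "2005-08-08", "2005-12-31"]),
})

# Window bounds: 3 and 6 month-ends after each DATADATE.
df1["start_time"] = df1["DATADATE"] + pd.offsets.MonthEnd(3)
df1["end_time"] = df1["DATADATE"] + pd.offsets.MonthEnd(6)

# Pair every df1 row with every event of the same company, then keep in-window pairs.
pairs = df1.merge(df2, on="company", how="inner")
in_window = pairs[(pairs["EventDate"] > pairs["start_time"])
                  & (pairs["EventDate"] < pairs["end_time"])]

# Left-merge back onto df1 so quarters without events survive with NaT in EventDate.
result = (
    df1[["company", "DATADATE"]]
    .merge(in_window[["company", "DATADATE", "EventDate"]],
           on=["company", "DATADATE"], how="left")
    .sort_values(["company", "DATADATE", "EventDate"])
    .reset_index(drop=True)
)
print(result)
```

The trade-off is memory: the intermediate pairs frame holds every quarter-event combination per company, so this form suits moderate data sizes, while the bisect loop avoids building that product.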