How to add rows for a timeseries dataframe?
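Posted: 2021-04-07 12:32:16
Question: I am writing a program that loads a time-series Excel file into a dataframe and then creates several new columns using some basic calculations. Sometimes the program reads Excel files in which some records are missing a few months. In the example below, I have monthly sales data for two different stores. The stores opened in different months, so their first month-end dates differ, but both should have month-end data through 9/30/2020. In my file, Store BBB has no records for 8/31/2020 and 9/30/2020 because there were no sales in those months.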
Store | Month Opened | State | City | Month End Date | Sales |
---|---|---|---|---|---|
AAA | 5/31/2020 | NY | New York | 5/31/2020 | 1000 |
AAA | 5/31/2020 | NY | New York | 6/30/2020 | 5000 |
AAA | 5/31/2020 | NY | New York | 7/30/2020 | 3000 |
AAA | 5/31/2020 | NY | New York | 8/31/2020 | 4000 |
AAA | 5/31/2020 | NY | New York | 9/30/2020 | 2000 |
BBB | 6/30/2020 | CT | Hartford | 6/30/2020 | 100 |
BBB | 6/30/2020 | CT | Hartford | 7/30/2020 | 200 |
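For any case like this, I want to be able to add two rows for Store BBB, one for 8/31 and one for 9/30. The new rows should use the same "Month Opened", "State", and "City" as the most recent month-end record. Sales should be set to 0 for both new rows. At the moment I perform the following steps:
- Create a dataframe "MaxDateData" with the store name, the maximum Month End Date for each store, and the maximum Month End Date across the whole time-series dataframe; I named this last field "Most Recent Date".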
Store | Max Month End Date | Most Recent Date |
---|---|---|
AAA | 9/30/2020 | 9/30/2020 |
BBB | 7/30/2020 | 9/30/2020 |
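- Create a dataframe "MostRecent" holding the most recent row for each store from the main time-series dataframe. To do this, I inner join the time-series dataframe to MaxDateData on the store name and the maximum Month End Date.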
Store | Month Opened | State | City | Month End Date | Sales | Max Month End Date | Most Recent Date |
---|---|---|---|---|---|---|---|
AAA | 5/31/2020 | NY | New York | 9/30/2020 | 2000 | 9/30/2020 | 9/30/2020 |
BBB | 6/30/2020 | CT | Hartford | 7/30/2020 | 200 | 7/30/2020 | 9/30/2020 |
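- Create a dataframe "RequireBackfill_MostRecent", using a where clause to keep only the stores whose Max Month End Date is earlier than the Most Recent Date: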
RequireBackfill_Stores_MostRecent = MaxDateData.where(MaxDateData['Max Month End Date'] <MaxDateData['Most Recent Date'])
RequireBackfill_MostRecent = MostRecent.merge(RequireBackfill_Stores_MostRecent,how='inner')
- I then loop over the dates that need to be filled in, using two nested for loops. This uses the RequireBackfill_MostRecent dataframe, which contains only Store BBB.
X = []
end = MaxDateData['Most Recent Date'][0]
for i in MonthlyData['Month End Date'].unique():
    per1 = pd.date_range(start=i, end=end, freq='M')
    for val in per1:
        Data = []
        Data = RequireBackfill_MostRecent[["Store"
                                           ,"Month Opened"
                                           ,"City"
                                           ,"State"
                                           ]].where(RequireBackfill_MostRecent['Max Month End Date'] == i).dropna()
        Data["Month End Date"] = val
        Data["Sales"] = 0
        X.append(Data)
NewData = pd.concat(X)
- I then use concat to append NewData to my time-series dataframe:
FullData_List = [MonthlyData,NewData]
FullData=pd.concat(FullData_List)
The whole process works, but is there a more efficient way to do this? It could become expensive once I start working with larger data.
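For reference, the first steps described in the question (building MaxDateData, MostRecent, and the list of stores that need backfilling) might be sketched roughly as follows. This is only an illustration under assumptions: the sample rows are a shortened copy of the question's table, and the name RequireBackfill_Stores is hypothetical. It uses a boolean mask instead of .where, which drops non-matching rows rather than turning them into NaN.

```python
import pandas as pd

# Assumed sample data, shortened from the table in the question
MonthlyData = pd.DataFrame({
    "Store": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Month End Date": pd.to_datetime(
        ["7/30/2020", "8/31/2020", "9/30/2020", "6/30/2020", "7/30/2020"]
    ),
    "Sales": [3000, 4000, 2000, 100, 200],
})

# Latest month-end per store, plus the latest month-end in the whole frame
MaxDateData = (
    MonthlyData.groupby("Store", as_index=False)["Month End Date"].max()
    .rename(columns={"Month End Date": "Max Month End Date"})
)
MaxDateData["Most Recent Date"] = MonthlyData["Month End Date"].max()

# Most recent row per store: inner join on store and its latest month-end
MostRecent = MonthlyData.merge(
    MaxDateData,
    left_on=["Store", "Month End Date"],
    right_on=["Store", "Max Month End Date"],
    how="inner",
)

# Stores whose data stops before the overall latest date (hypothetical name)
RequireBackfill_Stores = MaxDateData[
    MaxDateData["Max Month End Date"] < MaxDateData["Most Recent Date"]
]
print(RequireBackfill_Stores)
```

With this sample, only Store BBB should remain in RequireBackfill_Stores.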
Answer 1: Here is a step-by-step way to do this. Let me know if you have any questions.
import pandas as pd
pd.set_option('display.max_columns', None)
c = ['Store','Month Opened','State','City','Month End Date','Sales']
d = [['AAA','5/31/2020','NY','New York','5/31/2020',1000],
['AAA','5/31/2020','NY','New York','6/30/2020',5000],
['AAA','5/31/2020','NY','New York','7/30/2020',3000],
['AAA','5/31/2020','NY','New York','8/31/2020',4000],
['AAA','5/31/2020','NY','New York','9/30/2020',2000],
['BBB','6/30/2020','CT','Hartford','6/30/2020',100],
['BBB','6/30/2020','CT','Hartford','7/30/2020',200],
['CCC','3/31/2020','NJ','Cranbury','3/31/2020',1500]]
df = pd.DataFrame(d,columns = c)
df['Month Opened'] = pd.to_datetime(df['Month Opened'])
df['Month End Date'] = pd.to_datetime(df['Month End Date'])
#select last entry for each Store
df1 = df.sort_values('Month End Date').drop_duplicates('Store', keep='last').copy()
#delete all rows that have 2020-09-30. We want only ones that are less than 2020-09-30
df1 = df1[df1['Month End Date'] != '2020-09-30']
#set target end date to 2020-09-30
df1['Target_End_Date'] = pd.to_datetime ('2020-09-30')
#calculate how many rows to repeat
df1['repeats'] = df1['Target_End_Date'].dt.to_period('M').astype('int64') - df1['Month End Date'].dt.to_period('M').astype('int64')
#add 1 month to month end so we can start repeating from here
df1['Month End Date'] = df1['Month End Date'] + pd.DateOffset(months =1)
#set sales value as 0 per requirement
df1['Sales'] = 0
#repeat each row by the value in column repeats
df1 = df1.loc[df1.index.repeat(df1.repeats)].reset_index(drop=True)
#reset repeats to start from 0 thru n using groupby cumcount
#this will be used to calculate months to increment from month end date
df1['repeats'] = df1.groupby('Store').cumcount()
#update month end date based on value in repeats
df1['Month End Date'] = df1.apply(lambda x: x['Month End Date'] + pd.DateOffset(months = x['repeats']), axis=1)
#set end date to last day of the month
df1['Month End Date'] = pd.to_datetime(df1['Month End Date']) + pd.offsets.MonthEnd(0)
#drop columns that we don't need anymore. required before we concat dfs
df1.drop(columns=['Target_End_Date','repeats'],inplace=True)
#concat df and df1 to get the final dataframe
df = pd.concat([df, df1], ignore_index=True)
#sort values by Store and Month End Date
df = df.sort_values(by=['Store','Month End Date'],ignore_index=True)
print (df)
The output of this is:
Store Month Opened State City Month End Date Sales
0 AAA 2020-05-31 NY New York 2020-05-31 1000
1 AAA 2020-05-31 NY New York 2020-06-30 5000
2 AAA 2020-05-31 NY New York 2020-07-30 3000
3 AAA 2020-05-31 NY New York 2020-08-31 4000
4 AAA 2020-05-31 NY New York 2020-09-30 2000
5 BBB 2020-06-30 CT Hartford 2020-06-30 100
6 BBB 2020-06-30 CT Hartford 2020-07-30 200
7 BBB 2020-06-30 CT Hartford 2020-08-31 0
8 BBB 2020-06-30 CT Hartford 2020-09-30 0
9 CCC 2020-03-31 NJ Cranbury 2020-03-31 1500
10 CCC 2020-03-31 NJ Cranbury 2020-04-30 0
11 CCC 2020-03-31 NJ Cranbury 2020-05-31 0
12 CCC 2020-03-31 NJ Cranbury 2020-06-30 0
13 CCC 2020-03-31 NJ Cranbury 2020-07-31 0
14 CCC 2020-03-31 NJ Cranbury 2020-08-31 0
15 CCC 2020-03-31 NJ Cranbury 2020-09-30 0
Note that I added an entry for CCC to show you more variation.
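The same repeat-and-increment idea can also be wrapped in a function that takes the target month from the data instead of hardcoding 2020-09-30, and that avoids the row-wise apply. The following is a minimal sketch under assumptions, not the answer author's code: the function name backfill_zero_sales, its parameters, the helper column _m, and the sample frame are all hypothetical.

```python
import pandas as pd

def backfill_zero_sales(df, store="Store", date="Month End Date", sales="Sales"):
    """Append zero-sales month-end rows so every store runs to the latest month in df."""
    # Running month number, so month differences become plain integer arithmetic
    month_no = df[date].dt.year * 12 + df[date].dt.month - 1
    target = month_no.max()
    # Last known row per store, tagged with its month number in a helper column
    last = df.assign(_m=month_no).sort_values(date).drop_duplicates(store, keep="last")
    gap = target - last["_m"]  # number of missing months per store
    new = last.loc[last.index.repeat(gap)].reset_index(drop=True)
    if new.empty:
        return df.sort_values([store, date], ignore_index=True)
    # 1, 2, ..., n months after each store's last record
    new["_m"] = new["_m"] + new.groupby(store).cumcount() + 1
    parts = pd.DataFrame({"year": new["_m"] // 12, "month": new["_m"] % 12 + 1, "day": 1})
    # The first of the month rolled forward to the true month end
    new[date] = pd.to_datetime(parts) + pd.offsets.MonthEnd(0)
    new[sales] = 0
    out = pd.concat([df, new.drop(columns="_m")], ignore_index=True)
    return out.sort_values([store, date], ignore_index=True)

# Assumed sample, shortened from the question's data
sample = pd.DataFrame({
    "Store": ["AAA", "AAA", "BBB"],
    "Month Opened": pd.to_datetime(["5/31/2020", "5/31/2020", "6/30/2020"]),
    "State": ["NY", "NY", "CT"],
    "City": ["New York", "New York", "Hartford"],
    "Month End Date": pd.to_datetime(["8/31/2020", "9/30/2020", "7/30/2020"]),
    "Sales": [4000, 2000, 200],
})
filled = backfill_zero_sales(sample)
print(filled)
```

Existing rows are left as they are (including the 7/30/2020 record); only the generated rows are placed on true month ends.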
Answer 2:
- Just try upsampling on a DateTime index. Reference: pandas-resample-upsample-last-date-edge-of-data
# group by `Store`
# the `Month End Date` column should be converted to DateTime first
group.set_index(['Month End Date']).resample('M').asfreq()
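- Note: 7/30/2020 is not the last day of July; 7/31/2020 is. Using 7/30/2020 with this method will therefore cause problems, so convert the month-end dates to the true last day of the month first.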
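A fuller sketch of this per-store upsampling idea is given below, under assumptions. Because resample only spans each group's own date range, this version swaps in reindex over an explicit pd.date_range so that every store is extended to the overall last date; that is a substitute for resample(...).asfreq(), not the exact call from the answer. The sample data and the helper name upsample_store are hypothetical.

```python
import pandas as pd

# Assumed sample, shortened from the question's data
df = pd.DataFrame({
    "Store": ["AAA", "AAA", "AAA", "BBB"],
    "State": ["NY", "NY", "NY", "CT"],
    "Month End Date": pd.to_datetime(["7/30/2020", "8/31/2020", "9/30/2020", "7/30/2020"]),
    "Sales": [3000, 4000, 2000, 200],
})

# Snap every date to the true month end first (7/30/2020 becomes 7/31/2020)
df["Month End Date"] = df["Month End Date"] + pd.offsets.MonthEnd(0)
last_date = df["Month End Date"].max()

def upsample_store(group):
    """Extend one store's rows to last_date on a month-end index (hypothetical helper)."""
    g = group.set_index("Month End Date")
    full_index = pd.date_range(g.index.min(), last_date, freq=pd.offsets.MonthEnd())
    g = g.reindex(full_index)
    g["Sales"] = g["Sales"].fillna(0).astype(int)  # missing months get 0 sales
    g = g.ffill()                                  # carry Store, State, etc. forward
    return g.rename_axis("Month End Date").reset_index()

filled = pd.concat(
    [upsample_store(g) for _, g in df.groupby("Store")],
    ignore_index=True,
)
print(filled)
```

Looping over the groups and concatenating keeps the Store column inside each piece; groupby(...).apply could be used instead, at the cost of handling the extra group index.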