Create New Dataframe Using Multiple Conditions Across Different Timeline and Location
Question (posted 2021-07-30 20:46:42):
I have a tricky problem with the following dataframe:
Disease State Month Value
Covid Texas 2020-03 2
Covid Texas 2020-04 3
Covid Texas 2020-05 4
Covid Texas 2020-08 3
Cancer Florida 2020-04 4
Covid Florida 2020-03 6
Covid Florida 2020-04 4
Flu Florida 2020-03 5
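I need to list the values for 3 consecutive months and create a new dataframe. However, there are some conditions:
The lists are created for every disease, every month (from start to end: Feb 2020 to Apr 2021), and every state.
If a particular month is missing from the dataset, a row is created for that month with a value of 0.
Desired output: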
Disease State Month ValueList
Covid Texas 2020-02 [0, 2, 3] (no dataset for Feb 20 but next two months are)
Covid Texas 2020-03 [2, 3, 4] (has values for 3 consecutive months)
Covid Texas 2020-04 [3, 4, 0] (doesn’t have value for 6th month)
Covid Texas 2020-05 [4, 0, 0] (has value for present month)
Covid Texas 2020-06 [0, 0, 3] (has value for 8th month)
Covid Texas 2020-07 [0, 3, 0] (has value for 8th month)
Covid Texas 2020-08 [3, 0, 0] (has value for present month)
Covid Texas 2020-09 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2020-10 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2020-11 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2020-12 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2021-01 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2021-02 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2021-03 [0, 0, 0] (no dataset for next 3 months)
Covid Texas 2021-04 [0, 0, 0] (no dataset for next 3 months)
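I am trying to use this to fill in the dates:
df3 = (df2.set_index('MonthEnd')
          .groupby(['Disease', 'State']).apply(lambda x: x.drop(['Disease', 'State'], axis=1).asfreq('D'))
          .reset_index())
However, it does not return the same time range for every group. It only returns values between the minimum and maximum dates within each group.
I am not sure how I should start. Any help would be appreciated. Thanks!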
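The behaviour described in the question can be reproduced on a tiny group. The sketch below is only an illustration: the two-row frame `grp` is made up, and monthly frequency ("MS") is assumed in place of the daily frequency used in the question. It shows that asfreq() stays within the group's own first and last dates, while reindex() against one shared date_range() gives every group the same span.

```python
import pandas as pd

# Hypothetical two-row frame standing in for one (Disease, State) group.
grp = pd.DataFrame(
    {"Value": [2, 3]},
    index=pd.to_datetime(["2020-03-01", "2020-05-01"]),
)

# asfreq only fills gaps between the group's own first and last date.
by_asfreq = grp.asfreq("MS")

# Reindexing against one shared range gives every group the same span.
full_range = pd.date_range("2020-02", "2021-04", freq="MS")
by_reindex = grp.reindex(full_range, fill_value=0)

print(len(by_asfreq), len(by_reindex))
```

Here `by_asfreq` covers only March to May 2020, while `by_reindex` covers all months from Feb 2020 to Apr 2021.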
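Comments:
Look at groupby() and ***.com/questions/19324453/… for filling in missing dates. You can use apply() with groupby() to add the missing dates for each group. Once the data is grouped and the dates are added, you have to iterate and select every three rows: use something like df.Value.tolist() for your ValueList column.
@JonathanLeon: Thanks for your reply. Could you share other examples as well?
Unfortunately, not at the moment. You have many individual questions within your question. Start by searching for groupby and apply, and learn how to iterate and apply functions. I suggest you try it yourself and ask a question about each part of the process, showing what you have tried. People are more inclined to help fix code than to simply provide it.
I have added the logic. There may be better solutions than the one I provided, but the logic will stay the same.
Answer 1:
Let's start with the simple logic. Basically, you want to create a date range from Feb 2020 to Apr 2021 for every group.
Take each group and add this date range by reindexing. Once the date range is added, fill in the data, then apply a rolling function to get 3 consecutive values (the previous values and the current one) and convert each window to a list.
I assign this list of lists to the ValueList column.
Then I add all of the modified groups to one dataframe.
Solution: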
import pandas as pd

df.Month = pd.to_datetime(df.Month, format="%Y-%m")
df.set_index('Month', inplace=True)

def add_elem(li):
    # Left-pad with zeros when the rolling window has fewer than 3 rows
    # (this happens for the first two rows of each group).
    n = 3 - len(li)
    if n > 0:
        li = [0] * n + li
    return li

start = '2020-02'
end = '2021-04'
groups = []
for i, grp in df.groupby(['Disease', 'State']):
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    groups.append(grp)
# DataFrame.append was removed in pandas 2.0, so concatenate the groups instead.
data = pd.concat(groups)
Or, using apply:
def fill_date(grp):
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    return grp

data = df.groupby(['Disease', 'State'], as_index=False).apply(fill_date)
data:
(index) | Disease | State | Value | ValueList |
---|---|---|---|---|
2020-02-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-03-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-04-01 | Cancer | Florida | 4 | [0, 0, 4] |
2020-05-01 | Cancer | Florida | 0 | [0, 4, 0] |
2020-06-01 | Cancer | Florida | 0 | [4, 0, 0] |
2020-07-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-08-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-09-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-10-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-11-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-12-01 | Cancer | Florida | 0 | [0, 0, 0] |
2021-01-01 | Cancer | Florida | 0 | [0, 0, 0] |
2021-02-01 | Cancer | Florida | 0 | [0, 0, 0] |
2021-03-01 | Cancer | Florida | 0 | [0, 0, 0] |
2021-04-01 | Cancer | Florida | 0 | [0, 0, 0] |
2020-02-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-03-01 | Covid | Florida | 6 | [0, 0, 6] |
2020-04-01 | Covid | Florida | 4 | [0, 6, 4] |
2020-05-01 | Covid | Florida | 0 | [6, 4, 0] |
2020-06-01 | Covid | Florida | 0 | [4, 0, 0] |
2020-07-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-08-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-09-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-10-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-11-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-12-01 | Covid | Florida | 0 | [0, 0, 0] |
2021-01-01 | Covid | Florida | 0 | [0, 0, 0] |
2021-02-01 | Covid | Florida | 0 | [0, 0, 0] |
2021-03-01 | Covid | Florida | 0 | [0, 0, 0] |
2021-04-01 | Covid | Florida | 0 | [0, 0, 0] |
2020-02-01 | Covid | Texas | 0 | [0, 0, 0] |
2020-03-01 | Covid | Texas | 2 | [0, 0, 2] |
2020-04-01 | Covid | Texas | 3 | [0, 2, 3] |
2020-05-01 | Covid | Texas | 4 | [2, 3, 4] |
2020-06-01 | Covid | Texas | 0 | [3, 4, 0] |
2020-07-01 | Covid | Texas | 0 | [4, 0, 0] |
2020-08-01 | Covid | Texas | 3 | [0, 0, 3] |
2020-09-01 | Covid | Texas | 0 | [0, 3, 0] |
2020-10-01 | Covid | Texas | 0 | [3, 0, 0] |
2020-11-01 | Covid | Texas | 0 | [0, 0, 0] |
2020-12-01 | Covid | Texas | 0 | [0, 0, 0] |
2021-01-01 | Covid | Texas | 0 | [0, 0, 0] |
2021-02-01 | Covid | Texas | 0 | [0, 0, 0] |
2021-03-01 | Covid | Texas | 0 | [0, 0, 0] |
2021-04-01 | Covid | Texas | 0 | [0, 0, 0] |
2020-02-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-03-01 | Flu | Florida | 5 | [0, 0, 5] |
2020-04-01 | Flu | Florida | 0 | [0, 5, 0] |
2020-05-01 | Flu | Florida | 0 | [5, 0, 0] |
2020-06-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-07-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-08-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-09-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-10-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-11-01 | Flu | Florida | 0 | [0, 0, 0] |
2020-12-01 | Flu | Florida | 0 | [0, 0, 0] |
2021-01-01 | Flu | Florida | 0 | [0, 0, 0] |
2021-02-01 | Flu | Florida | 0 | [0, 0, 0] |
2021-03-01 | Flu | Florida | 0 | [0, 0, 0] |
2021-04-01 | Flu | Florida | 0 | [0, 0, 0] |
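Note that the rolling window in the output above looks backwards (the two previous months plus the current one), whereas the desired output in the question lists the current month followed by the next two. A minimal sketch of a forward-looking variant is given below; `forward_windows` is a hypothetical helper, not part of the answer, and the sample series is the zero-filled Covid/Texas values for 2020-02 through 2020-09.

```python
import pandas as pd

def forward_windows(values, size=3):
    """Return [v[i], ..., v[i+size-1]] for each position, padding the tail with 0."""
    padded = [int(v) for v in values] + [0] * (size - 1)
    return [padded[i:i + size] for i in range(len(values))]

# Zero-filled Covid/Texas values for 2020-02 .. 2020-09.
value = pd.Series([0, 2, 3, 4, 0, 0, 3, 0])
windows = forward_windows(value)
print(windows[:4])
```

Inside the loop or `fill_date`, the ValueList line could then be replaced by `grp['ValueList'] = forward_windows(grp['Value'])`, assuming the group has already been reindexed and zero-filled.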
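Comments:
Hi @Pygirl, thanks again for the detailed explanation. I am getting TypeError: Passing PeriodDtype data is invalid. Use `data.to_timestamp()` instead
@Roy: See this: ***.com/questions/59316865/…
Answer 2:
You can use pandas.date_range() to generate the list of months between Feb 2020 and Apr 2021.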
dates = pd.date_range('2020-02', '2021-04', freq='MS').strftime('%Y-%m')
Then group by the Disease and State columns and fill in the missing months within each group.
def fill_missing(group):
    # Right-merge on the shared 'Month' column so every month in `dates` gets a row.
    group = group.merge(pd.DataFrame({'Month': dates}), how='right')
    group[['Disease', 'State']] = group[['Disease', 'State']].ffill().bfill()
    group['Value'] = group['Value'].fillna(0)
    # Current month plus the next two; months past the end of the range count as 0.
    group['ValueList'] = [[a, b, c] for a, b, c in zip(
        group['Value'].astype(int),
        group['Value'].shift(-1).fillna(0).astype(int),
        group['Value'].shift(-2).fillna(0).astype(int))]
    return group

df_ = df.groupby(['Disease', 'State']).apply(fill_missing).reset_index(drop=True)
print(df_)
Disease State Month Value ValueList
0 Cancer Florida 2020-02 0.0 [0, 0, 4]
1 Cancer Florida 2020-03 0.0 [0, 4, 0]
2 Cancer Florida 2020-04 4.0 [4, 0, 0]
3 Cancer Florida 2020-05 0.0 [0, 0, 0]
4 Cancer Florida 2020-06 0.0 [0, 0, 0]
5 Cancer Florida 2020-07 0.0 [0, 0, 0]
6 Cancer Florida 2020-08 0.0 [0, 0, 0]
7 Cancer Florida 2020-09 0.0 [0, 0, 0]
8 Cancer Florida 2020-10 0.0 [0, 0, 0]
9 Cancer Florida 2020-11 0.0 [0, 0, 0]
10 Cancer Florida 2020-12 0.0 [0, 0, 0]
11 Cancer Florida 2021-01 0.0 [0, 0, 0]
12 Cancer Florida 2021-02 0.0 [0, 0, 0]
13 Cancer Florida 2021-03 0.0 [0, 0, 0]
14 Cancer Florida 2021-04 0.0 [0, 0, 0]
15 Covid Florida 2020-02 0.0 [0, 6, 4]
16 Covid Florida 2020-03 6.0 [6, 4, 0]
17 Covid Florida 2020-04 4.0 [4, 0, 0]
18 Covid Florida 2020-05 0.0 [0, 0, 0]
19 Covid Florida 2020-06 0.0 [0, 0, 0]
20 Covid Florida 2020-07 0.0 [0, 0, 0]
21 Covid Florida 2020-08 0.0 [0, 0, 0]
22 Covid Florida 2020-09 0.0 [0, 0, 0]
23 Covid Florida 2020-10 0.0 [0, 0, 0]
24 Covid Florida 2020-11 0.0 [0, 0, 0]
25 Covid Florida 2020-12 0.0 [0, 0, 0]
26 Covid Florida 2021-01 0.0 [0, 0, 0]
27 Covid Florida 2021-02 0.0 [0, 0, 0]
28 Covid Florida 2021-03 0.0 [0, 0, 0]
29 Covid Florida 2021-04 0.0 [0, 0, 0]
30 Covid Texas 2020-02 0.0 [0, 2, 3]
31 Covid Texas 2020-03 2.0 [2, 3, 4]
32 Covid Texas 2020-04 3.0 [3, 4, 0]
33 Covid Texas 2020-05 4.0 [4, 0, 0]
34 Covid Texas 2020-06 0.0 [0, 0, 3]
35 Covid Texas 2020-07 0.0 [0, 3, 0]
36 Covid Texas 2020-08 3.0 [3, 0, 0]
37 Covid Texas 2020-09 0.0 [0, 0, 0]
38 Covid Texas 2020-10 0.0 [0, 0, 0]
39 Covid Texas 2020-11 0.0 [0, 0, 0]
40 Covid Texas 2020-12 0.0 [0, 0, 0]
41 Covid Texas 2021-01 0.0 [0, 0, 0]
42 Covid Texas 2021-02 0.0 [0, 0, 0]
43 Covid Texas 2021-03 0.0 [0, 0, 0]
44 Covid Texas 2021-04 0.0 [0, 0, 0]
45 Flu Florida 2020-02 0.0 [0, 5, 0]
46 Flu Florida 2020-03 5.0 [5, 0, 0]
47 Flu Florida 2020-04 0.0 [0, 0, 0]
48 Flu Florida 2020-05 0.0 [0, 0, 0]
49 Flu Florida 2020-06 0.0 [0, 0, 0]
50 Flu Florida 2020-07 0.0 [0, 0, 0]
51 Flu Florida 2020-08 0.0 [0, 0, 0]
52 Flu Florida 2020-09 0.0 [0, 0, 0]
53 Flu Florida 2020-10 0.0 [0, 0, 0]
54 Flu Florida 2020-11 0.0 [0, 0, 0]
55 Flu Florida 2020-12 0.0 [0, 0, 0]
56 Flu Florida 2021-01 0.0 [0, 0, 0]
57 Flu Florida 2021-02 0.0 [0, 0, 0]
58 Flu Florida 2021-03 0.0 [0, 0, 0]
59 Flu Florida 2021-04 0.0 [0, 0, 0]
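As an alternative to filling each group inside apply(), the full grid of (Disease, State, Month) rows can be built first and the look-ahead done with a grouped shift. This is only a sketch under assumptions: the three-row `df` is made-up sample data, Month is assumed to hold 'YYYY-MM' strings, and `how="cross"` needs pandas 1.2 or later.

```python
import pandas as pd

# Made-up sample data with Month stored as 'YYYY-MM' strings.
df = pd.DataFrame({
    "Disease": ["Covid", "Covid", "Flu"],
    "State": ["Texas", "Texas", "Florida"],
    "Month": ["2020-03", "2020-04", "2020-03"],
    "Value": [2, 3, 5],
})
months = pd.date_range("2020-02", "2021-04", freq="MS").strftime("%Y-%m")

# Use only the observed (Disease, State) pairs, so no new combinations appear.
pairs = df[["Disease", "State"]].drop_duplicates()
grid = pairs.merge(pd.DataFrame({"Month": months}), how="cross")

# Attach the known values; months without data become 0.
full = grid.merge(df, how="left", on=["Disease", "State", "Month"])
full["Value"] = full["Value"].fillna(0).astype(int)

# Look ahead within each group; positions past the end of the range become 0.
g = full.groupby(["Disease", "State"])["Value"]
nxt1 = g.shift(-1, fill_value=0)
nxt2 = g.shift(-2, fill_value=0)
full["ValueList"] = [
    list(t) for t in zip(full["Value"].tolist(), nxt1.tolist(), nxt2.tolist())
]
print(full.head())
```

Because the shift is grouped, the last months of one group never pick up values from the next group.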
Comments:
Hi @Ynjxsjmh. Thank you so much. The logic is really impressive. Here I get ValueError: You are trying to merge on period[M] and object columns. If you wish to proceed you should use pd.concat
@Roy Maybe try converting your Month column to string with df['Month'] = df['Month'].astype(str).
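Both errors reported in the comments are consistent with the Month column being stored as period[M] rather than as strings. That is an assumption, since the asker's actual frame is not shown. Under that assumption, a minimal sketch of the two conversions looks like this (the two-row `df` is made up for illustration):

```python
import pandas as pd

# Hypothetical frame whose Month column has dtype period[M].
df = pd.DataFrame({
    "Month": pd.PeriodIndex(["2020-03", "2020-04"], freq="M"),
    "Value": [2, 3],
})

# For the reindex-based answer: periods -> month-start timestamps.
as_timestamp = df["Month"].dt.to_timestamp()

# For the merge-based answer: periods -> 'YYYY-MM' strings.
as_string = df["Month"].astype(str)

print(as_timestamp.dtype, as_string.tolist())
```

Month-start timestamps line up with a date_range built with freq="MS", and the string form lines up with dates produced by strftime('%Y-%m').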