Create New Dataframe Using Multiple Conditions Across Different Timeline and Location

[Posted]: 2021-07-30 20:46:42

[Question]:

I have a tricky problem with the following dataframe:

Disease  State       Month      Value
Covid    Texas     2020-03        2     
Covid    Texas     2020-04        3     
Covid    Texas     2020-05        4      
Covid    Texas     2020-08        3 
Cancer   Florida   2020-04        4     
Covid    Florida   2020-03        6      
Covid    Florida   2020-04        4      
Flu      Florida   2020-03        5         

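I have to list the values for 3 consecutive months and create a new dataframe. However, there are some conditions:

    The lists are to be created for each disease, each month (from start to end: Feb 2020 to Apr 2021), and each state.

    If a particular month is missing from the dataset, a row is created for that month with a value of 0.

Desired output: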

Disease State    Month      ValueList
Covid   Texas    2020-02    [0, 2, 3] (no dataset for Feb 20 but next two months are) 
Covid   Texas    2020-03    [2, 3, 4] (has values for 3 consecutive months)
Covid   Texas    2020-04    [3, 4, 0] (doesn’t have value for 6th month)   
Covid   Texas    2020-05    [4, 0, 0] (has value for present month)
Covid   Texas    2020-06    [0, 0, 3] (has value for 8th month)
Covid   Texas    2020-07    [0, 3, 0] (has value for 8th month)
Covid   Texas    2020-08    [3, 0, 0] (has value for present month)
Covid   Texas    2020-09    [0, 0, 0] (no dataset for next 3 months)  
Covid   Texas    2020-10    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-11    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-12    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-01    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-02    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-03    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-04    [0, 0, 0] (no dataset for next 3 months)

I am trying to fill in the dates using this:

df3= (df2.set_index('MonthEnd')
   .groupby(['Disease', 'State']).apply(lambda x: x.drop(['Disease', 'State'], axis=1).asfreq('D'))
   .reset_index())

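However, it does not return the same time range for every group. It returns the values between the minimum and maximum dates within each group.

I am not sure how I should start. Any help would be appreciated. Thanks!

[Comments]:

Look at groupby() and ***.com/questions/19324453/… to fill in the missing dates. You can use apply() with groupby() to add the missing dates for each group. Once the data is grouped and the dates are added, you have to iterate and pick every three rows: use something like df.Value.tolist() for your ValueList column.

@JonathanLeon: Thanks for your reply. Could you also share some other examples?

Unfortunately, not at the moment. You have many individual questions within your question. Start by searching for groupby and apply, and learn how to iterate and apply functions. I suggest you try it yourself and ask a question about each part of the process, showing what you have tried. People are more inclined to help modify code than to simply provide it.

I have added the logic. There may be better solutions than the one I provided, but the logic will stay the same.

[Answer 1]: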

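Let's start with the simple logic. Basically, you want to create a date range from Feb 2020 to Apr 2021 for each group.

Take each group and add this date range using reindex. Once the date range is added, fill in the data, then apply a rolling window to get 3 consecutive values (the current value and the two previous ones) and convert each window to a list.

Assign these lists to the ValueList column, then combine all of the modified groups into one dataframe.

Solution: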

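import pandas as pd

# Parse the 'YYYY-MM' strings into month-start timestamps and use them as the index
df.Month = pd.to_datetime(df.Month, format="%Y-%m")
df.set_index('Month', inplace=True)

def add_elem(li):
    # Left-pad with zeros when the rolling window holds fewer than 3 values (first two rows)
    n = 3 - len(li)
    if n > 0:
        li = [0] * n + li
    return li

start = '2020-02'
end = '2021-04'

groups = []
for i, grp in df.groupby(['Disease', 'State']):
    # Give every group the same month-start index, Feb 2020 through Apr 2021
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    # Restore the group labels on the newly added rows
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    # Each window holds the current month and the two preceding months
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    groups.append(grp)

# DataFrame.append was removed in pandas 2.0; concatenate the pieces instead
data = pd.concat(groups)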
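
To see why the `add_elem` padding is needed, here is a small sketch of what iterating over a rolling object yields. The short Series is made-up sample data, and iterating over `rolling()` assumes pandas 1.1 or later.

```python
import pandas as pd

# Made-up sample values, standing in for one group's Value column
s = pd.Series([0, 2, 3, 4])

# Iterating a rolling object yields one Series per row; the first
# windows are shorter than 3 because there are not enough earlier rows
windows = [w.to_list() for w in s.rolling(3)]
print(windows)

# Left-padding with zeros brings every window to length 3
padded = [[0] * (3 - len(w)) + w for w in windows]
print(padded)
```

The first two windows come out with one and two elements respectively, which is the case the padding handles.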

Using apply:

def fill_date(grp):
    # Same steps as the loop body above, applied to one group at a time
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    return grp

data = df.groupby(['Disease', 'State'], as_index=False).apply(fill_date)

data:

Disease State Value ValueList
2020-02-01 Cancer Florida 0 [0, 0, 0]
2020-03-01 Cancer Florida 0 [0, 0, 0]
2020-04-01 Cancer Florida 4 [0, 0, 4]
2020-05-01 Cancer Florida 0 [0, 4, 0]
2020-06-01 Cancer Florida 0 [4, 0, 0]
2020-07-01 Cancer Florida 0 [0, 0, 0]
2020-08-01 Cancer Florida 0 [0, 0, 0]
2020-09-01 Cancer Florida 0 [0, 0, 0]
2020-10-01 Cancer Florida 0 [0, 0, 0]
2020-11-01 Cancer Florida 0 [0, 0, 0]
2020-12-01 Cancer Florida 0 [0, 0, 0]
2021-01-01 Cancer Florida 0 [0, 0, 0]
2021-02-01 Cancer Florida 0 [0, 0, 0]
2021-03-01 Cancer Florida 0 [0, 0, 0]
2021-04-01 Cancer Florida 0 [0, 0, 0]
2020-02-01 Covid Florida 0 [0, 0, 0]
2020-03-01 Covid Florida 6 [0, 0, 6]
2020-04-01 Covid Florida 4 [0, 6, 4]
2020-05-01 Covid Florida 0 [6, 4, 0]
2020-06-01 Covid Florida 0 [4, 0, 0]
2020-07-01 Covid Florida 0 [0, 0, 0]
2020-08-01 Covid Florida 0 [0, 0, 0]
2020-09-01 Covid Florida 0 [0, 0, 0]
2020-10-01 Covid Florida 0 [0, 0, 0]
2020-11-01 Covid Florida 0 [0, 0, 0]
2020-12-01 Covid Florida 0 [0, 0, 0]
2021-01-01 Covid Florida 0 [0, 0, 0]
2021-02-01 Covid Florida 0 [0, 0, 0]
2021-03-01 Covid Florida 0 [0, 0, 0]
2021-04-01 Covid Florida 0 [0, 0, 0]
2020-02-01 Covid Texas 0 [0, 0, 0]
2020-03-01 Covid Texas 2 [0, 0, 2]
2020-04-01 Covid Texas 3 [0, 2, 3]
2020-05-01 Covid Texas 4 [2, 3, 4]
2020-06-01 Covid Texas 0 [3, 4, 0]
2020-07-01 Covid Texas 0 [4, 0, 0]
2020-08-01 Covid Texas 3 [0, 0, 3]
2020-09-01 Covid Texas 0 [0, 3, 0]
2020-10-01 Covid Texas 0 [3, 0, 0]
2020-11-01 Covid Texas 0 [0, 0, 0]
2020-12-01 Covid Texas 0 [0, 0, 0]
2021-01-01 Covid Texas 0 [0, 0, 0]
2021-02-01 Covid Texas 0 [0, 0, 0]
2021-03-01 Covid Texas 0 [0, 0, 0]
2021-04-01 Covid Texas 0 [0, 0, 0]
2020-02-01 Flu Florida 0 [0, 0, 0]
2020-03-01 Flu Florida 5 [0, 0, 5]
2020-04-01 Flu Florida 0 [0, 5, 0]
2020-05-01 Flu Florida 0 [5, 0, 0]
2020-06-01 Flu Florida 0 [0, 0, 0]
2020-07-01 Flu Florida 0 [0, 0, 0]
2020-08-01 Flu Florida 0 [0, 0, 0]
2020-09-01 Flu Florida 0 [0, 0, 0]
2020-10-01 Flu Florida 0 [0, 0, 0]
2020-11-01 Flu Florida 0 [0, 0, 0]
2020-12-01 Flu Florida 0 [0, 0, 0]
2021-01-01 Flu Florida 0 [0, 0, 0]
2021-02-01 Flu Florida 0 [0, 0, 0]
2021-03-01 Flu Florida 0 [0, 0, 0]
2021-04-01 Flu Florida 0 [0, 0, 0]
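
Note that `rolling(3)` builds each window from the current month and the two preceding months, whereas the desired output in the question lists the current month followed by the next two. One possible way to get forward-looking windows is to shift the column by -1 and -2, as in this sketch. The helper name `forward_windows` is hypothetical, and the sample values are the Covid/Texas figures from the question after the missing months have been filled with 0.

```python
import pandas as pd

def forward_windows(values: pd.Series, size: int = 3) -> list:
    """For each row, return [current, next, ...] with zeros past the end."""
    # shift(-k) pulls the value k rows ahead into the current row
    shifted = [values.shift(-k, fill_value=0).tolist() for k in range(size)]
    return [list(window) for window in zip(*shifted)]

# Covid/Texas values for 2020-02 through 2020-09, missing months already 0
value = pd.Series([0, 2, 3, 4, 0, 0, 3, 0])
result = forward_windows(value)
print(result)
```

Inside the loop or `fill_date`, the rolling line could then be replaced with `grp['ValueList'] = forward_windows(grp['Value'])`, and `add_elem` would no longer be needed.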

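[Comments]:

Hi @Pygirl, thanks again for the detailed explanation. I am getting TypeError: Passing PeriodDtype data is invalid. Use `data.to_timestamp()` instead

@Roy: Refer to this: ***.com/questions/59316865/…

[Answer 2]:

You can use pandas.date_range() to generate the list of months from Feb 2020 to Apr 2021.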

dates = pd.date_range('2020-02', '2021-04', freq='MS').strftime('%Y-%m')
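
As a quick check of what this call produces, the following sketch rebuilds `dates` and inspects its length and endpoints. With `freq='MS'` (month start), both endpoints are included.

```python
import pandas as pd

# Month-start dates from Feb 2020 through Apr 2021, formatted as 'YYYY-MM' strings
dates = pd.date_range('2020-02', '2021-04', freq='MS').strftime('%Y-%m')

print(len(dates))           # number of months in the range
print(dates[0], dates[-1])  # first and last month
```

The range covers 15 months, which is why every group ends up with 15 rows in the output.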

Then group by the Disease and State columns and fill in the missing months within each group.

def fill_missing(group):
    # Right-merge against the full month list so every month gets a row
    group = group.merge(pd.DataFrame({'Month': dates}), how='right')
    group[['Disease', 'State']] = group[['Disease', 'State']].ffill().bfill()
    group['Value'] = group['Value'].fillna(0)

    # Current month plus the next two; months past the end count as 0
    group['ValueList'] = [
        [a, b, c]
        for a, b, c in zip(
            group['Value'].astype(int),
            group['Value'].shift(-1).fillna(0).astype(int),
            group['Value'].shift(-2).fillna(0).astype(int),
        )
    ]

    return group

df_ = df.groupby(['Disease', 'State']).apply(fill_missing).reset_index(drop=True)
print(df_)

   Disease    State    Month  Value  ValueList
0   Cancer  Florida  2020-02    0.0  [0, 0, 4]
1   Cancer  Florida  2020-03    0.0  [0, 4, 0]
2   Cancer  Florida  2020-04    4.0  [4, 0, 0]
3   Cancer  Florida  2020-05    0.0  [0, 0, 0]
4   Cancer  Florida  2020-06    0.0  [0, 0, 0]
5   Cancer  Florida  2020-07    0.0  [0, 0, 0]
6   Cancer  Florida  2020-08    0.0  [0, 0, 0]
7   Cancer  Florida  2020-09    0.0  [0, 0, 0]
8   Cancer  Florida  2020-10    0.0  [0, 0, 0]
9   Cancer  Florida  2020-11    0.0  [0, 0, 0]
10  Cancer  Florida  2020-12    0.0  [0, 0, 0]
11  Cancer  Florida  2021-01    0.0  [0, 0, 0]
12  Cancer  Florida  2021-02    0.0  [0, 0, 0]
13  Cancer  Florida  2021-03    0.0  [0, 0, 0]
14  Cancer  Florida  2021-04    0.0  [0, 0, 0]
15   Covid  Florida  2020-02    0.0  [0, 6, 4]
16   Covid  Florida  2020-03    6.0  [6, 4, 0]
17   Covid  Florida  2020-04    4.0  [4, 0, 0]
18   Covid  Florida  2020-05    0.0  [0, 0, 0]
19   Covid  Florida  2020-06    0.0  [0, 0, 0]
20   Covid  Florida  2020-07    0.0  [0, 0, 0]
21   Covid  Florida  2020-08    0.0  [0, 0, 0]
22   Covid  Florida  2020-09    0.0  [0, 0, 0]
23   Covid  Florida  2020-10    0.0  [0, 0, 0]
24   Covid  Florida  2020-11    0.0  [0, 0, 0]
25   Covid  Florida  2020-12    0.0  [0, 0, 0]
26   Covid  Florida  2021-01    0.0  [0, 0, 0]
27   Covid  Florida  2021-02    0.0  [0, 0, 0]
28   Covid  Florida  2021-03    0.0  [0, 0, 0]
29   Covid  Florida  2021-04    0.0  [0, 0, 0]
30   Covid    Texas  2020-02    0.0  [0, 2, 3]
31   Covid    Texas  2020-03    2.0  [2, 3, 4]
32   Covid    Texas  2020-04    3.0  [3, 4, 0]
33   Covid    Texas  2020-05    4.0  [4, 0, 0]
34   Covid    Texas  2020-06    0.0  [0, 0, 3]
35   Covid    Texas  2020-07    0.0  [0, 3, 0]
36   Covid    Texas  2020-08    3.0  [3, 0, 0]
37   Covid    Texas  2020-09    0.0  [0, 0, 0]
38   Covid    Texas  2020-10    0.0  [0, 0, 0]
39   Covid    Texas  2020-11    0.0  [0, 0, 0]
40   Covid    Texas  2020-12    0.0  [0, 0, 0]
41   Covid    Texas  2021-01    0.0  [0, 0, 0]
42   Covid    Texas  2021-02    0.0  [0, 0, 0]
43   Covid    Texas  2021-03    0.0  [0, 0, 0]
44   Covid    Texas  2021-04    0.0  [0, 0, 0]
45     Flu  Florida  2020-02    0.0  [0, 5, 0]
46     Flu  Florida  2020-03    5.0  [5, 0, 0]
47     Flu  Florida  2020-04    0.0  [0, 0, 0]
48     Flu  Florida  2020-05    0.0  [0, 0, 0]
49     Flu  Florida  2020-06    0.0  [0, 0, 0]
50     Flu  Florida  2020-07    0.0  [0, 0, 0]
51     Flu  Florida  2020-08    0.0  [0, 0, 0]
52     Flu  Florida  2020-09    0.0  [0, 0, 0]
53     Flu  Florida  2020-10    0.0  [0, 0, 0]
54     Flu  Florida  2020-11    0.0  [0, 0, 0]
55     Flu  Florida  2020-12    0.0  [0, 0, 0]
56     Flu  Florida  2021-01    0.0  [0, 0, 0]
57     Flu  Florida  2021-02    0.0  [0, 0, 0]
58     Flu  Florida  2021-03    0.0  [0, 0, 0]
59     Flu  Florida  2021-04    0.0  [0, 0, 0]
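
The same result can probably also be reached without `groupby().apply()`. The following is only a sketch of that alternative, not part of the answer above: it crosses every observed (Disease, State) pair with every month, left-merges the known values, and builds the lists with grouped shifts. It assumes pandas 1.2 or later for `how='cross'`; the names `months`, `pairs`, and `full` are illustrative, and the sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Disease': ['Covid', 'Covid', 'Covid', 'Covid', 'Cancer', 'Covid', 'Covid', 'Flu'],
    'State': ['Texas', 'Texas', 'Texas', 'Texas', 'Florida', 'Florida', 'Florida', 'Florida'],
    'Month': ['2020-03', '2020-04', '2020-05', '2020-08', '2020-04', '2020-03', '2020-04', '2020-03'],
    'Value': [2, 3, 4, 3, 4, 6, 4, 5],
})

months = pd.DataFrame(
    {'Month': pd.date_range('2020-02', '2021-04', freq='MS').strftime('%Y-%m')}
)

# Every observed (Disease, State) pair crossed with every month in the range
pairs = df[['Disease', 'State']].drop_duplicates()
full = pairs.merge(months, how='cross')

# Attach the known values; months without data become 0
full = full.merge(df, on=['Disease', 'State', 'Month'], how='left')
full['Value'] = full['Value'].fillna(0).astype(int)
full = full.sort_values(['Disease', 'State', 'Month'], ignore_index=True)

# Forward-looking window per group: current month plus the next two
grouped = full.groupby(['Disease', 'State'])['Value']
next1 = grouped.shift(-1, fill_value=0)
next2 = grouped.shift(-2, fill_value=0)
full['ValueList'] = [
    list(t) for t in zip(full['Value'].tolist(), next1.tolist(), next2.tolist())
]

print(full[(full['Disease'] == 'Covid') & (full['State'] == 'Texas')].head(8))
```

Because the shifts are done per group, values never leak from one (Disease, State) pair into the next.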

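[Comments]:

Hi @Ynjxsjmh. Thank you so much. The logic is really impressive. Here I am getting ValueError: You are trying to merge on period[M] and object columns. If you wish to proceed you should use pd.concat

@Roy Perhaps try converting your Month column to string with df['Month'] = df['Month'].astype(str).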
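
Both errors reported in the comments suggest that the asker's Month column has the `period[M]` dtype rather than plain strings. A minimal sketch of the two conversions, assuming that is the case (the small frame below is made up for illustration):

```python
import pandas as pd

# Assumed starting point: Month stored as period[M], as the error messages suggest
df = pd.DataFrame({
    'Month': pd.PeriodIndex(['2020-03', '2020-04'], freq='M'),
    'Value': [2, 3],
})

# For the merge-based answer: plain 'YYYY-MM' strings
as_text = df['Month'].astype(str)

# For the reindex-based answer: timestamps at the start of each month
as_timestamp = df['Month'].dt.to_timestamp()

print(as_text.tolist())
print(as_timestamp.tolist())
```

Either result can be assigned back to `df['Month']` before running the corresponding answer.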
