


【Title】: Create New Dataframe Using Multiple Conditions Across Different Timeline and Location 【Posted】: 2021-07-30 20:46:42 【Question】:


Disease  State       Month      Value
Covid    Texas     2020-03        2     
Covid    Texas     2020-04        3     
Covid    Texas     2020-05        4      
Covid    Texas     2020-08        3 
Cancer   Florida   2020-04        4     
Covid    Florida   2020-03        6      
Covid    Florida   2020-04        4      
Flu      Florida   2020-03        5         

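I need to list the values for 3 consecutive months and create a new dataframe. However, there are some conditions:

    A list is created for each disease, each month (from start to end: February 2020 to April 2021), and each state.

    If a particular month is missing from the dataset, a row is created for that month with a value of 0.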


Disease State    Month      ValueList
Covid   Texas    2020-02    [0, 2, 3] (no dataset for Feb 20 but next two months are) 
Covid   Texas    2020-03    [2, 3, 4] (has values for 3 consecutive months)
Covid   Texas    2020-04    [3, 4, 0] (doesn’t have value for 6th month)   
Covid   Texas    2020-05    [4, 0, 0] (has value for present month)
Covid   Texas    2020-06    [0, 0, 3] (has value for 8th month)
Covid   Texas    2020-07    [0, 3, 0] (has value for 8th month)
Covid   Texas    2020-08    [3, 0, 0] (has value for present month)
Covid   Texas    2020-09    [0, 0, 0] (no dataset for next 3 months)  
Covid   Texas    2020-10    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-11    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-12    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-01    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-02    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-03    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-04    [0, 0, 0] (no dataset for next 3 months)


df3 = (df2.set_index('MonthEnd')
   .groupby(['Disease', 'State']).apply(lambda x: x.drop(['Disease', 'State'], axis=1).asfreq('D')))




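Look at groupby() and ***.com/questions/19324453/… to fill in the missing dates. You can use apply() with groupby() to add the missing dates to each group. Once the data is grouped and the dates are added, you have to iterate and select every three rows: use something like df.Value.tolist() for your Valuelist column.

@JonathanLeon: Thanks for your reply. Could you share another example as well?

Unfortunately, not at the moment. You have many separate questions in your question. Start by searching for groupby and apply, and learn how to iterate and apply functions. I suggest you try it yourself and ask a question about each part of the process, showing what you have tried. People are more inclined to help fix code than to simply provide it.

I have added the logic. There may be better solutions than the one I provided, but the logic will stay the same.

【Answer 1】: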

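Let's start with the simple logic. Basically, you want to create a date range from Feb 2020 to Apr 2021 for each group.

Let's take each group and add this date range using reindex. Once the date range is added, I fill in the data, then apply a rolling function to get 3 consecutive values (the current value plus the two previous ones) and convert each window to a list.

I assign this list of lists to my ValueList column. Then I add all of these modified groups to a dataframe.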
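To make the rolling step concrete, here is a minimal sketch on a single series. The sample values and the helper name `pad_left` are made up for illustration; it assumes a pandas version in which a `Rolling` object can be iterated.

```python
import pandas as pd

# Hypothetical single-group series, already zero-filled for missing months.
values = pd.Series(
    [0, 2, 3, 4],
    index=pd.date_range("2020-02", periods=4, freq="MS"),
)

def pad_left(window, size=3):
    """Left-pad a short window with zeros so every list has `size` items."""
    items = [int(v) for v in window]
    return [0] * (size - len(items)) + items

# Iterating over a Rolling object yields one sub-Series per row; the first
# rows give windows shorter than 3, which is why the padding is needed.
windows = [pad_left(w) for w in values.rolling(3)]
print(windows)
```

Each list ends with the value of its own row, so the windows look backward in time.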


import pandas as pd

df.Month = pd.to_datetime(df.Month, format="%Y-%m")
df = df.set_index('Month')  # reindex below aligns on the month index

def add_elem(li):
    # Left-pad with zeros when the rolling window has fewer than 3 rows.
    return [0] * (3 - len(li)) + li

start = '2020-02'
end = '2021-04'

frames = []
for _, grp in df.groupby(['Disease', 'State']):
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    frames.append(grp)
data = pd.concat(frames)  # DataFrame.append was removed in pandas 2.0


# Alternatively, apply the same steps per group with groupby().apply():
def fill_date(grp):
    grp = grp.reindex(pd.date_range(start=start, end=end, freq="MS"))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = grp.fillna(0)
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = [add_elem(window.to_list()) for window in grp['Value'].rolling(3)]
    return grp

data = df.groupby(['Disease', 'State'], as_index=False).apply(fill_date)


Disease State Value ValueList
2020-02-01 Cancer Florida 0 [0, 0, 0]
2020-03-01 Cancer Florida 0 [0, 0, 0]
2020-04-01 Cancer Florida 4 [0, 0, 4]
2020-05-01 Cancer Florida 0 [0, 4, 0]
2020-06-01 Cancer Florida 0 [4, 0, 0]
2020-07-01 Cancer Florida 0 [0, 0, 0]
2020-08-01 Cancer Florida 0 [0, 0, 0]
2020-09-01 Cancer Florida 0 [0, 0, 0]
2020-10-01 Cancer Florida 0 [0, 0, 0]
2020-11-01 Cancer Florida 0 [0, 0, 0]
2020-12-01 Cancer Florida 0 [0, 0, 0]
2021-01-01 Cancer Florida 0 [0, 0, 0]
2021-02-01 Cancer Florida 0 [0, 0, 0]
2021-03-01 Cancer Florida 0 [0, 0, 0]
2021-04-01 Cancer Florida 0 [0, 0, 0]
2020-02-01 Covid Florida 0 [0, 0, 0]
2020-03-01 Covid Florida 6 [0, 0, 6]
2020-04-01 Covid Florida 4 [0, 6, 4]
2020-05-01 Covid Florida 0 [6, 4, 0]
2020-06-01 Covid Florida 0 [4, 0, 0]
2020-07-01 Covid Florida 0 [0, 0, 0]
2020-08-01 Covid Florida 0 [0, 0, 0]
2020-09-01 Covid Florida 0 [0, 0, 0]
2020-10-01 Covid Florida 0 [0, 0, 0]
2020-11-01 Covid Florida 0 [0, 0, 0]
2020-12-01 Covid Florida 0 [0, 0, 0]
2021-01-01 Covid Florida 0 [0, 0, 0]
2021-02-01 Covid Florida 0 [0, 0, 0]
2021-03-01 Covid Florida 0 [0, 0, 0]
2021-04-01 Covid Florida 0 [0, 0, 0]
2020-02-01 Covid Texas 0 [0, 0, 0]
2020-03-01 Covid Texas 2 [0, 0, 2]
2020-04-01 Covid Texas 3 [0, 2, 3]
2020-05-01 Covid Texas 4 [2, 3, 4]
2020-06-01 Covid Texas 0 [3, 4, 0]
2020-07-01 Covid Texas 0 [4, 0, 0]
2020-08-01 Covid Texas 3 [0, 0, 3]
2020-09-01 Covid Texas 0 [0, 3, 0]
2020-10-01 Covid Texas 0 [3, 0, 0]
2020-11-01 Covid Texas 0 [0, 0, 0]
2020-12-01 Covid Texas 0 [0, 0, 0]
2021-01-01 Covid Texas 0 [0, 0, 0]
2021-02-01 Covid Texas 0 [0, 0, 0]
2021-03-01 Covid Texas 0 [0, 0, 0]
2021-04-01 Covid Texas 0 [0, 0, 0]
2020-02-01 Flu Florida 0 [0, 0, 0]
2020-03-01 Flu Florida 5 [0, 0, 5]
2020-04-01 Flu Florida 0 [0, 5, 0]
2020-05-01 Flu Florida 0 [5, 0, 0]
2020-06-01 Flu Florida 0 [0, 0, 0]
2020-07-01 Flu Florida 0 [0, 0, 0]
2020-08-01 Flu Florida 0 [0, 0, 0]
2020-09-01 Flu Florida 0 [0, 0, 0]
2020-10-01 Flu Florida 0 [0, 0, 0]
2020-11-01 Flu Florida 0 [0, 0, 0]
2020-12-01 Flu Florida 0 [0, 0, 0]
2021-01-01 Flu Florida 0 [0, 0, 0]
2021-02-01 Flu Florida 0 [0, 0, 0]
2021-03-01 Flu Florida 0 [0, 0, 0]
2021-04-01 Flu Florida 0 [0, 0, 0]
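
The output above pairs each month with the two preceding months, while the expected output in the question pairs each month with the two following months. If the forward-looking form is what you need, here is a minimal sketch using plain list slicing instead of rolling. The helper name `forward_windows` and the sample series are made up for illustration.

```python
import pandas as pd

# Hypothetical zero-filled values for one group, Feb 2020 through Aug 2020.
values = pd.Series(
    [0, 2, 3, 4, 0, 0, 3],
    index=pd.date_range("2020-02", periods=7, freq="MS"),
)

def forward_windows(series, size=3):
    """For each row, list its value and the next size-1 values, zero-padded."""
    vals = [int(v) for v in series]
    padded = vals + [0] * (size - 1)  # zeros stand in for months past the end
    return [padded[i:i + size] for i in range(len(vals))]

print(forward_windows(values))
```

Inside the loop or in fill_date, this could replace the rolling line, for example grp['ValueList'] = forward_windows(grp['Value']).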


Hi @Pygirl, thanks again for the detailed explanation. I am getting TypeError: Passing PeriodDtype data is invalid. Use `data.to_timestamp()` instead

@Roy: See this: ***.com/questions/59316865/…

【Answer 2】:

You can use pandas.date_range() to generate the list of months between February 2020 and April 2021.

dates = pd.date_range('2020-02', '2021-04', freq='MS').strftime('%Y-%m')

Then group by the Disease and State columns and fill in the missing rows within each group.
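
To see what the fill step does for one group in isolation, here is a minimal sketch. It assumes Month is stored as a string; the sample rows and the shortened date range are made up for illustration.

```python
import pandas as pd

# Shortened, hypothetical month range to keep the example small.
dates = pd.date_range("2020-02", "2020-06", freq="MS").strftime("%Y-%m")

# One hypothetical group with February and June missing.
group = pd.DataFrame({
    "Disease": ["Covid", "Covid", "Covid"],
    "State": ["Texas", "Texas", "Texas"],
    "Month": ["2020-03", "2020-04", "2020-05"],
    "Value": [2, 3, 4],
})

# A right merge keeps every month from `dates`; months absent from the
# group appear as rows of NaN, which are then filled.
filled = group.merge(pd.DataFrame({"Month": dates}), how="right")
filled[["Disease", "State"]] = filled[["Disease", "State"]].ffill().bfill()
filled["Value"] = filled["Value"].fillna(0).astype(int)
print(filled)
```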

def fill_missing(group):
    group = group.merge(pd.DataFrame({'Month': dates}), how='right')
    group[['Disease', 'State']] = group[['Disease', 'State']].ffill().bfill()
    group['Value'] = group['Value'].fillna(0)

    group['ValueList'] = [[a, b, c] for a, b, c in zip(group['Value'].astype(int), group['Value'].shift(-1).fillna(0).astype(int), group['Value'].shift(-2).fillna(0).astype(int))]

    return group

df_ = df.groupby(['Disease', 'State']).apply(fill_missing).reset_index(drop=True)

   Disease    State    Month  Value  ValueList
0   Cancer  Florida  2020-02    0.0  [0, 0, 4]
1   Cancer  Florida  2020-03    0.0  [0, 4, 0]
2   Cancer  Florida  2020-04    4.0  [4, 0, 0]
3   Cancer  Florida  2020-05    0.0  [0, 0, 0]
4   Cancer  Florida  2020-06    0.0  [0, 0, 0]
5   Cancer  Florida  2020-07    0.0  [0, 0, 0]
6   Cancer  Florida  2020-08    0.0  [0, 0, 0]
7   Cancer  Florida  2020-09    0.0  [0, 0, 0]
8   Cancer  Florida  2020-10    0.0  [0, 0, 0]
9   Cancer  Florida  2020-11    0.0  [0, 0, 0]
10  Cancer  Florida  2020-12    0.0  [0, 0, 0]
11  Cancer  Florida  2021-01    0.0  [0, 0, 0]
12  Cancer  Florida  2021-02    0.0  [0, 0, 0]
13  Cancer  Florida  2021-03    0.0  [0, 0, 0]
14  Cancer  Florida  2021-04    0.0  [0, 0, 0]
15   Covid  Florida  2020-02    0.0  [0, 6, 4]
16   Covid  Florida  2020-03    6.0  [6, 4, 0]
17   Covid  Florida  2020-04    4.0  [4, 0, 0]
18   Covid  Florida  2020-05    0.0  [0, 0, 0]
19   Covid  Florida  2020-06    0.0  [0, 0, 0]
20   Covid  Florida  2020-07    0.0  [0, 0, 0]
21   Covid  Florida  2020-08    0.0  [0, 0, 0]
22   Covid  Florida  2020-09    0.0  [0, 0, 0]
23   Covid  Florida  2020-10    0.0  [0, 0, 0]
24   Covid  Florida  2020-11    0.0  [0, 0, 0]
25   Covid  Florida  2020-12    0.0  [0, 0, 0]
26   Covid  Florida  2021-01    0.0  [0, 0, 0]
27   Covid  Florida  2021-02    0.0  [0, 0, 0]
28   Covid  Florida  2021-03    0.0  [0, 0, 0]
29   Covid  Florida  2021-04    0.0  [0, 0, 0]
30   Covid    Texas  2020-02    0.0  [0, 2, 3]
31   Covid    Texas  2020-03    2.0  [2, 3, 4]
32   Covid    Texas  2020-04    3.0  [3, 4, 0]
33   Covid    Texas  2020-05    4.0  [4, 0, 0]
34   Covid    Texas  2020-06    0.0  [0, 0, 3]
35   Covid    Texas  2020-07    0.0  [0, 3, 0]
36   Covid    Texas  2020-08    3.0  [3, 0, 0]
37   Covid    Texas  2020-09    0.0  [0, 0, 0]
38   Covid    Texas  2020-10    0.0  [0, 0, 0]
39   Covid    Texas  2020-11    0.0  [0, 0, 0]
40   Covid    Texas  2020-12    0.0  [0, 0, 0]
41   Covid    Texas  2021-01    0.0  [0, 0, 0]
42   Covid    Texas  2021-02    0.0  [0, 0, 0]
43   Covid    Texas  2021-03    0.0  [0, 0, 0]
44   Covid    Texas  2021-04    0.0  [0, 0, 0]
45     Flu  Florida  2020-02    0.0  [0, 5, 0]
46     Flu  Florida  2020-03    5.0  [5, 0, 0]
47     Flu  Florida  2020-04    0.0  [0, 0, 0]
48     Flu  Florida  2020-05    0.0  [0, 0, 0]
49     Flu  Florida  2020-06    0.0  [0, 0, 0]
50     Flu  Florida  2020-07    0.0  [0, 0, 0]
51     Flu  Florida  2020-08    0.0  [0, 0, 0]
52     Flu  Florida  2020-09    0.0  [0, 0, 0]
53     Flu  Florida  2020-10    0.0  [0, 0, 0]
54     Flu  Florida  2020-11    0.0  [0, 0, 0]
55     Flu  Florida  2020-12    0.0  [0, 0, 0]
56     Flu  Florida  2021-01    0.0  [0, 0, 0]
57     Flu  Florida  2021-02    0.0  [0, 0, 0]
58     Flu  Florida  2021-03    0.0  [0, 0, 0]
59     Flu  Florida  2021-04    0.0  [0, 0, 0]


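Hi @Ynjxsjmh. Thank you so much. The logic is really impressive. Here I am getting ValueError: You are trying to merge on period[M] and object columns. If you wish to proceed you should use pd.concat

@Roy Maybe try converting your Month column to string with df['Month'] = df['Month'].astype(str).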
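
Both errors reported in the comments suggest that the Month column has dtype period[M] rather than string. Here is a minimal sketch of the two conversions, assuming that dtype; the sample frame is made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose Month column has dtype period[M].
df = pd.DataFrame({
    "Month": pd.PeriodIndex(["2020-03", "2020-04"], freq="M"),
    "Value": [2, 3],
})
print(df["Month"].dtype)

# For the merge-based answer: turn periods into 'YYYY-MM' strings so both
# sides of the merge hold strings.
as_text = df["Month"].astype(str)

# For the reindex-based answer: turn periods into month-start timestamps
# that line up with pd.date_range(..., freq="MS").
as_timestamp = df["Month"].dt.to_timestamp()

print(as_text.tolist())
print(as_timestamp.tolist())
```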



