How to fill the missing values where the start date should be the first day of the month?
【Title】: how to fill the missing values where start date has been first day of month? 【Posted】: 2021-11-03 10:00:02 【Question】: I have a dataframe like this:
tst=
Date        % on Merchant  % on Customer  Merchants                    Location
2021-08-04  0.0            0.10           Zwarma - The Shawarma Maker  Palani
2021-08-05  0.0            0.10           Zwarma - The Shawarma Maker  Palani
2021-08-06  0.0            0.10           Zwarma - The Shawarma Maker  Palani
2021-08-01  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
2021-08-02  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
2021-08-03  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
2021-08-04  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
2021-08-05  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
2021-08-06  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
uni_ind= ['% on Merchant','% on Customer','Merchants','Location']
The output I am looking for, with the missing dates from the first day of the month filled in for Palani:
Date        % on Merchant  % on Customer  Merchants                    Location
2021-08-01  0.0            0.10           Zwarma - The Shawarma Maker  Palani
2021-08-02  0.0            0.10           Zwarma - The Shawarma Maker  Palani
2021-08-03  0.0            0.10           Zwarma - The Shawarma Maker  Palani
2021-08-04  0.0            0.10           Zwarma - The Shawarma Maker  Palani
2021-08-05  0.0            0.10           Zwarma - The Shawarma Maker  Palani
2021-08-06  0.0            0.10           Zwarma - The Shawarma Maker  Palani
2021-08-01  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
2021-08-02  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
2021-08-03  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
2021-08-04  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
2021-08-05  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
2021-08-06  0.0            0.12           Zwarma - The Shawarma Maker  Pollachi
tst.groupby(uni_ind).resample('D').bfill().reset_index(level=(0,1,2,3), drop=True).reset_index()
【Comments on the question】:
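【Answer 1】: For each merchant whose first record in a month falls after the 1st, build the date range from the start of that month up to the day before the first record. Outer-join that range to the original dataframe, sort, and back-fill the gaps (fillna(method="bfill"), or equivalently .bfill()).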
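import io

import pandas as pd

# Columns are separated by two or more spaces so that multi-word values such as
# "% on Merchant" and the merchant name survive the regex separator.
df = pd.read_csv(io.StringIO("""Date  % on Merchant  % on Customer  Merchants  Location
2021-08-04  0.0  0.10  Zwarma - The Shawarma Maker  Palani
2021-08-05  0.0  0.10  Zwarma - The Shawarma Maker  Palani
2021-08-06  0.0  0.10  Zwarma - The Shawarma Maker  Palani
2021-08-01  0.0  0.12  Zwarma - The Shawarma Maker  Pollachi
2021-08-02  0.0  0.12  Zwarma - The Shawarma Maker  Pollachi
2021-08-03  0.0  0.12  Zwarma - The Shawarma Maker  Pollachi
2021-08-04  0.0  0.12  Zwarma - The Shawarma Maker  Pollachi
2021-08-05  0.0  0.12  Zwarma - The Shawarma Maker  Pollachi
2021-08-06  0.0  0.12  Zwarma - The Shawarma Maker  Pollachi"""), sep=r"\s\s+", engine="python")
df["Date"] = pd.to_datetime(df["Date"])

df = (
    df.merge(
        # Earliest date per year / month / merchant / location.
        # The year and month groupers are renamed so they cannot clash with "Date".
        df.groupby(
            [
                df["Date"].dt.year.rename("year"),
                df["Date"].dt.month.rename("month"),
                "Merchants",
                "Location",
            ],
            as_index=False,
        )
        .agg({"Date": "min"})
        # Keep only the groups that do not already start on the 1st.
        .loc[lambda d: d["Date"].dt.day.gt(1)]
        # Build the missing dates: month start up to the day before the first record.
        .apply(
            lambda r: pd.Series(
                {
                    "Date": list(
                        pd.date_range(
                            r["Date"] - pd.offsets.MonthBegin(1),
                            r["Date"] - pd.Timedelta(days=1),
                        )
                    ),
                    "Merchants": r["Merchants"],
                    "Location": r["Location"],
                }
            ),
            axis=1,
        )
        .explode("Date")
        # explode() leaves an object column; restore datetime64 so the merge keys match.
        .assign(Date=lambda d: pd.to_datetime(d["Date"])),
        on=["Date", "Merchants", "Location"],
        how="outer",
    )
    .sort_values(["Merchants", "Location", "Date"])
    # Same as fillna(method="bfill"), which newer pandas versions deprecate.
    .bfill()
)
df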
 | Date | % on Merchant | % on Customer | Merchants | Location |
---|---|---|---|---|---|
9 | 2021-08-01 00:00:00 | 0 | 0.1 | Zwarma - The Shawarma Maker | Palani |
10 | 2021-08-02 00:00:00 | 0 | 0.1 | Zwarma - The Shawarma Maker | Palani |
11 | 2021-08-03 00:00:00 | 0 | 0.1 | Zwarma - The Shawarma Maker | Palani |
0 | 2021-08-04 00:00:00 | 0 | 0.1 | Zwarma - The Shawarma Maker | Palani |
1 | 2021-08-05 00:00:00 | 0 | 0.1 | Zwarma - The Shawarma Maker | Palani |
2 | 2021-08-06 00:00:00 | 0 | 0.1 | Zwarma - The Shawarma Maker | Palani |
3 | 2021-08-01 00:00:00 | 0 | 0.12 | Zwarma - The Shawarma Maker | Pollachi |
4 | 2021-08-02 00:00:00 | 0 | 0.12 | Zwarma - The Shawarma Maker | Pollachi |
5 | 2021-08-03 00:00:00 | 0 | 0.12 | Zwarma - The Shawarma Maker | Pollachi |
6 | 2021-08-04 00:00:00 | 0 | 0.12 | Zwarma - The Shawarma Maker | Pollachi |
7 | 2021-08-05 00:00:00 | 0 | 0.12 | Zwarma - The Shawarma Maker | Pollachi |
8 | 2021-08-06 00:00:00 | 0 | 0.12 | Zwarma - The Shawarma Maker | Pollachi |
【Comments】:
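When reading the data through io, the Location column is dropped and merged into Merchants. The solution works in that form, but I don't want Merchants and Location merged, so please let me know how to keep them apart. It would also be great if the same solution could fill the missing dates at the end of the month by taking the latest available value.
From your sample data I can't see how Merchants and Location are meant to be separated. Is it the trailing space? The solution really is the same: add Location to the groupby and to the construction of the Series.
Updated so that Location is included as well; it was simply added systematically.
【Answer 2】: There is a simpler answer below.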
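Step 1: Get the first date of each month by resampling to month start ('MS'). The level argument covers the four grouping keys in uni_ind:
tst1 = tst.groupby(uni_ind).resample('MS').bfill().reset_index(level=(0,1,2,3), drop=True).reset_index()
Step 2: Append the month-start rows to the original df (pd.concat is used here because DataFrame.append was removed in pandas 2.0):
tst3 = pd.concat([tst.reset_index(), tst1])
Step 3: Drop duplicates, since some groups may already start on the first day of the month:
tst3.drop_duplicates(inplace=True, ignore_index=False, keep='first')
Step 4: Set Date as the index so that resample can be used:
tst3.set_index('Date', inplace=True)
Step 5: Resample the df to daily frequency and forward-fill:
tst3.groupby(uni_ind, dropna=False).resample('D').ffill().reset_index(level=(0,1,2,3), drop=True).reset_index()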
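One of the comments above asks whether the same idea could also fill the missing dates at the end of the month. The following is only a minimal sketch of one way to do that, not the method of either answer: it reindexes each (Merchants, Location) group to a full daily range and fills in both directions. The helper names `fill_group` and `fill_all` and the `to_month_end` flag are hypothetical, and the sketch assumes that Date is an ordinary column and that each group has at most one row per date.

```python
import pandas as pd


def fill_group(group: pd.DataFrame, to_month_end: bool = False) -> pd.DataFrame:
    """Reindex one group to daily dates starting on the 1st of its first month."""
    dates = group["Date"]
    start = dates.min().to_period("M").start_time  # first day of the first month
    end = dates.max()
    if to_month_end:
        # last day of the last month, with the time part reset to midnight
        end = end.to_period("M").end_time.normalize()
    full = pd.date_range(start, end, freq="D")
    return (
        group.set_index("Date")
        .reindex(full)      # adds the missing dates as all-NaN rows
        .bfill()            # leading gap takes the first available value
        .ffill()            # trailing gap takes the latest available value
        .rename_axis("Date")
        .reset_index()
    )


def fill_all(df: pd.DataFrame, keys, to_month_end: bool = False) -> pd.DataFrame:
    """Apply fill_group to every group and stack the results."""
    parts = [fill_group(g, to_month_end) for _, g in df.groupby(keys, sort=False)]
    return pd.concat(parts, ignore_index=True)


# Sample data from the question (Date kept as a column, an assumption of this sketch).
tst = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2021-08-04", "2021-08-05", "2021-08-06"]
            + [f"2021-08-0{d}" for d in range(1, 7)]
        ),
        "% on Merchant": [0.0] * 9,
        "% on Customer": [0.10] * 3 + [0.12] * 6,
        "Merchants": ["Zwarma - The Shawarma Maker"] * 9,
        "Location": ["Palani"] * 3 + ["Pollachi"] * 6,
    }
)

filled = fill_all(tst, ["Merchants", "Location"])
filled_to_month_end = fill_all(tst, ["Merchants", "Location"], to_month_end=True)
print(filled)
```

With the default `to_month_end=False` the frame stops at each group's last recorded date, which matches the output requested in the question. Passing `to_month_end=True` extends every group to the last day of its final month and carries the latest value forward.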
【Comments】: