Merging rows in a pandas dataframe based on multiple values
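[Title]: Merging rows in a pandas dataframe based on multiple values
[Posted]: 2021-04-18 13:41:38
[Question]:
This is essentially related to "Merge values of a dataframe where other columns match", but since that question has already been answered and I could not find the right modification for this different problem, I opened this new thread. I hope that is OK. On to the question. I have the following data: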
date car_brand color city stolen
"2020-01-01" porsche red paris False
"2020-01-01" porsche red london False
"2020-01-01" porsche red munich False
"2020-01-01" porsche red madrid False
"2020-01-01" porsche red rome False
"2020-01-01" porsche blue berlin False
"2020-01-01" porsche blue tokyo False
"2020-01-01" porsche blue peking False
"2020-01-01" porsche white liverpool False
"2020-01-01" porsche white oslo False
"2020-01-01" porsche white barcelona False
"2020-01-01" porsche white miami False
"2020-01-02" porsche red paris False
"2020-01-02" porsche red london False
"2020-01-02" porsche red munich False
"2020-01-02" porsche red madrid False
"2020-01-02" porsche red rome False
"2020-01-02" porsche blue berlin False
"2020-01-02" porsche blue tokyo False
"2020-01-02" porsche blue peking False
"2020-01-02" porsche white liverpool False
"2020-01-02" porsche white oslo False
"2020-01-02" porsche white barcelona False
"2020-01-02" porsche white miami False
"2020-01-03" porsche red paris False
"2020-01-03" porsche red london False
"2020-01-03" porsche red munich False
"2020-01-03" porsche red madrid True
"2020-01-03" porsche red rome False
"2020-01-03" porsche blue berlin False
"2020-01-03" porsche blue tokyo False
"2020-01-03" porsche blue peking False
"2020-01-03" porsche white liverpool False
"2020-01-03" porsche white oslo False
"2020-01-03" porsche white barcelona False
"2020-01-03" porsche white miami False
"2020-01-04" porsche red paris False
"2020-01-04" porsche red london False
"2020-01-04" porsche red munich False
"2020-01-04" porsche red madrid False
"2020-01-04" porsche red rome False
"2020-01-04" porsche blue berlin False
"2020-01-04" porsche blue tokyo False
"2020-01-04" porsche blue peking False
"2020-01-04" porsche white liverpool False
"2020-01-04" porsche white oslo False
"2020-01-04" porsche white barcelona False
"2020-01-04" porsche white miami False
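I would like to know how to create a dataframe according to the following rule: if the boolean "stolen" matches for all entries on consecutive days, I want to merge the date column for those days. In the example above, the boolean entries match for "2020-01-01" and "2020-01-02". So overall I would like to get the following result: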
date car_brand color city stolen
["2020-01-01","2020-01-02"] porsche red paris False
["2020-01-01","2020-01-02"] porsche red london False
["2020-01-01","2020-01-02"] porsche red munich False
["2020-01-01","2020-01-02"] porsche red madrid False
["2020-01-01","2020-01-02"] porsche red rome False
["2020-01-01","2020-01-02"] porsche blue berlin False
["2020-01-01","2020-01-02"] porsche blue tokyo False
["2020-01-01","2020-01-02"] porsche blue peking False
["2020-01-01","2020-01-02"] porsche white liverpool False
["2020-01-01","2020-01-02"] porsche white oslo False
["2020-01-01","2020-01-02"] porsche white barcelona False
["2020-01-01","2020-01-02"] porsche white miami False
["2020-01-03"] porsche red paris False
["2020-01-03"] porsche red london False
["2020-01-03"] porsche red munich False
["2020-01-03"] porsche red madrid True
["2020-01-03"] porsche red rome False
["2020-01-03"] porsche blue berlin False
["2020-01-03"] porsche blue tokyo False
["2020-01-03"] porsche blue peking False
["2020-01-03"] porsche white liverpool False
["2020-01-03"] porsche white oslo False
["2020-01-03"] porsche white barcelona False
["2020-01-03"] porsche white miami False
["2020-01-04"] porsche red paris False
["2020-01-04"] porsche red london False
["2020-01-04"] porsche red munich False
["2020-01-04"] porsche red madrid False
["2020-01-04"] porsche red rome False
["2020-01-04"] porsche blue berlin False
["2020-01-04"] porsche blue tokyo False
["2020-01-04"] porsche blue peking False
["2020-01-04"] porsche white liverpool False
["2020-01-04"] porsche white oslo False
["2020-01-04"] porsche white barcelona False
["2020-01-04"] porsche white miami False
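[Comments]:
From my understanding the same solution works... Why is 3-Jan, Porsche, Paris, red not in 1st and 3rd in the desired output? None of them are stolen.
"From my understanding the same solution works": I tried it, but it did not work for me. "Why is 3-Jan, Porsche, Paris, red not in 1st and 3rd in the desired output? None of them are stolen": That is true. But I want to combine all consecutive days on which the stolen booleans are equal. On the 3rd there is one stolen Porsche (city and color do not matter), so I want the whole of 3 January to be output separately.
@RobRaymond "From my understanding the same solution works": however I use groupby, I end up with a dataframe in which the row with stolen = True is on its own.
[Answer 1]:
For brevity, the code does not build the dataframe from the sample data.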
The key technique is a new column that increments whenever the per-date stolen status changes (the "increment on value change" idiom).
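As a small isolated illustration of the "increment on value change" idiom, here is a minimal sketch. The `flags` series is made up for the example (one boolean per date, True meaning at least one car was stolen that day), and it uses `shift()` with `ne()` rather than the `diff()` call in the answer's code; both are meant to produce the same run numbering.

```python
import pandas as pd

# Hypothetical per-date flags: True means at least one car was stolen that day.
flags = pd.Series(
    [False, False, True, False],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
)

# Compare each value with the previous one; the first row has no
# predecessor (NaN after shift), so it always counts as a change.
changed = flags.ne(flags.shift())

# The running total of changes gives every run of equal values its own id.
group_id = changed.cumsum()
print(group_id.tolist())  # [1, 1, 2, 3]
```

Dates that share a `group_id` belong to the same run, so 2020-01-01 and 2020-01-02 can be merged while 2020-01-03 and 2020-01-04 each stay separate.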
import pandas as pd

df["date"] = pd.to_datetime(df["date"])
# per-date flag (True if any car was stolen that day);
# start a new group whenever that flag changes
df2 = (
    df.groupby("date")["stolen"].max().to_frame()
    .reset_index()
    .assign(stolen_grp=lambda dfa: (dfa.stolen.diff() != 0).cumsum())
    .drop(columns="stolen")
)
# put stolen_grp back into the dataframe
df = df.merge(df2, on="date")
# same technique, breaking on days a car has been stolen
(
    df
    .groupby([c for c in df.columns if c != "date"])["date"]
    # only include the first date, or a date that directly follows the previous one
    .agg(lambda x: [xx for i, xx in enumerate(x) if i == 0 or xx == (list(x)[i - 1] + pd.DateOffset(1))])
    .reset_index()
    .drop(columns="stolen_grp")
)
Sample output
car_brand color city stolen date
porsche blue berlin False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
porsche blue berlin False [2020-01-03 00:00:00]
porsche blue berlin False [2020-01-04 00:00:00]
porsche blue peking False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
porsche blue peking False [2020-01-03 00:00:00]
porsche blue peking False [2020-01-04 00:00:00]
porsche blue tokyo False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
porsche blue tokyo False [2020-01-03 00:00:00]
porsche blue tokyo False [2020-01-04 00:00:00]
porsche red london False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
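Since the answer skips building the dataframe, the following is a self-contained sketch of the whole approach on the question's sample data. It is an illustration rather than the answer's exact code: the construction of `rows`, the names `per_date`, `run_id` and `result`, and the formatting of dates as strings are assumptions made for the example. It also assumes the dates have no gaps, so it relies on the run id alone and omits the consecutive-day check.

```python
import pandas as pd

# Rebuild the sample data from the question (only madrid on 2020-01-03 is stolen).
dates = ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]
cities = {
    "red": ["paris", "london", "munich", "madrid", "rome"],
    "blue": ["berlin", "tokyo", "peking"],
    "white": ["liverpool", "oslo", "barcelona", "miami"],
}
rows = [
    {"date": d, "car_brand": "porsche", "color": color, "city": city,
     "stolen": (d == "2020-01-03" and city == "madrid")}
    for d in dates
    for color, names in cities.items()
    for city in names
]
df = pd.DataFrame(rows)
df["date"] = pd.to_datetime(df["date"])

# One flag per date: True if any car was stolen on that date.
per_date = df.groupby("date")["stolen"].max()
# Run id that increments whenever the per-date flag changes.
run_id = per_date.ne(per_date.shift()).cumsum().rename("stolen_grp")
df = df.merge(run_id.reset_index(), on="date")

# Collect the dates of each run into a list, one row per car/city/status.
result = (
    df.groupby(["stolen_grp", "car_brand", "color", "city", "stolen"], sort=False)["date"]
    .agg(lambda s: s.dt.strftime("%Y-%m-%d").tolist())
    .reset_index()
)
result = result[["date", "car_brand", "color", "city", "stolen"]]
print(result.head(3))
```

Grouping with `sort=False` keeps the rows in order of first appearance, which should give the layout the question asks for: twelve rows for the merged first two days, then twelve rows each for 2020-01-03 and 2020-01-04.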