Merging rows in a pandas dataframe based on multiple values

【Posted】: 2021-04-18 13:41:38 【Question】:

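This is essentially related to "Merge values of a dataframe where other columns match", but since that question has already been answered and I could not find the right modification for my different problem, I am opening this new thread. I hope that is fine. On to the problem. I have the following data: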

 date              car_brand    color     city      stolen
 "2020-01-01"      porsche      red       paris     False
 "2020-01-01"      porsche      red       london    False
 "2020-01-01"      porsche      red       munich    False
 "2020-01-01"      porsche      red       madrid    False
 "2020-01-01"      porsche      red       rome      False
 "2020-01-01"      porsche      blue      berlin    False 
 "2020-01-01"      porsche      blue      tokyo     False
 "2020-01-01"      porsche      blue      peking    False
 "2020-01-01"      porsche      white     liverpool False 
 "2020-01-01"      porsche      white     oslo      False
 "2020-01-01"      porsche      white     barcelona False
 "2020-01-01"      porsche      white     miami     False
 "2020-01-02"      porsche      red       paris     False
 "2020-01-02"      porsche      red       london    False
 "2020-01-02"      porsche      red       munich    False
 "2020-01-02"      porsche      red       madrid    False
 "2020-01-02"      porsche      red       rome      False
 "2020-01-02"      porsche      blue      berlin    False
 "2020-01-02"      porsche      blue      tokyo     False
 "2020-01-02"      porsche      blue      peking    False
 "2020-01-02"      porsche      white     liverpool False 
 "2020-01-02"      porsche      white     oslo      False
 "2020-01-02"      porsche      white     barcelona False
 "2020-01-02"      porsche      white     miami     False 
 "2020-01-03"      porsche      red       paris     False
 "2020-01-03"      porsche      red       london    False
 "2020-01-03"      porsche      red       munich    False
 "2020-01-03"      porsche      red       madrid    True
 "2020-01-03"      porsche      red       rome      False
 "2020-01-03"      porsche      blue      berlin    False
 "2020-01-03"      porsche      blue      tokyo     False
 "2020-01-03"      porsche      blue      peking    False
 "2020-01-03"      porsche      white     liverpool False 
 "2020-01-03"      porsche      white     oslo      False
 "2020-01-03"      porsche      white     barcelona False 
 "2020-01-03"      porsche      white     miami     False 
 "2020-01-04"      porsche      red       paris     False
 "2020-01-04"      porsche      red       london    False
 "2020-01-04"      porsche      red       munich    False
 "2020-01-04"      porsche      red       madrid    False
 "2020-01-04"      porsche      red       rome      False 
 "2020-01-04"      porsche      blue      berlin    False
 "2020-01-04"      porsche      blue      tokyo     False
 "2020-01-04"      porsche      blue      peking    False 
 "2020-01-04"      porsche      white     liverpool False
 "2020-01-04"      porsche      white     oslo      False
 "2020-01-04"      porsche      white     barcelona False
 "2020-01-04"      porsche      white     miami     False

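I would like to know how to create a dataframe according to the following rule: if the boolean "stolen" matches for all entries on consecutive days, I want to merge the date column for those days. For instance, in the example above the boolean entries match for "2020-01-01" and "2020-01-02". So in total I would like to obtain the following result: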

 date                             car_brand    color     city      stolen
 ["2020-01-01","2020-01-02"]      porsche      red       paris     False
 ["2020-01-01","2020-01-02"]      porsche      red       london    False
 ["2020-01-01","2020-01-02"]      porsche      red       munich    False
 ["2020-01-01","2020-01-02"]      porsche      red       madrid    False
 ["2020-01-01","2020-01-02"]      porsche      red       rome      False
 ["2020-01-01","2020-01-02"]      porsche      blue      berlin    False 
 ["2020-01-01","2020-01-02"]      porsche      blue      tokyo     False
 ["2020-01-01","2020-01-02"]      porsche      blue      peking    False
 ["2020-01-01","2020-01-02"]      porsche      white     liverpool False 
 ["2020-01-01","2020-01-02"]      porsche      white     oslo      False
 ["2020-01-01","2020-01-02"]      porsche      white     barcelona False
 ["2020-01-01","2020-01-02"]      porsche      white     miami     False
 ["2020-01-03"]                   porsche      red       paris     False
 ["2020-01-03"]                   porsche      red       london    False
 ["2020-01-03"]                   porsche      red       munich    False
 ["2020-01-03"]                   porsche      red       madrid    True
 ["2020-01-03"]                   porsche      red       rome      False
 ["2020-01-03"]                   porsche      blue      berlin    False
 ["2020-01-03"]                   porsche      blue      tokyo     False
 ["2020-01-03"]                   porsche      blue      peking    False
 ["2020-01-03"]                   porsche      white     liverpool False 
 ["2020-01-03"]                   porsche      white     oslo      False
 ["2020-01-03"]                   porsche      white     barcelona False 
 ["2020-01-03"]                   porsche      white     miami     False 
 ["2020-01-04"]                   porsche      red       paris     False
 ["2020-01-04"]                   porsche      red       london    False
 ["2020-01-04"]                   porsche      red       munich    False
 ["2020-01-04"]                   porsche      red       madrid    False
 ["2020-01-04"]                   porsche      red       rome      False 
 ["2020-01-04"]                   porsche      blue      berlin    False
 ["2020-01-04"]                   porsche      blue      tokyo     False
 ["2020-01-04"]                   porsche      blue      peking    False 
 ["2020-01-04"]                   porsche      white     liverpool False
 ["2020-01-04"]                   porsche      white     oslo      False
 ["2020-01-04"]                   porsche      white     barcelona False
 ["2020-01-04"]                   porsche      white     miami     False

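【Comments】:

As far as I understand, the same solution works...

Why is 3-Jan, Porsche, Paris, red not merged with the 1st in the required output? None of them has been stolen.

"As far as I understand, the same solution works": I tried it, but it does not work for me. "Why is 3-Jan, Porsche, Paris, red not merged with the 1st in the required output? None of them has been stolen": That is true. But I want to combine all consecutive days on which the stolen booleans are equal. On the 3rd there is a stolen Porsche (city and color do not matter), so I want the whole of January 3rd to be output separately.

@RobRaymond "As far as I understand, the same solution works": The problem is that, however I use groupby, I end up with a dataframe in which only the rows with stolen = True stand on their own.

【Answer 1】: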

To keep this short, the code does not build the dataframe from the sample data.
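For anyone who wants to run the code, here is one possible way to rebuild the sample data from the question. This is only a sketch: the helper names `cities_by_color`, `dates` and `rows` are made up for illustration, and the values are copied from the table in the question.

```python
import pandas as pd

# Cities per colour, copied from the sample table in the question.
cities_by_color = {
    "red": ["paris", "london", "munich", "madrid", "rome"],
    "blue": ["berlin", "tokyo", "peking"],
    "white": ["liverpool", "oslo", "barcelona", "miami"],
}
dates = ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]

# One row per (date, colour, city); every car starts out as not stolen.
rows = [
    {"date": d, "car_brand": "porsche", "color": color, "city": city, "stolen": False}
    for d in dates
    for color, cities in cities_by_color.items()
    for city in cities
]
df = pd.DataFrame(rows)

# The only stolen car in the sample: the red porsche in madrid on 2020-01-03.
df.loc[(df["date"] == "2020-01-03") & (df["city"] == "madrid"), "stolen"] = True
```

The `date` column is still a plain string here, so it has to be converted with `pd.to_datetime` before any date arithmetic.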

The key technique is a new column that increments whenever the per-date stolen flag changes (the "increment on value change" idiom).
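The idiom is easier to see in isolation. A minimal sketch on a hypothetical Series of per-day flags (the names `flags`, `changed` and `group_id` are only for illustration):

```python
import pandas as pd

# Hypothetical per-day flags: True if any car was stolen on that day.
flags = pd.Series(
    [False, False, True, False],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
)

# Compare each value with the previous one. The first row is compared
# against a missing value and therefore always counts as a change.
changed = flags.ne(flags.shift())

# The running sum of the change markers gives one id per run of equal values.
group_id = changed.cumsum()
print(group_id.tolist())  # [1, 1, 2, 3]
```

Days that share a `group_id` form one uninterrupted run of identical flags, which is what the merge step relies on.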

import pandas as pd

df["date"] = pd.to_datetime(df["date"])

# one row per date: was any car stolen that day? A new group starts
# whenever this per-date flag differs from the one of the previous date
df2 = (df.groupby("date")["stolen"].max().to_frame()
 .reset_index()
 .assign(stolen_grp=lambda dfa: dfa["stolen"].ne(dfa["stolen"].shift()).cumsum())
 .drop(columns="stolen")
)

# put stolen_grp back into dataframe
df = df.merge(df2, on="date")

# same technique, breaking on days a car has been stolen:
# group on every column except date and collect the dates into a list
result = (
    df
    .groupby([c for c in df.columns if c != "date"])["date"]
    # only include a date if it is the first one or directly follows the
    # previous date; any other date is left out of the list
    .agg(lambda x: [xx for i, xx in enumerate(x) if i == 0 or xx == (list(x)[i-1] + pd.DateOffset(1))])
    .reset_index()
    .drop(columns="stolen_grp")
)

Sample output

car_brand color   city  stolen                                       date
  porsche  blue berlin   False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
  porsche  blue berlin   False                      [2020-01-03 00:00:00]
  porsche  blue berlin   False                      [2020-01-04 00:00:00]
  porsche  blue peking   False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
  porsche  blue peking   False                      [2020-01-03 00:00:00]
  porsche  blue peking   False                      [2020-01-04 00:00:00]
  porsche  blue  tokyo   False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
  porsche  blue  tokyo   False                      [2020-01-03 00:00:00]
  porsche  blue  tokyo   False                      [2020-01-04 00:00:00]
  porsche   red london   False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
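Grouping on the daily maximum only records whether anything was stolen on a given day. If "matches for all entries" is read strictly, two consecutive days on which different cars were stolen should not be merged. The following is a sketch of that stricter reading, not the method used above: it compares the full set of stolen entries per day. The function name `merge_matching_days`, the constant `KEYS` and the small `sample` frame are hypothetical, and the sketch assumes every day lists the same cars.

```python
import pandas as pd

KEYS = ["car_brand", "color", "city"]  # columns identifying one car entry


def merge_matching_days(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])

    # For every date, the set of entries that are stolen on that date.
    pattern = {
        date: frozenset(g.loc[g["stolen"], KEYS].itertuples(index=False, name=None))
        for date, g in df.groupby("date")
    }

    # Walk the dates in order and open a new group when the pattern changes
    # or when the date does not directly follow the previous one.
    group_of_date, group_id, prev_date = {}, 0, None
    for date in sorted(pattern):
        if (prev_date is None
                or pattern[date] != pattern[prev_date]
                or date - prev_date != pd.Timedelta(days=1)):
            group_id += 1
        group_of_date[date] = group_id
        prev_date = date

    df["grp"] = df["date"].map(group_of_date)
    # sort=False keeps the rows in their order of first appearance.
    return (
        df.groupby(["grp"] + KEYS + ["stolen"], sort=False)["date"]
        .agg(lambda s: s.dt.strftime("%Y-%m-%d").tolist())
        .reset_index()
        .drop(columns="grp")
    )


# Reduced, made-up version of the sample data: two cities, one theft.
sample = pd.DataFrame(
    [(d, "porsche", "red", c, d == "2020-01-03" and c == "madrid")
     for d in ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]
     for c in ["paris", "madrid"]],
    columns=["date"] + KEYS + ["stolen"],
)
result = merge_matching_days(sample)
print(result)
```

On the data from the question both readings give the same grouping, because only one car is stolen on a single day. They differ only when the stolen entries change between two consecutive days that each have a theft.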
