Python pandas: summarize round trip data in a dataframe with an additional string column [closed]

Posted 2023-03-11

Asked: 2021-04-02 10:24:20

Question:

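Recently someone helped me with a question (https://***.com/a/65417494/14872543), but I don't know enough to adapt that function to the same problem, getting the number of return trips in a dataframe, when an additional string column is present. The first table below is the input, the second is the result I want.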

   station from  station to  lgot  count  
0         20001       20040  stud     22   
1         20001       20040   fed     33  
0         20040       20001  stud     44
2         20040       20001   reg     55 
3         20002       20015  stud     66 
3         20015       20002  stud     77 

   station from  station to  lgot  count  count_back
0         20001       20040  stud     22          44
1         20001       20040   fed     33           0
2         20040       20001   reg     55           0
3         20002       20015  stud     66          77

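My attempt: replace lgot with an integer lgot id (there are not many lgot types, about 7), append that id to the "station from" and "station to" columns, apply the function proposed in that answer, and then reverse the transformation on the resulting dataframe. I may have misunderstood how the function works.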

df.head()
    station from    station to  lgot    count
0   2030080         2030000     full    464
1   2030000         2030080     full    395
2   2030150         2030000     full    330
3   2030000         2030150     full    285
4   2030240         2030000     full    249

df.loc[df['lgot'] == 'full', 'lgot'] = '11'
df.loc[df['lgot'] == 'rzd', 'lgot'] = '22'
df.loc[df['lgot'] == 'fed', 'lgot'] = '33'
df.loc[df['lgot'] == 'reg', 'lgot'] = '44'
df.loc[df['lgot'] == 'stud', 'lgot'] = '55'
df.loc[df['lgot'] == 'voen', 'lgot'] = '66'

df['station to'] = df['station to'].astype('string')+df['lgot']
df['station from'] = df['station from'].astype('string')+df['lgot']

df['station to'] = df['station to'].astype('int')
df['station from'] = df['station from'].astype('int')

df.drop(['lgot'], axis='columns', inplace=True)

def roundtrip(df):
    a, b, c, d = 'station from', 'station to', 'count', 'count_back'
    idx = df[a] > df[b]
    df = df.assign(**{d: 0})
    df.loc[idx, [a, b, c, d]] = df.loc[idx, [b, a, d, c]].values
    return df.groupby([a, b]).sum()

df = roundtrip(df)
df = df.reset_index()

df['lgot'] = df["station from"].astype('string').str.slice(start=-2)
df['station from'] = df['station from'].astype('string').str.slice(stop=7)
df['station to'] = df['station to'].astype('string').str.slice(stop=7)

df.head()
    station from    station to  count   count_back  lgot
0   1003704         2030133     0       1           11
1   1003704         2030160     0       1           11
2   1003704         2031321     0       1           11
3   1003704         2030132     0       1           22
4   1003704         2030133     0       1           22
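The attempt above encodes lgot into the station ids so that the original two-key function can be reused. A minimal sketch of the same idea without the string round trip is to keep lgot as a third grouping key. The function name `roundtrip_by_pair` and the default 0..5 index of the sample frame are assumptions made for illustration; the column names and values come from the question.

```python
import numpy as np
import pandas as pd

# The six sample rows from the question (default index assumed).
sample = pd.DataFrame({
    'station from': [20001, 20001, 20040, 20040, 20002, 20015],
    'station to':   [20040, 20040, 20001, 20001, 20015, 20002],
    'lgot':         ['stud', 'fed', 'stud', 'reg', 'stud', 'stud'],
    'count':        [22, 33, 44, 55, 66, 77],
})

def roundtrip_by_pair(df):
    """Hypothetical helper: put the smaller station id first, then sum per lgot."""
    a, b, c, d, e = 'station from', 'station to', 'lgot', 'count', 'count_back'
    swapped = df[a] > df[b]  # rows that run from the larger id to the smaller one
    out = pd.DataFrame({
        a: np.minimum(df[a], df[b]),
        b: np.maximum(df[a], df[b]),
        c: df[c],
        d: df[d].where(~swapped, 0),  # trips in the smaller -> larger direction
        e: df[d].where(swapped, 0),   # trips in the larger -> smaller direction
    })
    return out.groupby([a, b, c], sort=False, as_index=False).sum()

result = roundtrip_by_pair(sample)
print(result)
```

This variant always reports a pair with the smaller station id first, so on the sample the reg row appears as 20001, 20040 with count 0 and count_back 55 rather than in its original direction. Whether that is acceptable depends on how the result is used.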

Comments:

You should at least show what you have tried before simply asking for a solution.

OK, I've added it, but it looks awful, as if I were working in Excel :)

Answer 1:

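Pierre's solution from the other question no longer works here: with the new data the df[a] > df[b] test fails, because the fifth row is now less than the fourth row. The best way to handle the new data is therefore to use .shift(). You can also pass sort=False to your groupby to improve performance and keep the original order. Finally, I used .reset_index() and changed the column variables a, b, c, d, e to match the new data.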

def roundtrip(df):
    a, b, c, d, e = 'station from', 'station to', 'lgot', 'count', 'count_back'
    idx = (df[a] == df[b].shift()) & (df[b] == df[a].shift())
    df = df.assign(**{e: 0})
    df.loc[idx, [a, b, c, d, e]] = df.loc[idx, [b, a, c, e, d]].values
    return df.groupby([a, b, c], sort=False).sum().reset_index()


roundtrip(df)
Out[1]: 
   station from  station to  lgot  count  count_back
0         20001       20040  stud     22          44
1         20001       20040   fed     33           0
2         20040       20001   reg     55           0
3         20002       20015  stud     66          77
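To see what the .shift() condition in this answer selects, the sketch below computes only the mask on the sample rows from the question. The default 0..5 index and the `is_return` column name are assumptions made for illustration.

```python
import pandas as pd

# The six sample rows from the question (default index assumed).
sample = pd.DataFrame({
    'station from': [20001, 20001, 20040, 20040, 20002, 20015],
    'station to':   [20040, 20040, 20001, 20001, 20015, 20002],
    'lgot':         ['stud', 'fed', 'stud', 'reg', 'stud', 'stud'],
    'count':        [22, 33, 44, 55, 66, 77],
})

a, b = 'station from', 'station to'
# True where a row's stations are the previous row's stations in reverse order.
is_return = (sample[a] == sample[b].shift()) & (sample[b] == sample[a].shift())
print(sample.assign(is_return=is_return))
```

Only the rows at positions 2 and 5 are flagged. The comparison looks at the row directly above and does not involve lgot, so the approach depends on each return row immediately following a row with the reversed stations.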

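Comments:

I'll try to understand the code. I tried the function on my data set: the first two rows are counted correctly, then it goes wrong.

   station from  station to  lgot  count  count_back
0       2030080     2030000  full    464         395
1       2030150     2030000  full    330         285
2       2030240     2030000  full    249           0
3       2030080     2030122   reg    225           0
4       2030000     2030240  full    211           0

@СергейШамсуаров The solution I provided works for the sample data you gave. The new data in your comment looks like a completely different pattern. Please accept this as the solution and create another question. If you want correct answers, make sure you include the right data in your question. Thanks!

Thanks. I'm new to Python and this is my first day using ***. Is it OK to create a similar question with an extended data set and additional comments?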
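For data where a return row does not sit directly below its outbound row, one possible order-independent approach is to merge the per-route totals with a station-swapped copy of themselves. This is a sketch, not part of the answer above: the function name `roundtrip_merge` is hypothetical, station ids are assumed to be integers, and the sample frame uses a default index.

```python
import pandas as pd

def roundtrip_merge(df):
    """Hypothetical helper: look up return trips by merging on swapped stations."""
    a, b, c, d, e = 'station from', 'station to', 'lgot', 'count', 'count_back'
    # One row per (from, to, lgot), in order of first appearance.
    fwd = df.groupby([a, b, c], sort=False, as_index=False)[d].sum()
    # The same totals with the two station columns swapped: each row now
    # describes the return direction of some route.
    back = fwd.rename(columns={a: b, b: a, d: e})
    out = fwd.merge(back, on=[a, b, c], how='left')
    out[e] = out[e].fillna(0).astype(int)  # no return trips found -> 0
    # When both directions exist for the same lgot, keep the one seen first.
    lo = out[[a, b]].min(axis=1)
    hi = out[[a, b]].max(axis=1)
    dup = pd.DataFrame({'lo': lo, 'hi': hi, c: out[c]}).duplicated()
    return out[~dup].reset_index(drop=True)

# The six sample rows from the question (default index assumed).
sample = pd.DataFrame({
    'station from': [20001, 20001, 20040, 20040, 20002, 20015],
    'station to':   [20040, 20040, 20001, 20001, 20015, 20002],
    'lgot':         ['stud', 'fed', 'stud', 'reg', 'stud', 'stud'],
    'count':        [22, 33, 44, 55, 66, 77],
})
result = roundtrip_merge(sample)
print(result)
```

On the six-row sample this reproduces the four desired rows, including the reg row in its original direction. Because the lookup goes through a merge instead of neighbouring rows, the result does not depend on how the input is sorted.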
