Create a new column from the sum of occurrences in an existing column containing nested lists


I have a relatively large DataFrame that looks like this:

(I uploaded the csv file here - ufile.io/526t4)

    value
0   [[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"],[182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]]
1   [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]
2   [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]
3   [[20,79,"D"]]
...
12352   [[25,36,"S"],[37,89,"C"],[90,115,"S"]]
12353   [[1,16,"D"],[17,407,"C"],[408,416,"D"]]
12354   [[16,21,"D"],[22,108,"C"],[109,123,"D"],[124,164,"C"],[165,421,"S"]]
12355 rows × 1 columns

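I want to create a new column containing, for each row, the sum of the lengths (end minus start) of all "D" entries.

Taking the first row as an example:

x = [[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"],[182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]]
new_column_D = sum([y[1]-y[0] for y in x if y[2]=="D"])  # to be applied to all rows

For the first row, new_column_D is 130.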

I tried the following:

df['Column_D']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))

But I get the following error: IndexError: string index out of range

IndexError                                Traceback (most recent call last)
<ipython-input-7-f7f23d42d4e5> in <module>()
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))
~\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   2549             else:
   2550                 values = self.asobject
-> 2551                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2552 
   2553         if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-7-f7f23d42d4e5> in <lambda>(x)
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))
<ipython-input-7-f7f23d42d4e5> in <listcomp>(.0)
----> 1 df['sum']=df["value"].apply(lambda x:sum([y[1]-y[0] for y in x if y[2]=="D"]))

IndexError: string index out of range
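Answer

You're close. You can build the calculation in a list comprehension and then assign the resulting list to a series.

You might think that apply (pd.Series.apply in your code) vectorizes the calculation, but it does not: apply is just a thinly veiled loop with some extra overhead.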
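To illustrate the point about apply being a loop, here is a minimal sketch showing that both approaches call the same Python function once per row and produce the same result. The names df_demo and d_length are hypothetical and exist only for this example; the data is a shortened version of the rows in the question.

```python
import pandas as pd

# Hypothetical small frame whose cells are real Python lists (not strings).
df_demo = pd.DataFrame({"value": [
    [[1, 92, "D"], [93, 93, "C"], [354, 393, "D"]],
    [[18, 23, "D"], [24, 27, "C"]],
]})

def d_length(segments):
    """Sum end - start over the segments labelled "D"."""
    return sum(end - start for start, end, label in segments if label == "D")

# Series.apply calls d_length once per row ...
via_apply = df_demo["value"].apply(d_length).tolist()

# ... and so does a plain list comprehension, without the pandas overhead.
via_comprehension = [d_length(x) for x in df_demo["value"]]
```

Neither version is vectorized; the list comprehension simply skips the bookkeeping that apply adds around the loop.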

import pandas as pd

# Sample data: each cell is a list of [start, end, label] triples
df = pd.DataFrame({'value': [[[1,92,"D"],[93,93,"C"],[94,113,"S"],[114,120,"C"],[121,181,"S"],[182,187,"C"],[188,292,"S"],[319,319,"S"],[320,353,"C"],[354,393,"D"]],
                             [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]],
                             [[18,23,"D"],[24,27,"C"],[28,186,"S"],[187,198,"C"],[199,246,"S"]]]})

df['value'] = [sum([y[1]-y[0] for y in x if y[2]=="D"]) for x in df['value']]

print(df)

   value
0    130
1      5
2      5
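
The example above works because its cells are real Python lists. The "string index out of range" error in the question suggests that, after loading the CSV, each cell is probably a string instead. Iterating over a string yields single characters, and indexing one with y[2] fails. Assuming that is the cause, a minimal sketch of a fix is to parse each cell first. Here json.loads is used because the sample data happens to be valid JSON; df_raw and parsed are hypothetical names for this illustration.

```python
import json

import pandas as pd

# Hypothetical frame mimicking what reading the CSV may return:
# every cell is a string, not a list.
df_raw = pd.DataFrame({"value": [
    '[[1,92,"D"],[93,93,"C"],[354,393,"D"]]',
    '[[18,23,"D"],[24,27,"C"]]',
    '[[20,79,"D"]]',
]})

# Parse each string into a nested list before doing any indexing.
parsed = [json.loads(s) for s in df_raw["value"]]

# Same calculation as in the answer, now applied to real lists.
df_raw["Column_D"] = [
    sum(y[1] - y[0] for y in x if y[2] == "D") for x in parsed
]
```

If the cells were not valid JSON (for example, single-quoted labels), ast.literal_eval from the standard library would be an alternative parser.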
