How to find the cumulative sum of items in a list of dictionaries in Python


【Title】How to find cumulative sum of items in a list of dictionaries in python 【Posted】2016-12-27 12:31:54 【Question】:

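I have a list like this:

a = [{'time': 3}, {'time': 4}, {'time': 5}]

I want to get the cumulative sum of the values in reverse order, like this:

b = [{'exp': 3, 'cumsum': 12}, {'exp': 4, 'cumsum': 9}, {'exp': 5, 'cumsum': 5}]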

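What is the most efficient way to do this? I have read other answers that use numpy and give solutions like

import numpy
a = [1, 2, 3]
b = numpy.cumsum(a)

but I also need to insert the cumsum into the dictionaries.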

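【Question comments】:

The most efficient approach is to use numpy, especially when the list is very large. Whether the cumsum needs to go into the dictionaries depends on your application; it may simply be the result of a poor design. Why not use pandas? It looks like a very good fit for your example.

@ImanolLuengo, thanks. My lists are not large, but they vary in length. There are about a million such lists.

Use timeit or cProfile to evaluate the solutions, and let us know the results. For solutions that convert the list to another object type, be sure to include that conversion in the timing.

【Answer 1】: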
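a = [{'time': 3}, {'time': 4}, {'time': 5}]
b = []
cumsum = 0
# Walk the list backwards so the running total covers each item and everything after it
for e in a[::-1]:
    cumsum += e['time']
    b.insert(0, {'exp': e['time'], 'cumsum': cumsum})
print(b)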

Output:

[{'exp': 3, 'cumsum': 12}, {'exp': 4, 'cumsum': 9}, {'exp': 5, 'cumsum': 5}]


It turns out that inserting at the beginning of a list is slow (O(n)). Try a deque instead, where appending on the left is O(1):
from collections import deque


a = [{'time': 3}, {'time': 4}, {'time': 5}]
b = deque()
cumsum = 0
for e in a[::-1]:
    cumsum += e['time']
    # appendleft is O(1), unlike list.insert(0, ...)
    b.appendleft({'exp': e['time'], 'cumsum': cumsum})
print(b)
print(list(b))

Output:

deque([{'cumsum': 12, 'exp': 3}, {'cumsum': 9, 'exp': 4}, {'cumsum': 5, 'exp': 5}])
[{'cumsum': 12, 'exp': 3}, {'cumsum': 9, 'exp': 4}, {'cumsum': 5, 'exp': 5}]
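
Another way to avoid the O(n) front insertion, offered here as a minimal sketch rather than as part of the original answer, is to append to a plain list while walking the input backwards and then reverse the result once at the end. The function name `reverse_cumsum_append` and its `key` parameter are hypothetical.

```python
def reverse_cumsum_append(items, key='time'):
    """Return [{'exp': v, 'cumsum': s}, ...] where s sums v and every later value."""
    out = []
    running = 0
    for entry in reversed(items):       # walk from the last dict to the first
        running += entry[key]
        out.append({'exp': entry[key], 'cumsum': running})  # amortized O(1)
    out.reverse()                       # restore the original order in one pass
    return out


a = [{'time': 3}, {'time': 4}, {'time': 5}]
print(reverse_cumsum_append(a))
# [{'exp': 3, 'cumsum': 12}, {'exp': 4, 'cumsum': 9}, {'exp': 5, 'cumsum': 5}]
```

This keeps the result a regular list, so no deque-to-list conversion is needed afterwards.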


Here is a script that tests the speed of each method in this thread, along with a chart of the timing results:

from collections import deque
from copy import deepcopy
import numpy as np
import pandas as pd
from random import randint
from time import time


def Nehal_pandas(l):
    df = pd.DataFrame(l)
    # .ix was removed in pandas 1.0; .loc gives the same reversed slice here
    df['cumsum'] = df.loc[::-1, 'time'].cumsum()[::-1]
    df.columns = ['exp', 'cumsum']
    return df.to_json(orient='records')


def Merlin_pandas(l):
    df           = pd.DataFrame(l).rename(columns={'time':'exp'})
    df["cumsum"] = df['exp'][::-1].cumsum()
    return df.to_dict(orient='records')


def RahulKP_numpy(l):
    cumsum_list = np.cumsum([i['time'] for i in l][::-1])[::-1]
    for i,j in zip(l,cumsum_list):
        i.update({'cumsum':j})


def Divakar_pandas(l):
    df = pd.DataFrame(l)
    df.columns = ['exp']
    df['cumsum'] = (df[::-1].cumsum())[::-1]
    return df.T.to_dict().values()


def cb_insert_0(l):
    b = []
    cumsum = 0
    for e in l[::-1]:
        cumsum += e['time']
        b.insert(0, {'exp':e['time'], 'cumsum':cumsum})
    return b


def cb_deque(l):
    b = deque()
    cumsum = 0
    for e in l[::-1]:
        cumsum += e['time']
        b.appendleft({'exp':e['time'], 'cumsum':cumsum})
    b = list(b)
    return b


def cb_deque_noconvert(l):
    b = deque()
    cumsum = 0
    for e in l[::-1]:
        cumsum += e['time']
        b.appendleft({'exp':e['time'], 'cumsum':cumsum})
    return b


def hpaulj_gen(l, var='value'):
    cum=0
    for i in l:
        j=i[var]
        cum += j
        yield {var:j, 'sum':cum}


def hpaulj_inplace(l, var='time'):
    cum = 0
    for i in l:
        cum += i[var]
        i['sum'] = cum


def test(number_of_lists, min_list_length, max_list_length):
    test_lists = []

    for _ in range(number_of_lists):
        test_list = []
        number_of_dicts = randint(min_list_length,max_list_length)
        for __ in range(number_of_dicts):
            random_value = randint(0,50)
            test_list.append({'time':random_value})
        test_lists.append(test_list)

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = list(hpaulj_gen(l[::-1], 'time'))[::-1]
    elapsed_time = time() - start_time
    print('hpaulj generator:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        hpaulj_inplace(l[::-1])
    elapsed_time = time() - start_time
    print('hpaulj in place:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = cb_insert_0(l)
    elapsed_time = time() - start_time
    print('craig insert list at 0:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = cb_deque(l)
    elapsed_time = time() - start_time
    print('craig deque:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = cb_deque_noconvert(l)
    elapsed_time = time() - start_time
    print('craig deque no convert:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        RahulKP_numpy(l) # l changed in place
    elapsed_time = time() - start_time
    print('Rahul K P numpy:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = Divakar_pandas(l)
    elapsed_time = time() - start_time
    print('Divakar pandas:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = Nehal_pandas(l)
    elapsed_time = time() - start_time
    print('Nehal pandas:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')

    lists = deepcopy(test_lists)
    start_time = time()
    for l in lists:
        res = Merlin_pandas(l)
    elapsed_time = time() - start_time
    print('Merlin pandas:'.ljust(25), '%.2f' % (number_of_lists / elapsed_time), 'lists per second')
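
The script above measures elapsed time with `time()`. As one commenter suggested, the standard library's `timeit` module can do the same job. The following is a minimal sketch under stated assumptions: `bench` and `reverse_cumsum` are hypothetical names, and `reverse_cumsum` merely stands in for whichever candidate function is being measured.

```python
import timeit
from copy import deepcopy
from random import randint


def reverse_cumsum(items):
    # Stand-in candidate (hypothetical); swap in any function from the script.
    out, running = [], 0
    for entry in reversed(items):
        running += entry['time']
        out.append({'exp': entry['time'], 'cumsum': running})
    out.reverse()
    return out


def bench(func, lists, repeat=3):
    """Return the best time in seconds for one pass of func over all lists."""
    def run():
        # Copy inside the timed call so in-place functions always get fresh
        # input; every candidate pays the same copy cost.
        for l in deepcopy(lists):
            func(l)
    return min(timeit.repeat(run, number=1, repeat=repeat))


test_lists = [[{'time': randint(0, 50)} for _ in range(randint(1, 20))]
              for _ in range(1000)]
best = bench(reverse_cumsum, test_lists)
print('%.2f lists per second' % (len(test_lists) / best))
```

Taking the minimum over several repeats reduces the influence of background noise on the reported figure.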

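【Comments】:

Thanks @craig, but is this a reasonably efficient approach? I have about a million such lists.

Timings for the various approaches are here; let us know :)

【Answer 2】: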

A generator-based solution:

def foo(a, var='value'):
    cum=0
    for i in a:
        j=i[var]
        cum += j
        yield {var:j, 'sum':cum}

In [79]: a=[{'time':i} for i in range(5)]
In [80]: list(foo(a[::-1], var='time'))[::-1]
Out[80]: 
[{'sum': 10, 'time': 0},
 {'sum': 10, 'time': 1},
 {'sum': 9, 'time': 2},
 {'sum': 7, 'time': 3},
 {'sum': 4, 'time': 4}]

In a quick test, this is competitive with cb_insert_0.

The in-place version does better:

def foo2(a, var='time'):
    cum = 0
    for i in a:
        cum += i[var]
        i['sum'] = cum
foo2(a[::-1])
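
The call `foo2(a[::-1])` changes `a` even though it receives a reversed copy, because slicing copies the list but not the dictionaries inside it. The following small sketch illustrates that behaviour; the sample data and the name `reversed_copy` are only for illustration.

```python
def foo2(a, var='time'):
    cum = 0
    for i in a:
        cum += i[var]
        i['sum'] = cum


a = [{'time': 3}, {'time': 4}, {'time': 5}]
reversed_copy = a[::-1]             # new list object, same dict objects
foo2(reversed_copy)
print(reversed_copy[0] is a[-1])    # True: the dicts are shared
print(a)
# [{'time': 3, 'sum': 12}, {'time': 4, 'sum': 9}, {'time': 5, 'sum': 5}]
```

Because nothing new is allocated apart from the reversed list, this variant avoids building a second set of dictionaries.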

【Comments】:

+1 for the in-place version.

Added to the timings :)

【Answer 3】:

Here is another approach using pandas -

df = pd.DataFrame(a)
df.columns = ['exp']
df['cumsum'] = (df[::-1].cumsum())[::-1]
out = df.T.to_dict().values()

Sample input and output -

In [396]: a
Out[396]: [{'time': 3}, {'time': 4}, {'time': 5}]

In [397]: out
Out[397]: [{'cumsum': 12, 'exp': 3}, {'cumsum': 9, 'exp': 4}, {'cumsum': 5, 'exp': 5}]

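【Comments】:

How fast is pandas at converting the dicts to a DataFrame and back? That surely takes more time than the cumsum itself.

@hpaulj Really not sure! I guess some timings would be good.

【Answer 4】: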

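Try this:

a            = [{'time':3},{'time':4},{'time':5}]
df           = pd.DataFrame(a).rename(columns={'time':'exp'})
df["cumsum"] = df['exp'][::-1].cumsum()
df.to_dict(orient='records')

Dictionaries are unordered (before Python 3.7, which guarantees insertion order), so the keys may print in a different order:

 [{'cumsum': 12, 'exp': 3}, {'cumsum': 9, 'exp': 4}, {'cumsum': 5, 'exp': 5}]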

【Comments】:

【Answer 5】:

Try this,

cumsum_list = np.cumsum([i['time'] for i in a][::-1])[::-1]
for i,j in zip(a,cumsum_list):
    i.update({'cumsum':j})

Result

[{'cumsum': 12, 'time': 3}, {'cumsum': 9, 'time': 4}, {'cumsum': 5, 'time': 5}]

Efficiency

Wrapped in a function,

In [49]: def convert_dict(a):
....:     cumsum_list = np.cumsum([i['time'] for i in a][::-1])[::-1]
....:     for i,j in zip(a,cumsum_list):
....:         i.update({'cumsum':j})
....:     return a

Then the result,

In [51]: convert_dict(a)
Out[51]: [{'cumsum': 12, 'time': 3}, {'cumsum': 9, 'time': 4}, {'cumsum': 5, 'time': 5}]

And finally the efficiency,

In [52]: %timeit convert_dict(a)
The slowest run took 12.84 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 12.1 µs per loop
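
For short lists, the same reversed running sum can also be produced without numpy, using `itertools.accumulate` from the standard library. This is a sketch of an alternative, not the answer's method; `convert_dict_stdlib` is a hypothetical name, and like `convert_dict` above it updates the dictionaries in place.

```python
from itertools import accumulate


def convert_dict_stdlib(a, key='time'):
    # Running sums over the values, taken from the last dict to the first.
    sums = list(accumulate(entry[key] for entry in reversed(a)))
    # sums[0] belongs to the last dict, so pair the reversed list with sums.
    for entry, total in zip(reversed(a), sums):
        entry['cumsum'] = total
    return a


a = [{'time': 3}, {'time': 4}, {'time': 5}]
print(convert_dict_stdlib(a))
# [{'time': 3, 'cumsum': 12}, {'time': 4, 'cumsum': 9}, {'time': 5, 'cumsum': 5}]
```

Whether this beats the numpy version depends on list length, so it is worth timing on representative data.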

【Comments】:

【Answer 6】:

Using pandas:

In [4]: df = pd.DataFrame([{'time':3},{'time':4},{'time':5}])

In [5]: df
Out[5]: 
   time
0     3
1     4
2     5

In [6]: df['cumsum'] = df.loc[::-1, 'time'].cumsum()[::-1]  # .ix was removed in pandas 1.0

In [7]: df
Out[7]: 
   time  cumsum
0     3      12
1     4       9
2     5       5

In [8]: df.columns = ['exp', 'cumsum']

In [9]: df
Out[9]: 
   exp  cumsum
0    3      12
1    4       9
2    5       5

In [10]: df.to_json(orient='records')
Out[10]: '[{"exp":3,"cumsum":12},{"exp":4,"cumsum":9},{"exp":5,"cumsum":5}]'

【Comments】:
