Running Scipy Linregress Across Dataframe Where Each Element is a List

Posted: 2021-11-12 15:15:25

Question:

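I am working with a Pandas DataFrame in which every element contains a list of values. For each row, I want to run a regression between the list in the first column and the list in each subsequent column, and store the t-stat of each regression (currently in a numpy array). I can do this with a nested for loop over every row and column, but the performance is not optimal for the amount of data I am working with.

Here is a quick example of what I have so far: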

import numpy as np
import pandas as pd
from scipy.stats import linregress

df = pd.DataFrame({
    'a': [list(np.random.rand(11)) for i in range(100)],
    'b': [list(np.random.rand(11)) for i in range(100)],
    'c': [list(np.random.rand(11)) for i in range(100)],
    'd': [list(np.random.rand(11)) for i in range(100)],
    'e': [list(np.random.rand(11)) for i in range(100)],
    'f': [list(np.random.rand(11)) for i in range(100)],
})

The data looks like this:

a   b   c   d   e   f
0   [0.279347961395256, 0.07198822780319691, 0.209...   [0.4733815106836531, 0.5807425586417414, 0.068...   [0.9377037591435088, 0.9698329284595916, 0.241...   [0.03984770879654953, 0.650429630364027, 0.875...   [0.04654151678901641, 0.1959629573862498, 0.36...   [0.01328000288459652, 0.10429773699794731, 0.0...
1   [0.1739544898167934, 0.5279297754363472, 0.635...   [0.6464841177367048, 0.004013634850660308, 0.2...   [0.0403944630279538, 0.9163938509072009, 0.350...   [0.8818108296208096, 0.2910758930807579, 0.739...   [0.5263032002243185, 0.3746299115677546, 0.122...   [0.5511171062367501, 0.327702669239891, 0.9147...
2   [0.49678125158054476, 0.807770957943305, 0.396...   [0.6218806473477556, 0.01720135741717188, 0.15...   [0.6110516368605904, 0.20848099927159314, 0.51...   [0.7473669581190695, 0.5107081859246958, 0.442...   [0.8231961741887535, 0.9686869510163731, 0.473...   [0.34358121300094313, 0.9787339533782848, 0.72...
3   [0.7672751789941814, 0.412055981587398, 0.9951...   [0.8470471648467321, 0.9967427749160083, 0.818...   [0.8591072331661481, 0.6279199806511635, 0.365...   [0.9456189188046846, 0.5084362869897466, 0.586...   [0.2685328112579779, 0.8893788305422594, 0.235...   [0.029919732007230193, 0.6377951981939682, 0.1...
4   [0.21420195955828203, 0.15178914447352077, 0.9...   [0.6865307542882283, 0.0620359602798356, 0.382...   [0.6469510945986712, 0.676059598071864, 0.0396...   [0.2320436872397288, 0.09558341089961908, 0.98...   [0.7733653233006889, 0.2405189745554751, 0.016...   [0.8359561624563979, 0.24335481664355396, 0.38...
... ... ... ... ... ... ...
95  [0.42373270776373506, 0.7731750012629109, 0.90...   [0.9430465078763153, 0.8506292743184455, 0.567...   [0.41367168515273345, 0.9040247409476362, 0.72...   [0.23016875953835192, 0.8206550830081965, 0.26...   [0.954233948805146, 0.995068745046983, 0.20247...   [0.26269690906898413, 0.5032835345055103, 0.26...
96  [0.36114607798432685, 0.11322299769211142, 0.0...   [0.729848741496316, 0.9946930423163686, 0.2265...   [0.17207915211677138, 0.3270055732644267, 0.73...   [0.13211243241239223, 0.28382298905995607, 0.2...   [0.03915259352564071, 0.05639914089770948, 0.0...   [0.12681415759423675, 0.006417761276839351, 0....
97  [0.5020186971295065, 0.04018166955309821, 0.19...   [0.9082402680300308, 0.1334790715379094, 0.991...   [0.7003469664104871, 0.9444397336912727, 0.113...   [0.7982221018200218, 0.9097963438776192, 0.163...   [0.07834894180973451, 0.7948519146738178, 0.56...   [0.5833962514812425, 0.403689767723475, 0.7792...
98  [0.16413822314461857, 0.40683312270714234, 0.4...   [0.07366489230864415, 0.2706766599711766, 0.71...   [0.6410967759869383, 0.5780018716586993, 0.622...   [0.5466463581695835, 0.4949639043264169, 0.749...   [0.40235314091318986, 0.8305539205264385, 0.35...   [0.009668651763079184, 0.8071825962911674, 0.0...
99  [0.8189246990381518, 0.69175150213841, 0.82687...   [0.40469941577758317, 0.49004906937461257, 0.7...   [0.4940080411615112, 0.33621539942693246, 0.67...   [0.8637418291877355, 0.34876318713083676, 0.09...   [0.3526913672876807, 0.5177762589812651, 0.746...   [0.3463129199717484, 0.9694802522161138, 0.732...
100 rows × 6 columns

My code for running the regressions and storing the t-stats:

rows = len(df)
cols = len(df.columns)

tstats = np.zeros(shape=(rows,cols-1))

for i in range(0,rows):
    
    for j in range(1,cols):
        
        lg = linregress(df.iloc[i,0],df.iloc[i,j])
        
        tstats[i,j-1] = lg.slope/lg.stderr

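The code above works fine and does exactly what I need, but as mentioned, performance starts to degrade once the number of rows and columns in df grows substantially.

I am hoping someone can advise on how to optimize my code for better performance.

Thanks!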


Answer 1:

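I'm new to this, but I tried to optimize your original code in two ways:

By using only Python's built-in list objects (pandas isn't needed here, and honestly I couldn't find a better way than your original code to solve this in pandas :D)

By using numpy, which should be (at least so they claim) faster than Python's built-in lists.

You can jump ahead to the code. It is in Jupyter notebook format, so you need to install Jupyter first.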

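Conclusion

Here are the test results: on a (100, 100) matrix of random lists of length 30, the total time difference is about 1.7 seconds, roughly 6% of the run time.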

Time elapsed to run 1 times on new method is 24.282760 seconds.
Time elapsed to run 1 times on old method is 25.954801 seconds.

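See test_perf in the sample code for how these results were produced.

PS: Only one thread was used during the test, so multithreading might help performance, but that is beyond my ability...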
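If multithreading is out of reach, another route worth trying is to drop the per-cell linregress call altogether. The sketch below is not part of the notebook. It assumes every list has the same length, so the DataFrame can be stacked into one 3-D array (for example with np.array(df.values.tolist())), and it relies on the identity t = r * sqrt((n - 2) / (1 - r^2)) for the slope of a simple linear regression, which is algebraically the same as slope / stderr. The function name tstats_vectorized and the random stand-in data are hypothetical.

```python
import numpy as np
from scipy.stats import linregress


def tstats_vectorized(x, y):
    """t-statistic of the regression slope for every (row, column) pair.

    x: array of shape (rows, n), the lists from the first column.
    y: array of shape (rows, k, n), the lists from the remaining k columns.
    """
    n = x.shape[-1]
    xd = x - x.mean(axis=-1, keepdims=True)        # centered x, (rows, n)
    yd = y - y.mean(axis=-1, keepdims=True)        # centered y, (rows, k, n)
    sxy = np.einsum('rn,rkn->rk', xd, yd)          # sum of cross products
    sxx = np.einsum('rn,rn->r', xd, xd)[:, None]   # sum of squares of x
    syy = np.einsum('rkn,rkn->rk', yd, yd)         # sum of squares of y
    r = sxy / np.sqrt(sxx * syy)                   # Pearson correlation
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))


# Hypothetical stand-in for the DataFrame: 100 rows, 6 columns, lists of 11
rng = np.random.default_rng(0)
data = rng.random((100, 6, 11))
fast = tstats_vectorized(data[:, 0, :], data[:, 1:, :])

# Cross-check one cell against scipy (row 3, original column 2)
lg = linregress(data[3, 0], data[3, 2])
print(np.isclose(fast[3, 1], lg.slope / lg.stderr))
```

Because the work happens in a handful of array operations instead of rows * (cols - 1) Python-level calls, this is likely to scale better than the loop, though that is worth measuring on real data. A perfectly correlated pair (r = ±1) divides by zero here, just as stderr is zero in the loop version.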

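Ideas

I think numpy.nditer fits your requirement, although the gain from the optimization is not that significant. Here is my thinking:

    Generating the input array

    I modified the first part of your script; I think a list comprehension is enough to build a matrix of random lists. See get_matrix_from_builtin. Note that I store each random list inside a 1-element tuple, so that it keeps the same shape as the ndarray generated with numpy.

    For comparison, you can also construct such a matrix with numpy. See get_matrix_from_numpy. Because ndarray tries to broadcast list-like objects (and I don't know how to stop it), I had to wrap each list in a tuple to avoid the automatic broadcasting by the numpy.array constructor. If anyone has a better solution, please point it out, thanks :)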

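    Computing the result

    Your original code uses pandas.DataFrame to access elements by row/column index, but that is not the idiomatic way. Pandas provides several iteration tools for DataFrame: pipe, apply, agg, and applymap (search the API docs for details), but they don't seem to fit your requirement here, because you need the row and column of the current element during iteration.

    I searched and found that numpy.nditer can meet this need: it returns an iterator over an ndarray, and its multi_index attribute gives the row/column pair of the current element. See iterating-over-arrays.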
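For reference, this is roughly what a row-wise pandas version looks like with DataFrame.apply(axis=1). It is only a sketch on a small hypothetical frame, and row_tstats is a made-up helper name. It still calls linregress once per cell in Python, so it is unlikely to be faster than the nested loop, only tidier.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Small hypothetical frame: 5 rows, 3 columns, each cell a list of 11 floats
df = pd.DataFrame(
    {c: [list(np.random.rand(11)) for _ in range(5)] for c in 'abc'}
)


def row_tstats(row):
    """Regress every later column of one row on that row's first column."""
    x = row.iloc[0]
    out = {}
    for name, y in row.iloc[1:].items():
        lg = linregress(x, y)
        out[name] = lg.slope / lg.stderr
    return pd.Series(out)


# One output row per input row, one output column per regressed column
tstats = df.apply(row_tstats, axis=1)
print(tstats.shape)
```

The result is a DataFrame labelled with the original column names, which can be converted with tstats.to_numpy() if a plain array is needed.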

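Explanation of solve.ipynb

I used Jupyter Notebook to test this, so you may need it too; see the Jupyter installation instructions.

I changed your original code to remove the pandas dependency and use only built-in lists. See old_calc_tstat in the sample code.

In addition, I used numpy.nditer to compute your tstats matrix; see new_calc_tstat in the sample code.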
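To show the multi_index mechanism in isolation, here is a minimal sketch on a small made-up array, with no regression involved. Each element is overwritten with a value derived from its own row/column position.

```python
import numpy as np

out = np.zeros((2, 3))

# 'multi_index' tracks the (row, col) position of the current element;
# 'readwrite' lets the loop assign to the element through x[...]
with np.nditer(out, flags=['multi_index'], op_flags=['readwrite']) as it:
    for x in it:
        i, j = it.multi_index
        x[...] = 10 * i + j

print(out.tolist())  # [[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]]
```

In new_calc_tstat the same (i, j) pair is used to pick the two lists to regress, and the t-stat is written where this sketch writes 10 * i + j.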

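Then I tested whether the two methods give equal results, using the same input array so that randomness doesn't affect the test. See test_equal for the result.

Finally, I timed both methods. I'm not patient, so I ran each only once; you can increase the number of repetitions in the test_perf function.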
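A single run is noisy. As one possible alternative to a hand-rolled timer, the standard-library timeit module can repeat a measurement and report the best run. The sketch below times one regression on made-up data; one_regression is a hypothetical helper, and the same pattern could wrap old_calc_tstat or new_calc_tstat.

```python
import timeit

import numpy as np
from scipy.stats import linregress

# Made-up input: two random lists of length 30
x = np.random.rand(30)
y = np.random.rand(30)


def one_regression():
    """The unit of work that both methods repeat for every cell."""
    lg = linregress(x, y)
    return lg.slope / lg.stderr


# 5 independent measurements of 100 calls each; the minimum is the least noisy
times = timeit.repeat(one_regression, number=100, repeat=5)
per_call = min(times) / 100
print("best time per regression: %.6f seconds" % per_call)
```

Multiplying the per-call time by the number of cells gives a rough estimate of how much of the total run time is spent inside linregress itself, which neither looping strategy can reduce.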

Code

# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %% [markdown]
# [origin question](https://***.com/questions/69228572/running-scipy-linregress-across-dataframe-where-each-element-is-a-list)
#

# %%
import sys
import time
import numpy as np
from scipy.stats import linregress


# %%
def get_matrix_from_builtin():
    # use builtin list to construct matrix of random list
    # note I put random list inside a tuple to keep it same shape
    # as I later use numpy to do the same thing.
    return [
        [(list(np.random.rand(11)),)
         for col in range(6)]
        for row in range(100)
    ]


# %timeit get_matrix_from_builtin()


# %%
def get_matrix_from_numpy(
    gen=np.random.rand,
    shape=(1, 1),
    nest_shape=(1, ),
):

    # custom dtype for random lists
    mydtype = [
        ('randomlist', 'f', nest_shape)
    ]
    a = np.empty(shape, dtype=mydtype)
    # [DOC] modifying array values
    # https://numpy.org/doc/stable/reference/arrays.nditer.html#modifying-array-values
    # enable the per-operand flag 'readwrite' to modify elements of the ndarray
    # enable the global flag 'refs_ok' to allow the callable 'gen' inside the iteration
    with np.nditer(a, op_flags=['readwrite'], flags=['refs_ok']) as it:
        for x in it:
            # pack the list in a 1-element tuple to prevent numpy from broadcasting it
            x[...] = (gen(nest_shape[0]), )

    return a


def test_get_matrix_from_numpy():
    gen = np.random.rand  # generator of random list
    shape = (6, 100)      # shape of matrix to hold random lists
    nest_shape = (11, )   # shape of random lists
    return get_matrix_from_numpy(gen, shape, nest_shape)
# access a random list by a[row][col][0]


# %timeit test_get_matrix_from_numpy()


# %%
def old_calc_tstat(a=None):
    if a is None:
        a = get_matrix_from_builtin()
        a = np.array(a)
    rows, cols = a.shape[:2]
    tstats = np.zeros(shape=(rows, cols))
    for i in range(0, rows):
        for j in range(1, cols):
            lg = linregress(a[i][0][0], a[i][j][0])
            tstats[i, j-1] = lg.slope/lg.stderr

    return tstats


# %%
def new_calc_tstat(a=None):
    # read the input matrix of random lists
    if a is None:
        gen = np.random.rand
        shape = (6, 100)
        nest_shape = (11, )
        a = get_matrix_from_numpy(gen, shape, nest_shape)
    # construct the ndarray that holds the t-stat results
    tstats = np.empty(a.shape)
    # enable the global flag 'multi_index' to retrieve the index of the current element
    # [DOC] Tracking an Index or Multi-Index
    # https://numpy.org/doc/stable/reference/arrays.nditer.html#tracking-an-index-or-multi-index
    it = np.nditer(tstats, op_flags=['readwrite'], flags=['multi_index'])
    # number of columns in tstats
    col = tstats.shape[1]
    for x in it:
        i, j = it.multi_index
        # trick to avoid IndexError: shift the column by one and wrap around
        # with a negative index, so 0..col-2 map to columns 1..col-1 and the
        # last position maps back to column 0
        j = j + 1 - col
        lg = linregress(
            a[i][0][0],
            a[i][j][0]
        )
        # note: dividing NumPy floats by zero yields inf (with a RuntimeWarning)
        # instead of raising ZeroDivisionError, so handle a zero stderr explicitly
        if lg.stderr == 0:
            x[...] = 0
        else:
            x[...] = lg.slope / lg.stderr
    return tstats


# new_calc_tstat()


# %%
def test_equal():
    """Test whether the new method gives the same output as the old one."""
    # use the same input for both methods so randomness does not affect the comparison
    a = test_get_matrix_from_numpy()
    old = old_calc_tstat(a)
    new = new_calc_tstat(a)
    print(
        "Is the shape of old and new the same?\n%s. old: %s, new: %s\n" % (
            old.shape == new.shape, old.shape, new.shape),
    )
    res = (old == new)
    print(
        "Are the results the same?"
    )
    if res.all():
        print("True.")
    else:
        print("False. Difference (new - old) as below:\n")
        print(new - old)

    return old, new


old, new = test_equal()


# %%
# without the `lg.stderr == 0` check in new_calc_tstat,
# the only difference is the last column:
# in the old method it is 0
# in the new method it is inf
# the check in the new method overrides inf with 0 to match the old method
# [new[x][99] for x in range(6)]


# %%
# python version: 3.8.8

# time.clock was removed in Python 3.8; perf_counter works on every platform
timer = time.perf_counter


def total(func, *args, _reps=1, **kwargs):
    start = timer()
    for i in range(_reps):
        ret = func(*args, **kwargs)
    elapsed = timer() - start
    return elapsed


def test_perf():
    """Test of performance"""
    # first, get a larger input array
    gen = np.random.rand
    shape = (1000, 100)
    nest_shape = (30, )
    a = get_matrix_from_numpy(gen, shape, nest_shape)

    # number of repetitions for each test
    reps = 1

    # then, time both old and new calculation method
    old = total(old_calc_tstat, a, _reps=reps)
    new = total(new_calc_tstat, a, _reps=reps)

    msg = "Time elapsed to run %d times on %s is %f seconds."
    print(msg % (reps, 'new method', new))
    print(msg % (reps, 'old method', old))


test_perf()

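Comments:

Thank you very much for the detailed reply. I'm still working through the whole thing, but this is very helpful. Much appreciated!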
