在每个元素都是列表的数据帧中运行 Scipy Linregress
Posted
技术标签:
【中文标题】在每个元素都是列表的数据帧中运行 Scipy Linregress【英文标题】:Running Scipy Linregress Across Dataframe Where Each Element is a List 【发布时间】:2021-11-12 15:15:25 【问题描述】:我正在使用 Pandas 数据框,其中每个元素都包含一个值列表。我想在数据框中的每一行的第一列中的列表和每个后续列中的列表之间运行回归,并存储每个回归的 t-stats(当前使用 numpy 数组来存储它们)。我可以使用循环遍历每一行和每一列的嵌套 for 循环来做到这一点,但对于我正在处理的数据量而言,性能并不是最佳的。
这是我目前所拥有的快速示例:
import numpy as np
import pandas as pd
from scipy.stats import linregress
df = pd.DataFrame(
'a': [list(np.random.rand(11)) for i in range(100)],
'b': [list(np.random.rand(11)) for i in range(100)],
'c': [list(np.random.rand(11)) for i in range(100)],
'd': [list(np.random.rand(11)) for i in range(100)],
'e': [list(np.random.rand(11)) for i in range(100)],
'f': [list(np.random.rand(11)) for i in range(100)]
)
数据如下所示:
a b c d e f
0 [0.279347961395256, 0.07198822780319691, 0.209... [0.4733815106836531, 0.5807425586417414, 0.068... [0.9377037591435088, 0.9698329284595916, 0.241... [0.03984770879654953, 0.650429630364027, 0.875... [0.04654151678901641, 0.1959629573862498, 0.36... [0.01328000288459652, 0.10429773699794731, 0.0...
1 [0.1739544898167934, 0.5279297754363472, 0.635... [0.6464841177367048, 0.004013634850660308, 0.2... [0.0403944630279538, 0.9163938509072009, 0.350... [0.8818108296208096, 0.2910758930807579, 0.739... [0.5263032002243185, 0.3746299115677546, 0.122... [0.5511171062367501, 0.327702669239891, 0.9147...
2 [0.49678125158054476, 0.807770957943305, 0.396... [0.6218806473477556, 0.01720135741717188, 0.15... [0.6110516368605904, 0.20848099927159314, 0.51... [0.7473669581190695, 0.5107081859246958, 0.442... [0.8231961741887535, 0.9686869510163731, 0.473... [0.34358121300094313, 0.9787339533782848, 0.72...
3 [0.7672751789941814, 0.412055981587398, 0.9951... [0.8470471648467321, 0.9967427749160083, 0.818... [0.8591072331661481, 0.6279199806511635, 0.365... [0.9456189188046846, 0.5084362869897466, 0.586... [0.2685328112579779, 0.8893788305422594, 0.235... [0.029919732007230193, 0.6377951981939682, 0.1...
4 [0.21420195955828203, 0.15178914447352077, 0.9... [0.6865307542882283, 0.0620359602798356, 0.382... [0.6469510945986712, 0.676059598071864, 0.0396... [0.2320436872397288, 0.09558341089961908, 0.98... [0.7733653233006889, 0.2405189745554751, 0.016... [0.8359561624563979, 0.24335481664355396, 0.38...
... ... ... ... ... ... ...
95 [0.42373270776373506, 0.7731750012629109, 0.90... [0.9430465078763153, 0.8506292743184455, 0.567... [0.41367168515273345, 0.9040247409476362, 0.72... [0.23016875953835192, 0.8206550830081965, 0.26... [0.954233948805146, 0.995068745046983, 0.20247... [0.26269690906898413, 0.5032835345055103, 0.26...
96 [0.36114607798432685, 0.11322299769211142, 0.0... [0.729848741496316, 0.9946930423163686, 0.2265... [0.17207915211677138, 0.3270055732644267, 0.73... [0.13211243241239223, 0.28382298905995607, 0.2... [0.03915259352564071, 0.05639914089770948, 0.0... [0.12681415759423675, 0.006417761276839351, 0....
97 [0.5020186971295065, 0.04018166955309821, 0.19... [0.9082402680300308, 0.1334790715379094, 0.991... [0.7003469664104871, 0.9444397336912727, 0.113... [0.7982221018200218, 0.9097963438776192, 0.163... [0.07834894180973451, 0.7948519146738178, 0.56... [0.5833962514812425, 0.403689767723475, 0.7792...
98 [0.16413822314461857, 0.40683312270714234, 0.4... [0.07366489230864415, 0.2706766599711766, 0.71... [0.6410967759869383, 0.5780018716586993, 0.622... [0.5466463581695835, 0.4949639043264169, 0.749... [0.40235314091318986, 0.8305539205264385, 0.35... [0.009668651763079184, 0.8071825962911674, 0.0...
99 [0.8189246990381518, 0.69175150213841, 0.82687... [0.40469941577758317, 0.49004906937461257, 0.7... [0.4940080411615112, 0.33621539942693246, 0.67... [0.8637418291877355, 0.34876318713083676, 0.09... [0.3526913672876807, 0.5177762589812651, 0.746... [0.3463129199717484, 0.9694802522161138, 0.732...
100 rows × 6 columns
我运行回归和存储 t-stats 的代码:
rows = len(df)
cols = len(df.columns)
tstats = np.zeros(shape=(rows,cols-1))
for i in range(0,rows):
for j in range(1,cols):
lg = linregress(df.iloc[i,0],df.iloc[i,j])
tstats[i,j-1] = lg.slope/lg.stderr
上面的代码工作得很好,完全符合我的需要,但是正如我上面提到的,当 df 中的行数和列数大幅增加时,性能开始变慢。
我希望有人可以就如何优化我的代码以获得更好的性能提供建议。
谢谢!
【问题讨论】:
【参考方案1】:我是新手,但我会优化您的原始代码:
纯粹使用 python 内置列表对象(没有必要使用 pandas,老实说,我找不到比原始代码更好的方法来解决你在 pandas 中的问题:D)
通过使用 numpy,它应该(至少他们声称)比 python 内置列表更快。
你可以跳转查看代码,它是 Jupyter notebook 格式的,所以你需要先installJupyter。
结论
这是测试结果: 在包含 (30,) 长度随机列表的 (100, 100) 矩阵上, 总时差在 1 秒左右。
Time elapsed to run 1 times on new method is 24.282760 seconds.
Time elapsed to run 1 times on old method is 25.954801 seconds.
参考
test_perf
在结果示例代码中。
PS:在测试过程中只使用了一个线程,所以也许多线程将有助于提高性能,但这超出了我的能力......
想法
我认为numpy.nditer
适合您的要求,虽然优化的结果不是那么显着。这是我的想法:
生成输入数组
我已经修改了脚本的第一部分,我认为使用列表推导足以构建一个随机列表矩阵。参考
get_matrix_from_builtin
。
请注意,我已将随机列表存储在另一个 1 元素元组中,以保持 ndarray 从 numpy 生成的形状。
作为比较,你也可以用 numpy 构造这样的矩阵。参考
get_matrix_from_numpy
。
因为 ndarray 尝试广播类似列表的对象(而且我不知道如何停止它),所以我必须将它包装到一个元组中以避免来自 numpy.array
构造函数的自动广播。如果有人有更好的解决方案,请注明,谢谢:)
计算结果
我使用pandas.DataFrame
更改了您的原始代码,以按行/列索引访问元素,但事实并非如此。
Pandas为DataFrame提供了一些迭代工具:pipe
、apply
、agg
和appymap
,更多信息请搜索API,但这里似乎不适合您的要求,因为您想获取当前索引迭代期间的行和列。
我搜索并发现numpy.nditer
可以满足这个需求:它返回一个ndarray 的迭代器,它有一个属性multi_index
提供当前元素的行/列对。见iterating-over-arrays
解释solve.ipynb
我用 Jupyter Notebook 来测试这个,你可能需要一个,here 是安装说明。
我更改了您的原始代码,删除了 pandas 的请求和纯粹使用的内置列表。参考
old_calc_tstat
在示例代码中。
另外,我使用numpy.nditer
来计算您的 tstats 矩阵,请参阅
new_calc_tstat
在示例代码中。
然后,我测试了两种方法的结果是否相等,我使用相同的输入数组以确保随机不会影响测试。参考
test_equal
结果。
最后,做时间表演。我没有耐心所以只运行一次,您可以在
test_perf
函数。
代码
# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %% [markdown]
# [origin question](https://***.com/questions/69228572/running-scipy-linregress-across-dataframe-where-each-element-is-a-list)
#
# %%
import sys
import time
import numpy as np
from scipy.stats import linregress
# %%
def get_matrix_from_builtin():
# use builtin list to construct matrix of random list
# note I put random list inside a tuple to keep it same shape
# as I later use numpy to do the same thing.
return [
[(list(np.random.rand(11)),)
for col in range(6)]
for row in range(100)
]
# %timeit get_matrix_from_builtin()
# %%
def get_matrix_from_numpy(
gen=np.random.rand,
shape=(1, 1),
nest_shape=(1, ),
):
# custom dtype for random lists
mydtype = [
('randonlist', 'f', nest_shape)
]
a = np.empty(shape, dtype=mydtype)
# [DOC] moditfying array values
# https://numpy.org/doc/stable/reference/arrays.nditer.html#modifying-array-values
# enable per operation flags 'readwrite' to modify element in ndarray
# enable global flag 'refs_ok' to allow use callable function 'gen' in iteration
with np.nditer(a, op_flags=['readwrite'], flags=['refs_ok']) as it:
for x in it:
# pack list in a 1-d turple to prevent numpy boardcast it
x[...] = (gen(nest_shape[0]), )
return a
def test_get_matrix_from_numpy():
gen = np.random.rand # generator of random list
shape = (6, 100) # shape of matrix to hold random lists
nest_shape = (11, ) # shape of random lists
return get_matrix_from_numpy(gen, shape, nest_shape)
# access a random list by a[row][col][0]
# %timeit test_get_matrix_from_numpy()
# %%
def test_get_matrix_from_numpy():
gen = np.random.rand
shape = (6, 100)
nest_shape = (11, )
return get_matrix_from_numpy(gen, shape, nest_shape)
# %%
def old_calc_tstat(a=None):
if a is None:
a = get_matrix_from_builtin()
a = np.array(a)
rows, cols = a.shape[:2]
tstats = np.zeros(shape=(rows, cols))
for i in range(0, rows):
for j in range(1, cols):
lg = linregress(a[i][0][0], a[i][j][0])
tstats[i, j-1] = lg.slope/lg.stderr
return tstats
# %%
def new_calc_tstat(a=None):
# read input metrix of random lists
if a is None:
gen = np.random.rand
shape = (6, 100)
nest_shape = (11, )
a = get_matrix_from_numpy(gen, shape, nest_shape)
# construct ndarray for t-stat result
tstats = np.empty(a.shape)
# enable global flags 'multi_index' to retrive index of current element
# [DOC] Tracking an Index or Multi-Index
# https://numpy.org/doc/stable/reference/arrays.nditer.html#tracking-an-index-or-multi-index
it = np.nditer(tstats, op_flags=['readwrite'], flags=['multi_index'])
# obtain total columns count of tstats's shape
col = tstats.shape[1]
for x in it:
i, j = it.multi_index
# trick to avoid IndexError: substract len(list) after +1 to index
j = j + 1 - col
lg = linregress(
a[i][0][0],
a[i][j][0]
)
# note: nditer ignore ZeroDivisionError by default, and return np.inf to the element
# you have to override it manually:
if lg.stderr == 0:
x[...] = 0
else:
x[...] = lg.slope / lg.stderr
return tstats
# new_calc_tstat()
# %%
def test_equal():
"""Test if the new method has equal output to old one"""
# use same input list to avoid affect of rand
a = test_get_matrix_from_numpy()
old = old_calc_tstat(a)
new = new_calc_tstat(a)
print(
"Is the shape of old and new same ?\n%s. old: %s, new: %s\n" % (
old.shape == new.shape, old.shape, new.shape),
)
res = (old == new)
print(
"Is the result object same?"
)
if res.all() == True:
print("True.")
else:
print("False. Difference(new - old) as below:\n")
print(new - old)
return old, new
old, new = test_equal()
# %%
# the only diff is the last element
# in old method it is 0
# in new method it is inf
# if you perfer the old method, just add condition in new method to override
# [new[x][99] for x in range(6)]
# %%
# python version: 3.8.8
timer = time.clock if sys.platform[:3] == 'win' else time.time
def total(func, *args, _reps=1, **kwargs):
start = timer()
for i in range(_reps):
ret = func(*args, **kwargs)
elapsed = timer() - start
return elapsed
def test_perf():
"""Test of performance"""
# first, get a larger input array
gen = np.random.rand
shape = (1000, 100)
nest_shape = (30, )
a = get_matrix_from_numpy(gen, shape, nest_shape)
# repeat how many time for each test
reps = 1
# then, time both old and new calculation method
old = total(old_calc_tstat, a, _reps=reps)
new = total(new_calc_tstat, a, _reps=reps)
msg = "Time elapsed to run %d times on %s is %f seconds."
print(msg % (reps, 'new method', new))
print(msg % (reps, 'old method', old))
test_perf()
【讨论】:
非常感谢您的详细回复。我仍在完成整个过程,但这非常有帮助。非常感谢!以上是关于在每个元素都是列表的数据帧中运行 Scipy Linregress的主要内容,如果未能解决你的问题,请参考以下文章
有没有办法遍历 Playwright 中的 <li> 列表并单击每个元素?