How to avoid looping through categorical variables in pandas to view/operate on dataframe slices/subsets

Posted 2023-02-23

【Asked】: 2021-04-12 09:50:47 【Question】:

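I have a large dataframe with a categorical variable. For each value of the categorical variable, I want to extract values from the matching subset of the dataframe and save them as a collection of lists (in the code example I provide, the lists are used to create sparse vectors).

My current approach loops over each value of the categorical variable, selects the rows with that value, and then extracts the lists from that sub-dataframe. It is slow, I think for two reasons: looping over the dataframe and creating the sub-dataframes.

I would like to speed this up and find a way to avoid this kind of loop over temporary dataframes (I find myself doing this a lot in my code). To give a sense of scale for my current project, I have about 7k categories over 5 million observations. I have included code below to demonstrate my current workflow:

Dataframe setup: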

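import pandas as pd

# toy data: category label, weight, and column position ("loc") per row
c1=['a','b','c','d','e']*5
c2=[4,8,3,5,6]*6
c3=list(range(1,11))*3

# zip() stops at the shortest list, so the dataframe has 25 rows
df=pd.DataFrame(list(zip(c1,c2,c3)),columns=['catvar','weight','loc'])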

Function that loops over the dataframe subsets:

from scipy.sparse import csr_matrix

def make_sparse_vectors(df,
                        loc_colname='loc',
                        weighting_colname='weight',
                        cat_colname='catvar',
                       ):
    # create list of ids:
    id_list=list(df[cat_colname].unique())

    # length of sparse vector:
    vlength=max(df[loc_colname])+1

    # loop to create sparse vectors:
    sparse_vector_dict={}
    for i in id_list:
        df_temp=df[df[cat_colname]==i]

        temp_loc_list=df_temp[loc_colname].tolist()
        temp_weight=df_temp[weighting_colname].tolist()
        temp_row_list=[0]*len(temp_loc_list)

        sparse_vector_dict[i]=csr_matrix((temp_weight,(temp_row_list,temp_loc_list)),shape=(1,vlength))
    
    return sparse_vector_dict

make_sparse_vectors(df)

{'a': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'b': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'c': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'd': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'e': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>}
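Each category has five rows in the toy data, yet the output reports only 2 stored elements per vector. A minimal sketch of why, assuming the same toy dataframe: when csr_matrix is built from (data, (row, col)) triples, entries that share a position are summed. The names sub, vec_a and dense_a are illustrative only.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Same toy data as in the question (zip truncates to 25 rows).
c1 = ['a', 'b', 'c', 'd', 'e'] * 5
c2 = [4, 8, 3, 5, 6] * 6
c3 = list(range(1, 11)) * 3
df = pd.DataFrame(list(zip(c1, c2, c3)), columns=['catvar', 'weight', 'loc'])

# Rows for category 'a': loc is 1, 6, 1, 6, 1 and weight is always 4.
sub = df[df['catvar'] == 'a']

# Build the 1 x 11 vector the same way the question's function does.
vec_a = csr_matrix(
    (sub['weight'].tolist(), ([0] * len(sub), sub['loc'].tolist())),
    shape=(1, int(df['loc'].max()) + 1),
)

# Duplicate column positions are summed: column 1 holds 4+4+4, column 6 holds 4+4.
dense_a = vec_a.toarray()
print(dense_a)    # [[ 0 12  0  0  0  0  8  0  0  0  0]]
print(vec_a.nnz)  # 2
```

If repeated positions should not be summed, the lists would have to be de-duplicated or aggregated differently before building the vector.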

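I think the code snippet with the most room for optimization is the point where I loop over the unique values and create the temporary dataframe: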

for i in id_list:
    df_temp=df[df[cat_colname]==i]

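Some thoughts:

Pandas' groupby() function seems ideal, but from what I can tell from the documentation it is mostly used to reduce the dimensionality of a dataframe. That is useful in some cases, but it does not fit this problem (because the lists I want to extract have, in total, the same size as the dataframe).

Masking might help, but I have not been able to come up with a mask that would let me solve this without involving a loop.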
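For reference, groupby() does not have to shrink the data. A minimal sketch, assuming the toy dataframe from the question: aggregating with list keeps every value, and iterating over the GroupBy object yields one sub-dataframe per category without writing a boolean mask per category. The names lists_per_cat and loc_lists are illustrative only.

```python
import pandas as pd

# Same toy data as in the question (zip truncates to 25 rows).
c1 = ['a', 'b', 'c', 'd', 'e'] * 5
c2 = [4, 8, 3, 5, 6] * 6
c3 = list(range(1, 11)) * 3
df = pd.DataFrame(list(zip(c1, c2, c3)), columns=['catvar', 'weight', 'loc'])

# Aggregating with `list` keeps every value:
# one row per category, each cell holding a Python list.
lists_per_cat = df.groupby('catvar')[['loc', 'weight']].agg(list)
print(lists_per_cat)

# Iterating over the GroupBy object yields (key, sub-dataframe) pairs.
loc_lists = {key: sub['loc'].tolist() for key, sub in df.groupby('catvar')}
print(loc_lists['a'])
```

The lists together contain as many values as the dataframe has rows, so nothing is lost; only the grouping key is collapsed.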

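【Comments】:

What is csr_matrix doing?

@Kenan It creates a sparse matrix for each value of the categorical variable (using the weights and locations/indices extracted from the dataframe). That is what the lists I build inside the loop I am concerned about are for. I see that I somehow missed copying and pasting the line containing the import; I will add it to the code in my question.

【Answer 1】: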

I'm not sure what you want to return, but you should use groupby. This is how I would do it:

loc_colname='loc'
weighting_colname='weight'
cat_colname='catvar'
vlength = max(df[loc_colname]+1)

def create_sparse_vectors(df_temp):
    temp_loc_list=df_temp[loc_colname].tolist()
    temp_weight=df_temp[weighting_colname].tolist()
    temp_row_list=[0]*len(temp_loc_list)

    return csr_matrix((temp_weight,(temp_row_list,temp_loc_list)),shape=(1,vlength))

new_df = df.groupby(cat_colname).apply(create_sparse_vectors)

To get a dictionary (read more here):

df_dict = new_df.to_dict()

You can also speed this process up considerably with swifter or dask. However, if there is too much overhead, this may end up slower.

fast_df = df.groupby(cat_colname).swifter.apply(create_sparse_vectors)
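Separately from the groupby approach in this answer, one loop-free option is to build a single sparse matrix with one row per category. This is a sketch under assumptions, not something proposed in the thread: it uses pd.factorize to turn category labels into row indices, and the names row_idx, cats, n_cols and mat are illustrative only.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Same toy data as in the question (zip truncates to 25 rows).
c1 = ['a', 'b', 'c', 'd', 'e'] * 5
c2 = [4, 8, 3, 5, 6] * 6
c3 = list(range(1, 11)) * 3
df = pd.DataFrame(list(zip(c1, c2, c3)), columns=['catvar', 'weight', 'loc'])

# Map each category label to an integer row index;
# `cats` holds the labels in row order.
row_idx, cats = pd.factorize(df['catvar'])
n_cols = int(df['loc'].max()) + 1

# One matrix for all categories; duplicate (row, col) pairs are summed,
# just as they are in the per-category vectors.
mat = csr_matrix(
    (df['weight'].to_numpy(), (row_idx, df['loc'].to_numpy())),
    shape=(len(cats), n_cols),
)

# Optional: slice out 1 x n_cols vectors to match the original dict layout.
sparse_vector_dict = {cat: mat[i] for i, cat in enumerate(cats)}
print(mat.shape)
print(sparse_vector_dict['a'].toarray())
```

The final dict comprehension still iterates over the categories, but it only slices rows of an existing matrix and never filters the dataframe. If downstream code can work with the stacked matrix directly, that step can be skipped.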

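【Discussion】:

I see, so apply() lets me use groupby() with a custom function. The key difference here is that it returns a dataframe containing the information, whereas my previous approach returned a dictionary containing the same information. Based on running this toy example 1000 times with timeit, your approach is about 15% faster.

Updated with a multiprocessing approach and with converting the df to a dict.

I read the link you posted about swifter/dask, and it does look helpful. However, I get an error when trying to run your example code: AttributeError: 'DataFrameGroupBy' object has no attribute 'swifter'. Based on this thread, using swifter with groupby() has not been implemented yet.

Oh, I see. Maybe try pandarallel then.

If you are happy with my solution, please don't forget to accept the answer so that this question can be closed.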
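A timing comparison like the one mentioned in the comments can be sketched with timeit. This is an assumption-laden illustration rather than the commenter's actual benchmark: it compares the boolean-mask loop against iterating over a GroupBy object (instead of apply(), so that both variants return a dict), and the helper names to_vector, with_mask_loop and with_groupby are hypothetical.

```python
import timeit

import pandas as pd
from scipy.sparse import csr_matrix

# Same toy data as in the question (zip truncates to 25 rows).
c1 = ['a', 'b', 'c', 'd', 'e'] * 5
c2 = [4, 8, 3, 5, 6] * 6
c3 = list(range(1, 11)) * 3
df = pd.DataFrame(list(zip(c1, c2, c3)), columns=['catvar', 'weight', 'loc'])


def to_vector(sub, n_cols):
    """Build a 1 x n_cols sparse vector from one category's rows."""
    rows = [0] * len(sub)
    return csr_matrix(
        (sub['weight'].tolist(), (rows, sub['loc'].tolist())),
        shape=(1, n_cols),
    )


def with_mask_loop(frame):
    """Question's approach: one boolean mask per category."""
    n_cols = int(frame['loc'].max()) + 1
    return {k: to_vector(frame[frame['catvar'] == k], n_cols)
            for k in frame['catvar'].unique()}


def with_groupby(frame):
    """Let groupby split the frame once, then visit each group."""
    n_cols = int(frame['loc'].max()) + 1
    return {k: to_vector(sub, n_cols) for k, sub in frame.groupby('catvar')}


# Absolute numbers depend on machine, data size and library versions.
t_mask = timeit.timeit(lambda: with_mask_loop(df), number=100)
t_group = timeit.timeit(lambda: with_groupby(df), number=100)
print(f"mask loop: {t_mask:.4f}s  groupby: {t_group:.4f}s")
```

On a 25-row toy frame the difference is dominated by fixed overhead, so the comparison is more meaningful on data closer to the real size (about 7k categories over 5 million rows).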
