How to transform a sparse pandas DataFrame to a 2D NumPy array

Posted 2023-03-12

Asked: 2019-12-05 13:16:04

Question:

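I have a DataFrame df with columns x and y (both starting at 0) and several value columns. The x and y coordinates are incomplete: many x-y combinations are missing, and sometimes an entire x or y value is absent. I would like to create a 2D NumPy array with the full shape (df.x.max() + 1, df.y.max() + 1), with missing values replaced by np.nan. pd.pivot comes very close, but it does not fill in x/y values that are missing entirely.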
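To make the pd.pivot shortfall concrete, here is a minimal sketch on hypothetical toy data (the values and column contents are invented for illustration). It also shows one possible workaround, reindexing the pivoted frame onto the full coordinate ranges; that workaround is an assumption of this sketch and not something proposed in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: x=1 and y=1 never occur in the frame.
df = pd.DataFrame({
    "x": [0, 0, 2, 3],
    "y": [0, 2, 0, 2],
    "value": [10.0, 11.0, 12.0, 13.0],
})

# pivot only creates rows/columns for coordinates that actually appear,
# so row x=1 and column y=1 are absent from the result.
pivoted = df.pivot(index="x", columns="y", values="value")

# Reindexing onto the full ranges adds the absent rows/columns as NaN.
nrows, ncols = df["x"].max() + 1, df["y"].max() + 1
full = pivoted.reindex(index=range(nrows), columns=range(ncols))
img = full.to_numpy()
```

Here pivoted has shape (3, 2), while img has the desired full shape (4, 3).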

The code below already does what is needed, but it is rather slow because of the for loop:

img = np.full((df.x.max() + 1, df.y.max() +1 ), np.nan)
col = 'value'
for ind, line in df.iterrows():
    img[line.x, line.y] = line[col]

A noticeably faster version is the following:

ind = pd.MultiIndex.from_product((range(df.x.max() + 1), range(df.y.max() +1 )), names=['x', 'y'])
s_img = pd.Series([np.nan]*len(ind), index=ind, name='value')
# `readout` is a row selector from the asker's own code; its definition is not shown in the post
temp = df.loc[readout].set_index(['x', 'y'])['value']
s_img.loc[temp.index] = temp
img = s_img.unstack().values

The question is whether there is a vectorized approach that would make the code shorter and faster.

Thanks in advance for any hints!

Answer 1:

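In general, the fastest way to populate a NumPy array is to allocate the array and then assign values to it with a vectorized operator or function. In this case np.put seems ideal, since it lets you assign values using a (flat) array of indices and an array of values.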

nrows, ncols = df['x'].max() + 1, df['y'].max() +1
img = np.full((nrows, ncols), np.nan)
ind = df['x']*ncols + df['y']
np.put(img, ind, df['value'])
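The flat index above relies on NumPy's default row-major layout, where cell (x, y) of an array with ncols columns sits at position x*ncols + y. The sketch below checks that arithmetic on hypothetical toy data and compares it with two alternatives that are not part of the answer: np.ravel_multi_index for computing the flat index, and direct integer-array indexing for the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"x": [0, 1, 1], "y": [2, 0, 3], "value": [5.0, 6.0, 7.0]})
x, y, value = df["x"].to_numpy(), df["y"].to_numpy(), df["value"].to_numpy()

nrows, ncols = x.max() + 1, y.max() + 1          # (2, 4)

# Row-major flat position of each (x, y) pair, computed two ways.
flat = x * ncols + y
same = np.ravel_multi_index((x, y), (nrows, ncols))

# Fill via np.put with flat indices, as in the answer.
img_put = np.full((nrows, ncols), np.nan)
np.put(img_put, flat, value)

# Fill via 2D integer-array indexing (an alternative, not from the answer).
img_idx = np.full((nrows, ncols), np.nan)
img_idx[x, y] = value
```

Both arrays end up identical, with NaN everywhere except the three assigned cells.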

Here is a benchmark showing that np.put can be about 82x faster than alt (the unstacking method) when building a (100, 100)-shaped result array:

In [184]: df = make_df(100,100)

In [185]: %timeit orig(df)
161 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [186]: %timeit alt(df)
31.2 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [187]: %timeit using_put(df)
378 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [188]: 31200/378
Out[188]: 82.53968253968254

Here is the setup used for the benchmark:

import numpy as np
import pandas as pd

def make_df(nrows, ncols):
    df = pd.DataFrame(np.arange(nrows*ncols).reshape(nrows, ncols))
    df.index.name = 'x'
    df.columns.name = 'y'
    ind_x = np.random.choice(np.arange(nrows), replace=False, size=nrows//2)
    ind_y = np.random.choice(np.arange(ncols), replace=False, size=ncols//2)
    df = df.drop(ind_x, axis=0).drop(ind_y, axis=1).stack().reset_index().rename(columns={0:'value'})
    return df

def orig(df):
    img = np.full((df.x.max() + 1, df.y.max() +1 ), np.nan)
    col = 'value'
    for ind, line in df.iterrows():
        img[line.x, line.y] = line['value']
    return img

def alt(df):
    ind = pd.MultiIndex.from_product((range(df.x.max() + 1), range(df.y.max() +1 )), names=['x', 'y'])
    s_img = pd.Series([np.nan]*len(ind), index=ind, name='value')
    # temp = df.loc[readout].set_index(['x', 'y'])['value']
    temp = df.set_index(['x', 'y'])['value']
    s_img.loc[temp.index] = temp
    img = s_img.unstack().values
    return img

def using_put(df):
    nrows, ncols = df['x'].max() + 1, df['y'].max() +1
    img = np.full((nrows, ncols), np.nan)
    ind = df['x']*ncols + df['y']
    np.put(img, ind, df['value'])
    return img

Alternatively, since your DataFrame is sparse, you might be interested in creating a sparse matrix:

import scipy.sparse as sparse

def using_coo(df):
    nrows, ncols = df['x'].max() + 1, df['y'].max() +1    
    result = sparse.coo_matrix(
        (df['value'], (df['x'], df['y'])), shape=(nrows, ncols), dtype='float64')
    return result
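One caveat when using the sparse result in place of the NaN-filled array: a sparse matrix treats unstored cells as zero, so toarray() yields 0.0 rather than np.nan for missing coordinates, and entries with duplicate coordinates are summed. The sketch below, on hypothetical toy data, shows one way to recover a NaN-filled dense array from the COO triplets; it is an illustration, not part of the answer.

```python
import numpy as np
import scipy.sparse as sparse

# Hypothetical toy data; note the explicitly stored 0.0 at (1, 0).
x = np.array([0, 1, 1])
y = np.array([2, 0, 3])
value = np.array([5.0, 0.0, 7.0])

coo = sparse.coo_matrix((value, (x, y)), shape=(2, 4), dtype="float64")

# toarray() cannot distinguish "missing" from a stored zero: both become 0.0.
dense_zeros = coo.toarray()

# Rebuild a dense array where only unstored cells are NaN.
dense_nan = np.full(coo.shape, np.nan)
dense_nan[coo.row, coo.col] = coo.data
```

In dense_nan the stored zero at (1, 0) stays 0.0, while the five cells without data are NaN.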

As one would expect, creating a sparse matrix (from sparse data) is faster, and requires less memory, than creating a dense NumPy array:

In [237]: df = make_df(100,100)

In [238]: %timeit using_put(df)
381 µs ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [239]: %timeit using_coo(df)
196 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [240]: 381/196
Out[240]: 1.9438775510204083
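The memory side of that comparison can be estimated directly. The following sketch uses a deterministic, hypothetical layout (only even x and y coordinates are present, so roughly a quarter of the cells are filled) and compares the bytes held by the dense array with the bytes held by the three COO arrays. The numbers depend on the fill fraction and on the index dtype SciPy chooses, so treat them as illustrative.

```python
import numpy as np
import scipy.sparse as sparse

# Hypothetical data: values exist only at even x and even y coordinates.
xs, ys = np.meshgrid(np.arange(0, 100, 2), np.arange(0, 100, 2), indexing="ij")
x, y = xs.ravel(), ys.ravel()
value = np.arange(x.size, dtype="float64")

shape = (int(x.max()) + 1, int(y.max()) + 1)     # (99, 99)

# Dense NaN-filled array: every cell costs 8 bytes, filled or not.
dense = np.full(shape, np.nan)
dense[x, y] = value

# COO matrix: only the stored entries cost memory (value + row + column index).
coo = sparse.coo_matrix((value, (x, y)), shape=shape)

dense_bytes = dense.nbytes
sparse_bytes = coo.data.nbytes + coo.row.nbytes + coo.col.nbytes
print(dense_bytes, sparse_bytes)
```

The sparse advantage shrinks as more cells are filled, since each stored entry carries two indices in addition to its value.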
