Fast combination of non-unique rows in numpy array, mapped to columns (i.e. fast pivot table problem, without Pandas)

Posted 2023-02-23

[Title]: Fast combination of non-unique rows in numpy array, mapped to columns (i.e. fast pivot table problem, without Pandas)
[Posted]: 2019-11-23 16:34:49
[Question]:

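I'm wondering if anyone could offer any ideas or advice on the following coding problem. I'm particularly interested in a fast Python implementation (i.e. one that avoids Pandas).

I have a set of (dummy example) data, such as: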

|   User   |   Day   |   Place   |   Foo   |   Bar   |
      1         10        5          True     False
      1         11        8          True     False
      1         11        9          True     False
      2         11        9          True     False
      2         12        1          False    True
      1         12        2          False    True

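It contains data for 2 users ("user1" and "user2") at given days/places, with 2 booleans of interest (here called foo and bar).

I'm only interested in cases where data is recorded for both users on the same day and at the same place. From these relevant rows, I want to create new columns for each day/place entry that hold foo/bar as booleans for each user. For example: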

|   Day   |   Place   |   User 1 Foo   |   User 1 Bar   |   User 2 Foo   |   User 2 Bar   |
    11           9          True            False              True           False

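Each column of data is stored in a numpy array. I appreciate that this is an ideal problem for pandas, using the pivot table feature. For example, a pandas solution is: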

import numpy as np
import pandas as pd

user = np.array([1, 1, 1, 2, 2, 1], dtype=int)
day = np.array([10, 11, 11, 11, 12, 12], dtype=int)
place = np.array([5,8,9,9,1,2], dtype=int)
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool)

df = pd.DataFrame({
    'user': user,
    'day': day,
    'place': place,
    'foo': foo,
    'bar': bar,
})
df2 = df.set_index(['day','place']).pivot(columns='user')

df2.columns = ["User1_foo", "User2_foo", "User1_bar", "User2_bar"]
df2 = df2.reset_index()
df2.dropna(inplace=True)

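But in my real-world usage I have millions of rows of data, and profiling shows that the DataFrame usage and the pivot operation are the performance bottlenecks.

Therefore, how can I achieve the same output, i.e. numpy arrays for day, place and user1_foo, user1_bar, user2_foo, user2_bar, only for the cases where both users have data on the same day and at the same place in the original input arrays?

I wondered whether finding indices from np.unique and then inverting them might be a possible solution, but I couldn't make it work. So any solutions (ideally fast-executing) would be much appreciated!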

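[Comments]:

I can start writing this algorithm once I'm home from a concert. I know the mathematical formula you want, and it can help you solve this optimization problem.

That would be great.. thanks. It would be good to see an algorithm, or even just to hear what the formula is. Thanks!

No, that can't happen.

Thanks for your many answers - this is useful and enlightening. I have selected it as the answer and will award the "bounty" when SO lets me in a few hours...

@SLater01 There are still 6 days until the bounty period ends. So I suggest you hold off on awarding the bounty for a bit. The question will get more attention. You may get better answers, and the suggested answers will get attention too.

Do you have timing results for the various solutions?

[Answer 1]: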

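Approach #1

Here's one based on dimensionality reduction for memory efficiency, and on np.searchsorted for tracing back and looking for matching entries between the two users' data: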

# Extract array data for efficiency, as we will work NumPy tools
a = df.to_numpy(copy=False) #Pandas >= 0.24, use df.values otherwise
i = a[:,:3].astype(int)
j = a[:,3:].astype(bool)
# Test out without astype(int),astype(bool) conversions and see how they perform

# Get grouped scalars for Day and place headers combined
# This assumes that Day and Place data are positive integers
g = i[:,2]*(i[:,1].max()+1) + i[:,1]

# Get groups for user1,2 for original and grouped-scalar items
m1 = i[:,0]==1
uj1,uj2 = j[m1],j[~m1]
ui1 = i[m1]
u1,u2 = g[m1],g[~m1]

# Use searchsorted to look for matching ones between user-1,2 grouped scalars
su1 = u1.argsort()
ssu1_idx = np.searchsorted(u1,u2,sorter=su1)
ssu1_idx[ssu1_idx==len(u1)] = 0
ssu1_idxc = su1[ssu1_idx]

match_mask = u1[ssu1_idxc]==u2
match_idx = ssu1_idxc[match_mask]

# Select matching items off original table
p1,p2 = uj1[match_idx],uj2[match_mask]

# Setup output arrays
day_place = ui1[match_idx,1:]
user1_bools = p1
user2_bools = p2
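
For readers who want to try the idea without building a DataFrame first, here is a minimal standalone sketch of the same key-encoding plus np.searchsorted technique, run directly on the question's sample arrays. The variable names (key, rows1, cand, hit and so on) are illustrative and not taken from the answer, and the key is built as day * (place.max() + 1) + place, which assumes non-negative integers just as the answer's code does.

```python
import numpy as np

user = np.array([1, 1, 1, 2, 2, 1])
day = np.array([10, 11, 11, 11, 12, 12])
place = np.array([5, 8, 9, 9, 1, 2])
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool)

# Collapse each (day, place) pair into one scalar key (non-negative ints assumed).
key = day * (place.max() + 1) + place

is_u1 = user == 1
k1, k2 = key[is_u1], key[~is_u1]
rows1, rows2 = np.flatnonzero(is_u1), np.flatnonzero(~is_u1)

# For every user-2 key, binary-search its position among the sorted user-1 keys.
order = k1.argsort()
pos = np.searchsorted(k1, k2, sorter=order)
pos[pos == len(k1)] = 0   # clamp out-of-range positions; the equality test rejects them
cand = order[pos]
hit = k1[cand] == k2

# Row numbers in the original arrays for the matched user-1 / user-2 entries.
m1, m2 = rows1[cand[hit]], rows2[hit]
result = np.column_stack([day[m1], place[m1], foo[m1], bar[m1], foo[m2], bar[m2]])
print(result)
```

The columns of result are day, place, user 1 foo, user 1 bar, user 2 foo, user 2 bar. Like the answer's code, this sketch assumes each user has at most one row per day/place pair.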

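Approach #1-Extension: Generic Day and Place dtype data

We can extend this to the generic case, where the Day and Place data are not necessarily positive integers. In that case, we can use a dtype-combined, view-based method to perform the data reduction. The only change needed is to obtain g differently: it becomes a view-based array type, obtained like so: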

# https://***.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

# Get grouped scalars for Day and place headers combined with dtype combined view
g = view1D(i[:,1:])
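
As an alternative to the view-based trick (this is a swapped-in technique, not part of the answer), np.unique with axis=0 and return_inverse=True can assign one integer label per distinct row, and those labels could then stand in for g. The sketch below uses hypothetical data with negative and floating-point values; note that np.unique sorts internally, so it may well be slower than the view-based approach on large inputs.

```python
import numpy as np

# Hypothetical day/place data with negative and float values,
# where the scalar encoding of Approach #1 does not apply.
day_place = np.array([[-1.5, 3.0],
                      [ 2.0, 7.0],
                      [-1.5, 3.0],
                      [ 2.0, 8.0]])

# One integer label per distinct row; equal rows share a label.
_, g = np.unique(day_place, axis=0, return_inverse=True)
g = g.ravel()  # some NumPy 2.0 releases return a 2-D inverse when axis is given
print(g)
```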

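Approach #2

We will use lex-sorting to group the data, so that looking for identical elements in consecutive rows tells us whether there are matching entries across the two users. We will re-use a,i,j from Approach #1. The implementation would be: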

# Lexsort the i table
sidx = np.lexsort(i.T)
# OR sidx = i.dot(np.r_[1,i[:,:-1].max(0)+1].cumprod()).argsort()

b = i[sidx]

# Get matching conditions on consecutive rows
m = (np.diff(b,axis=0)==[1,0,0]).all(1)
# Or m = (b[:-1,1] == b[1:,1]) & (b[:-1,2] == b[1:,2]) & (np.diff(b[:,0])==1)

# Trace back to original order by using sidx
match1_idx,match2_idx = sidx[:-1][m],sidx[1:][m]

# Index into relevant table and get desired array outputs
day_place,user1_bools,user2_bools = i[match1_idx,1:],j[match1_idx],j[match2_idx]

Alternatively, we could use an extended mask of m to index into sidx and generate match1_idx,match2_idx. The rest of the code stays the same. Hence, we could do:

from scipy.ndimage import binary_dilation

# Binary extend the mask to have the same length as the input.
# Index into sidx with it. Use one-off offset and stepsize of 2 to get
# user1,2 matching indices
m_ext = binary_dilation(np.r_[m,False],np.ones(2,dtype=bool),origin=-1)
match_idxs = sidx[m_ext]
match1_idx,match2_idx = match_idxs[::2],match_idxs[1::2]
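
To see the lex-sorting idea end to end, here is a small standalone sketch on the question's sample arrays. It builds i and j with np.column_stack instead of going through a DataFrame, which is an assumption made for illustration. The rest follows the first variant of Approach #2.

```python
import numpy as np

user = np.array([1, 1, 1, 2, 2, 1])
day = np.array([10, 11, 11, 11, 12, 12])
place = np.array([5, 8, 9, 9, 1, 2])
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool)

i = np.column_stack([user, day, place])
j = np.column_stack([foo, bar])

# np.lexsort treats the LAST key as the primary one:
# rows end up sorted by place, then day, then user.
sidx = np.lexsort(i.T)
b = i[sidx]

# A match is a pair of consecutive rows with equal day and place
# whose user ids are 1 followed by 2 (difference of exactly 1).
m = (np.diff(b, axis=0) == [1, 0, 0]).all(axis=1)

# Map the matched positions back to row numbers in the original arrays.
match1_idx, match2_idx = sidx[:-1][m], sidx[1:][m]
day_place = i[match1_idx, 1:]
user1_bools, user2_bools = j[match1_idx], j[match2_idx]
print(day_place, user1_bools, user2_bools)
```

Because the user column is the last sort key to break ties, the user-1 row always precedes the user-2 row within a day/place group, which is what the [1,0,0] difference test relies on.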

Approach #3

Here's another one, based on Approach #2 and ported to numba for memory and hence performance efficiency. We will re-use a,i,j from Approach #1:

from numba import njit

@njit
def find_groups_numba(i_s,j_s,user_data,bools):
    n = len(i_s)
    found_iterID = 0
    for iterID in range(n-1):
        if i_s[iterID,1] == i_s[iterID+1,1] and i_s[iterID,2] == i_s[iterID+1,2]:
            bools[found_iterID,0] = j_s[iterID,0]
            bools[found_iterID,1] = j_s[iterID,1]
            bools[found_iterID,2] = j_s[iterID+1,0]
            bools[found_iterID,3] = j_s[iterID+1,1]
            user_data[found_iterID,0] = i_s[iterID,1]
            user_data[found_iterID,1] = i_s[iterID,2]        
            found_iterID += 1
    return found_iterID

# Lexsort the i table
sidx = np.lexsort(i.T)
# OR sidx = i.dot(np.r_[1,i[:,:-1].max(0)+1].cumprod()).argsort()

i_s = i[sidx]
j_s = j[sidx]

n = len(i_s)
user_data = np.empty((n//2,2),dtype=i.dtype)
bools = np.empty((n//2,4),dtype=j.dtype)    
found_iterID = find_groups_numba(i_s,j_s,user_data,bools)    
out_bools = bools[:found_iterID] # Output bool
out_userd = user_data[:found_iterID] # Output user-Day, Place data

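If the outputs must have their own memory space, append .copy() at the last two steps.

Alternatively, we can offload the indexing operation to the NumPy side for a cleaner solution: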

@njit
def find_consec_matching_group_indices(i_s,idx):
    n = len(i_s)
    found_iterID = 0
    for iterID in range(n-1):
        if i_s[iterID,1] == i_s[iterID+1,1] and i_s[iterID,2] == i_s[iterID+1,2]:
            idx[found_iterID] = iterID
            found_iterID += 1            
    return found_iterID

# Lexsort the i table
sidx = np.lexsort(i.T)
# OR sidx = i.dot(np.r_[1,i[:,:-1].max(0)+1].cumprod()).argsort()

i_s = i[sidx]
j_s = j[sidx]

idx = np.empty(len(i_s)//2,dtype=np.uint64)
found_iterID = find_consec_matching_group_indices(i_s,idx)
fidx = idx[:found_iterID]
day_place,user1_bools,user2_bools = i_s[fidx,1:],j_s[fidx],j_s[fidx+1]
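
The core of the numba variant is a single pass over sorted keys that records where two neighbours are equal. The sketch below isolates that loop on a 1-D array of already sorted scalar keys, which is a simplification of the answer's 2-D i_s. The function name consecutive_matches and the fallback decorator are hypothetical; the fallback only exists so the example still runs, slowly, where numba is not installed.

```python
import numpy as np

try:
    from numba import njit      # optional; needed for the actual speed-up
except Exception:
    def njit(func):             # fallback: run the plain Python function
        return func

@njit
def consecutive_matches(keys, idx):
    # Store in idx every position p with keys[p] == keys[p + 1]
    # and return how many such positions were found.
    count = 0
    for p in range(len(keys) - 1):
        if keys[p] == keys[p + 1]:
            idx[count] = p
            count += 1
    return count

# Sorted day/place keys of the sample data (day * 10 + place).
keys = np.array([105, 118, 119, 119, 121, 122], dtype=np.int64)

# n // 2 slots are enough only if no key occurs more than twice,
# i.e. each user has at most one row per day/place pair.
idx = np.empty(len(keys) // 2, dtype=np.int64)
count = consecutive_matches(keys, idx)
print(idx[:count])
```

The n // 2 preallocation mirrors the answer's choice: matched pairs cannot overlap when a key appears at most twice, so there can be at most n // 2 of them. With repeated rows per user that bound no longer holds and the buffer would need to be larger.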

[Answer 2]:

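Another option: find the duplicated rows by ['day','place'], which keeps only the rows common to both users. Then pivot by 'user', rename the columns and reset the index.

Code: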

import pandas as pd
import numpy as np
user = np.array([1, 1, 1, 2, 2, 1], dtype=int)
day = np.array([10, 11, 11, 11, 12, 12], dtype=int)
place = np.array([5,8,9,9,1,2], dtype=int)
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool)

df = pd.DataFrame({
    'user': user,
    'day': day,
    'place': place,
    'foo': foo,
    'bar': bar,
})

df1=df[df.duplicated(['day','place'],keep=False)]\
    .set_index(['day','place']).pivot(columns='user')
name = df1.columns.names[1]
df1.columns = ['{}{}_{}'.format(name, col[1], col[0]) for col in df1.columns.values]
df1 = df1.reset_index()

Output:

   day  place  user1_foo  user2_foo  user1_bar  user2_bar
0   11      9       True       True      False      False

[Comments]:

df.duplicated(['day','place'],keep=False)] is already covered in my answer; you only added df.pivot, which the OP had already used.

Maybe, but I hadn't seen that when I added this. Thanks.

[Answer 3]:

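This does use pandas, but it may still help. First, searching for and deleting all rows that do not have duplicated day and place values up front may speed things up. For example, running df2=df[df.duplicated(['day','place'],keep=False)] removes every row with a unique day and place pair. I'm not sure what your data looks like, but this may significantly reduce the amount of data you have. For the example you gave, this line of code outputs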

   user  day  place   foo    bar
2     1   11      9  True  False
3     2   11      9  True  False

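After this trimming, a simplified data extraction becomes possible. Note that the code below only works if we know that a single user never has duplicate place and day entries and that user 1 always comes first.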

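def every_other_row(df):
    # Even rows belong to user 1, odd rows to user 2 (see the conditions above).
    first = df.iloc[::2, :].copy()
    second = df.iloc[1::2, :]
    # Assign by position: the two slices carry different index labels, so
    # assigning a Series directly would align on the index and produce NaN.
    first['foo user 2'] = second['foo'].to_numpy(dtype=bool)
    first['bar user 2'] = second['bar'].to_numpy(dtype=bool)

    return first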

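The conditions are very specific, but I included this option because it takes 0.289 seconds when I run this code on a DataFrame with one million rows.

Now, for the more general case, you could run something like this: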

df_user1=df.loc[df['user'] == 1] 
df_user2=df.loc[df['user'] == 2] 
df_user2=df_user2.rename(index=str, columns={"foo": "foo user 2", "bar": "bar user 2"})

new=df_user1.merge(df_user2,on=['day','place'])

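Running this on 4.5 million rows takes 3.8 seconds, but this depends on how many rows are unique and how many need merging. I used random numbers to generate my DataFrame, so there is probably less data that needs to be combined.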
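
As a rough illustration of the merge-based route on the question's sample data, the sketch below selects each user's rows and inner-joins them on day and place. The suffixes _user1 and _user2 and the column subset are choices made for this example, not part of the answer. An inner join keeps only the day/place pairs present for both users.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 1],
    "day": [10, 11, 11, 11, 12, 12],
    "place": [5, 8, 9, 9, 1, 2],
    "foo": [True, True, True, True, False, False],
    "bar": [False, False, False, False, True, True],
})

cols = ["day", "place", "foo", "bar"]
u1 = df.loc[df["user"] == 1, cols]
u2 = df.loc[df["user"] == 2, cols]

# how="inner" keeps only the day/place pairs that occur for both users;
# the suffixes distinguish the foo/bar columns of each side.
merged = u1.merge(u2, on=["day", "place"], how="inner",
                  suffixes=("_user1", "_user2"))
print(merged)
```

Dropping the user column before merging avoids the user_x / user_y columns that a merge of the full frames would otherwise carry along.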

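[Comments]:

Thanks for the answer, though using pandas still carries significant overhead, even if it is reduced a little by dropping rows before the operation. From reading online, using "apply" does seem to be slower than vectorized numpy array operations for most things, hence my interest in a numpy solution.

Well, pandas is built on numpy, so I'm not sure why it would necessarily be slower. The pivot operation is certainly slow, but there may be something else going on. I'll investigate further and edit my answer if I can find anything.

[Answer 4]: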

Here is a plain Pythonic solution using set intersection:

import numpy as np
import pandas as pd

user = np.array([1, 1, 1, 2, 2, 1], dtype=int)
day = np.array([10, 11, 11, 11, 12, 12], dtype=int)
place = np.array([5,8,9,9,1,2], dtype=int)
foo = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
bar = np.array([0, 0, 0, 0, 1, 1], dtype=bool) 

# create a set of day/places for user1
user1_dayplaces = {
   (day[row_id], place[row_id])
   for row_id, user_id in enumerate(user)
   if user_id == 1
}

# create a set of day/places for user2
user2_dayplaces = {
   (day[row_id], place[row_id])
   for row_id, user_id in enumerate(user)
   if user_id == 2
}

# intersecting two sets to get the intended day/places
shared_dayplaces = user1_dayplaces & user2_dayplaces

# use day/places as a filter to get the intended row number
final_row_ids = [
   row_id
   for row_id, user_id in enumerate(user)
   if (day[row_id], place[row_id]) in shared_dayplaces
]

# filter the data with finalised row numbers to create the intended dataframe:
df = pd.DataFrame({
   'user':  user[final_row_ids],
   'day':   day[final_row_ids],
   'place': place[final_row_ids],
   'foo':   foo[final_row_ids],
   'bar':   bar[final_row_ids],
}, final_row_ids) # setting the index in this line is only for keeping the original index numbers.

The resulting df is:

   user  day  place   foo    bar
2     1   11      9  True  False
3     2   11      9  True  False
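
The same set-intersection idea can also be expressed with NumPy alone, which may matter at the millions-of-rows scale the question mentions. The sketch below is an assumption-laden variant, not the answer's code: it encodes each day/place pair as one integer (valid for non-negative integers only) and lets np.intersect1d report the shared keys together with their positions in each user's rows.

```python
import numpy as np

user = np.array([1, 1, 1, 2, 2, 1])
day = np.array([10, 11, 11, 11, 12, 12])
place = np.array([5, 8, 9, 9, 1, 2])

# One scalar key per (day, place) pair; assumes non-negative integers.
key = day * (place.max() + 1) + place

rows1 = np.flatnonzero(user == 1)
rows2 = np.flatnonzero(user == 2)

# return_indices=True also reports where each shared key sits in both inputs
# (the first occurrence, if a key is repeated).
shared, in1, in2 = np.intersect1d(key[rows1], key[rows2], return_indices=True)

# Row numbers in the original arrays for user 1 and user 2.
m1, m2 = rows1[in1], rows2[in2]
print(m1, m2)
```

Indexing foo and bar with m1 and m2 then gives the per-user boolean columns directly, without an intermediate DataFrame.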
