Python numpy: efficiently get the row containing the min value of a column for each unique tuple of 3 other columns

Question

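I have some data stored in a list of lists (about 200,000 rows × 6 columns).

I need the following subset of the data: for each unique set of values in columns [1,2,4], I need to find the row with the minimum value in column 0 and keep only that row.

I have to do this in old numpy 1.10 (don't ask...), so np.unique() has no 'axis=0' option.

The example below runs and produces the correct output, but it is very slow. This seems pretty basic, so I feel the (lack of) speed must be my fault.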

# S-L-O-W way to get the desired output:
import numpy as np

# Example dataset
data = [[1, 1, 1, 'a', 1],
        [0, 1, 1, 'b', 1],
        [0, 3, 1, 'c', 4],
        [3, 1, 1, 'd', 1],
        [4, 3, 1, 'e', 4]]

desired_output = [[0, 1, 1, 'b', 1],
                  [0, 3, 1, 'c', 4]]

# Currently coding on a geriatric machine with numpy pre-version 1.13 and no ability to upgrade,
# so np.unique() won't take an axis argument. The next few hack lines of code get around this with strings...
tuples_str = []
tuples_raw = [[datarow[jj] for jj in [1,2,4]]  for datarow in data ]
for datarow in data:
    one_tuple = [datarow[jj] for jj in [1,2,4]]
    tuples_str.append( '_'.join([str(ww) for ww in one_tuple]) )

# Numpy unique on this data subset with just columns [1,2,4] of original data
unq, unq_inv, unq_cnt = np.unique(tuples_str, return_inverse=True, return_counts=True)

# Storage
output = []

# Here's the painfully slow part:
# Iterate over each subset of data where rows take the value in one unique tuple (i.e. columns [1,2,4] are identical)
for idx in np.unique(unq_inv):

    # Get the rows that have the same values in columns [1,2,4]
    # (this rescans all of data once per unique tuple, which is why it is slow)
    all_matches_thistuple = [row for jj, row in enumerate(data) if unq_inv[jj]==idx]

    # Find the index of the row with the minimum value for column 0
    first_line_min_idx = np.argmin([int(row1[0]) for row1 in all_matches_thistuple])

    # Save only that row
    output.append(all_matches_thistuple[first_line_min_idx])
print(output)

Answer 1
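
One possible way to avoid the per-group Python loop is to sort once and keep the first row of each group. The sketch below is only an illustration under stated assumptions, not a tested solution for the asker's real data: it assumes the data is non-empty, that columns 0, 1, 2 and 4 are numeric, and the helper name min_rows_per_group is made up for this example. It uses only np.lexsort and boolean masks, which do not need the axis argument of np.unique().

```python
import numpy as np

def min_rows_per_group(data, key_cols=(1, 2, 4), value_col=0):
    """For each unique combination in key_cols, return the row with the
    smallest value in value_col (hypothetical helper for illustration)."""
    # Assumption: key and value columns are numeric and data is non-empty.
    keys = np.array([[row[j] for j in key_cols] for row in data])
    values = np.array([row[value_col] for row in data])

    # np.lexsort treats the LAST key as the primary sort key, so pass the
    # value column first (least significant) and the key columns after it.
    sort_keys = (values,) + tuple(keys[:, j] for j in reversed(range(keys.shape[1])))
    order = np.lexsort(sort_keys)

    # After sorting, rows of the same group are adjacent and the row with
    # the minimum value comes first within each group.
    sorted_keys = keys[order]
    is_first = np.ones(len(order), dtype=bool)
    is_first[1:] = np.any(sorted_keys[1:] != sorted_keys[:-1], axis=1)

    # Map the selected sorted positions back to the original rows.
    return [data[i] for i in order[is_first]]

data = [[1, 1, 1, 'a', 1],
        [0, 1, 1, 'b', 1],
        [0, 3, 1, 'c', 4],
        [3, 1, 1, 'd', 1],
        [4, 3, 1, 'e', 4]]

result = min_rows_per_group(data)
print(result)
```

On the example dataset from the question this selects the 'b' and 'c' rows, matching desired_output. Groups come back ordered by their numeric key values rather than by the string order used in the question's code, and ties on column 0 keep the earliest row because the sort is stable.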

Another answer