Numpy sort ndarray on multiple columns

Posted 2023-02-23

【Title】Numpy sort ndarray on multiple columns 【Posted】2015-06-03 20:10:14 【Question】:

I get an ndarray by reading it from a file, like this:

my_data = np.genfromtxt(input_file, delimiter='\t', skip_header=0)

Example input (parsed)

[[   2.    1.    2.    0.]
 [   2.    2.  100.    0.]
 [   2.    3.  100.    0.]
 [   3.    1.    2.    0.]
 [   3.    2.    4.    0.]
 [   3.    3.    6.    0.]
 [   4.    1.    2.    0.]
 [   4.    2.    4.    0.]
 [   4.    3.    6.    0.]]

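A longer example input (unparsed) is also available.

The first two columns should be int and the last two should be float, but this is what I get. Suggestions are welcome.

The main problem: I am trying to sort the array with Numpy so that the rows are ordered by the number in the second column first, and then by the number in the first column.

Example of the desired output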

[[   2.    1.    2.    0.]
 [   3.    1.    2.    0.]
 [   4.    1.    2.    0.]
 [   2.    2.  100.    0.]
 [   3.    2.    4.    0.]
 [   4.    2.    4.    0.]
 [   2.    3.  100.    0.]
 [   3.    3.    6.    0.]
 [   4.    3.    6.    0.]]

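I am aware of this answer, which works for sorting rows on a single column.

I tried sorting on the second column only, since the first column is already sorted, but that is not enough. Sometimes the first column gets reordered as well, which is bad.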

new_data = my_data[my_data[:, 1].argsort()]
print(new_data)

#output
[[   2.    1.    2.    0.]
 [   4.    1.    2.    0.] #ouch
 [   3.    1.    2.    0.] #ouch
 [   2.    2.  100.    0.]
 [   3.    2.    4.    0.]
 [   4.    2.    4.    0.]
 [   2.    3.  100.    0.]
 [   3.    3.    6.    0.]
 [   4.    3.    6.    0.]]

I also checked this question.

The answer there says:

The problem here is that np.lexsort or np.sort do not work on arrays of dtype object. To get around that, you can sort rows_list before creating order_list:

import operator
rows_list.sort(key=operator.itemgetter(0,1,2))

But the sort function of an ndarray has no key argument. Merging fields is not an option in my case.

Also, I have no header, so if I try to sort using the order argument, I get an error.

ValueError: Cannot specify order when the array has no fields.

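I would prefer to sort in place, or at least to get a result of the same ndarray type. I then want to save it to a file.

How can I do this without messing up the data types?

【Discussion】:

【Answer 1】:

Sort a numpy ndarray by column 1, 2 or 3: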

>>> a = np.array([[1,30,200], [2,20,300], [3,10,100]])

>>> a
array([[  1,  30, 200],         
       [  2,  20, 300],          
       [  3,  10, 100]])

>>> a[a[:,2].argsort()]           #sort by the 3rd column ascending
array([[  3,  10, 100],
       [  1,  30, 200],
       [  2,  20, 300]])

>>> a[a[:,2].argsort()][::-1]     #sort by the 3rd column descending
array([[  2,  20, 300],
       [  1,  30, 200],
       [  3,  10, 100]])

>>> a[a[:,1].argsort()]        #sort by the 2nd column ascending
array([[  3,  10, 100],
       [  2,  20, 300],
       [  1,  30, 200]])

What is happening here: argsort() returns an array of integer indices that would put its parent array in sorted order: https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

>>> x = np.array([15, 30, 4, 80, 6])
>>> np.argsort(x)
array([2, 4, 0, 1, 3])
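To make the link between the index array and the sorted result explicit, here is a minimal sketch that reuses the arrays from this answer. The variable names idx, sorted_x and by_third are only illustrative.

```python
import numpy as np

x = np.array([15, 30, 4, 80, 6])
idx = np.argsort(x)        # positions of the elements in ascending order
sorted_x = x[idx]          # indexing with those positions yields the sorted values

a = np.array([[1, 30, 200], [2, 20, 300], [3, 10, 100]])
# argsort of one column gives a row order; indexing with it moves whole rows
by_third = a[a[:, 2].argsort()]

print(idx)       # [2 4 0 1 3]
print(sorted_x)  # [ 4  6 15 30 80]
```

The same pattern, computing an index array and then indexing the rows with it, is what every example in this answer relies on.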

Sort by column 1, then by column 2, then by column 3 (np.lexsort treats the last key it is given as the primary key):

>>> a = np.array([[2,30,200], [1,30,200], [1,10,200]])

>>> a
array([[  2,  30, 200],
       [  1,  30, 200],
       [  1,  10, 200]])

>>> a[np.lexsort((a[:,2], a[:,1],a[:,0]))]
array([[  1,  10, 200],
       [  1,  30, 200],
       [  2,  30, 200]])
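The sample array above has the same value in every row of its third column, so it cannot show which key takes priority. The sketch below uses a different, made-up array in which the two key orders give different results; it assumes nothing beyond np.lexsort itself.

```python
import numpy as np

# Made-up data chosen so that the key order changes the result
a = np.array([[2, 10, 300],
              [1, 30, 100],
              [1, 20, 200]])

# Last key (column 0) is the primary key, then column 1, then column 2
primary_first_col = np.lexsort((a[:, 2], a[:, 1], a[:, 0]))

# Last key (column 2) is the primary key, then column 1, then column 0
primary_third_col = np.lexsort((a[:, 0], a[:, 1], a[:, 2]))

print(primary_first_col)  # [2 1 0]
print(primary_third_col)  # [1 2 0]
```

In other words, keys are listed from least significant to most significant.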

Same as above, but reversed:

>>> a[np.lexsort((a[:,2], a[:,1],a[:,0]))][::-1]
array([[  2,  30, 200],
       [  1,  30, 200],
       [  1,  10, 200]])
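For descending order the reversal can be applied either to the sorted rows or to the index array before indexing. The sketch below checks that both give the same rows, and adds a third variant (negating the key, which assumes a signed numeric column) that is not part of the original answer.

```python
import numpy as np

a = np.array([[1, 30, 200], [2, 20, 300], [3, 10, 100]])
idx = a[:, 2].argsort()

desc_rows_reversed = a[idx][::-1]    # sort ascending, then reverse the rows
desc_index_reversed = a[idx[::-1]]   # reverse the index, then pick the rows

# Variant: sort on the negated key with a stable sort. Unlike the two
# reversals above, this keeps tied rows in their original relative order.
desc_negated = a[(-a[:, 2]).argsort(kind="stable")]

print(np.array_equal(desc_rows_reversed, desc_index_reversed))  # True
```

Both reversals are cheap slicing operations, so any speed difference between them is likely to be small; timing them on real data is the only reliable way to tell.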

【Discussion】:

Could a[a[:,2].argsort()[::-1]] be used instead of a[a[:,2].argsort()][::-1]? Would it be more efficient?

【Answer 2】:

Import the data, letting Numpy guess the types, and sort in place:

import numpy as np

# let numpy guess the type with dtype=None
my_data = np.genfromtxt(infile, dtype=None, names=["a", "b", "c", "d"])

# access columns by name
print(my_data["b"]) # column 1

# sort column 1 and column 0 
my_data.sort(order=["b", "a"])

# save specifying required format (tab separated values)
np.savetxt("sorted.tsv", my_data, fmt="%d\t%d\t%.6f\t%.6f")

Alternatively, specify the input format and sort into a new array:

import numpy as np

# tell numpy the first 2 columns are int and the last 2 are floats
my_data = np.genfromtxt(infile, dtype=[('a', '<i8'), ('b', '<i8'), ('x', '<f8'), ('d', '<f8')])

# access columns by name
print(my_data["b"]) # column 1

# get the indices to sort the array using lexsort
# the last element of the tuple (column 1) is used as the primary key
ind = np.lexsort((my_data["a"], my_data["b"]))

# create a new, sorted array
sorted_data = my_data[ind]

# save specifying required format (tab separated values)
np.savetxt("sorted.tsv", sorted_data, fmt="%d\t%d\t%.6f\t%.6f")

Output:

2   1   2.000000    0.000000
3   1   2.000000    0.000000
4   1   2.000000    0.000000
2   2   100.000000  0.000000
3   2   4.000000    0.000000
4   2   4.000000    0.000000
2   3   100.000000  0.000000
3   3   6.000000    0.000000
4   3   6.000000    0.000000
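The snippets in this answer depend on an input file that is not shown. As a self-contained sketch, the version below replaces the file with an in-memory io.StringIO buffer holding three made-up tab-separated rows, and writes the result to another buffer instead of sorted.tsv. The field names a, b, c, d are the same arbitrary labels used above.

```python
import io
import numpy as np

# Hypothetical stand-in for the input file: three tab-separated rows
text = "3\t2\t4.0\t0.0\n2\t2\t100.0\t0.0\n4\t1\t2.0\t0.0\n"

# First two columns as integers, last two as floats
dtype = [("a", "<i8"), ("b", "<i8"), ("c", "<f8"), ("d", "<f8")]
my_data = np.genfromtxt(io.StringIO(text), dtype=dtype, delimiter="\t")

# In-place sort: field "b" is the primary key, field "a" breaks ties
my_data.sort(order=["b", "a"])

# Write to an in-memory buffer instead of a file on disk
out = io.StringIO()
np.savetxt(out, my_data, fmt="%d\t%d\t%.6f\t%.6f")
result = out.getvalue()
print(result)
```

Because the array is structured, the integer columns stay integers through the sort and only the format string decides how each field is written.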

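【Discussion】:

【Answer 3】:

With np.lexsort you can sort on several columns at once. The columns you want to sort by have to be passed in reverse order. That means np.lexsort((col_b,col_a)) sorts by col_a first and then by col_b: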

my_data = np.array([[   2.,    1.,    2.,    0.],
                    [   2.,    2.,  100.,    0.],
                    [   2.,    3.,  100.,    0.],
                    [   3.,    1.,    2.,    0.],
                    [   3.,    2.,    4.,    0.],
                    [   3.,    3.,    6.,    0.],
                    [   4.,    1.,    2.,    0.],
                    [   4.,    2.,    4.,    0.],
                    [   4.,    3.,    6.,    0.]])

ind = np.lexsort((my_data[:,0],my_data[:,1]))
my_data[ind]

Result:

array([[  2.,   1.,   2.,   0.],
       [  3.,   1.,   2.,   0.],
       [  4.,   1.,   2.,   0.],
       [  2.,   2., 100.,   0.],
       [  3.,   2.,   4.,   0.],
       [  4.,   2.,   4.,   0.],
       [  2.,   3., 100.,   0.],
       [  3.,   3.,   6.,   0.],
       [  4.,   3.,   6.,   0.]])
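The question also asks for an in-place result and for saving to a file. One possible sketch, assuming a plain float array and a shortened made-up input, is to assign the reordered rows back into the same array and to let the savetxt format string print the first two columns as integers. The names original and out are illustrative only.

```python
import io
import numpy as np

my_data = np.array([[2., 1.,   2., 0.],
                    [2., 2., 100., 0.],
                    [3., 1.,   2., 0.],
                    [3., 2.,   4., 0.]])
original = my_data  # second reference, to show the array object is reused

ind = np.lexsort((my_data[:, 0], my_data[:, 1]))
# my_data[ind] is a fresh copy, so writing it back over my_data is safe
my_data[:] = my_data[ind]

# The array stays float64; %d only affects how columns 0 and 1 are printed
out = io.StringIO()
np.savetxt(out, my_data, fmt="%d\t%d\t%.6f\t%.6f")
print(out.getvalue())
```

This still builds a temporary copy during the assignment, so it reuses the array object rather than avoiding the extra memory.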

If you know that your first column is already sorted, you can use:

ind = my_data[:,1].argsort(kind='stable')
my_data[ind]

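This ensures that the original order is preserved among equal items. The quicksort algorithm that is normally used does not guarantee this, although it is faster.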
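As a quick check of that claim, the sketch below compares the stable single-column argsort with the two-key np.lexsort on the sample data from the question. It relies on the first column already being in ascending order.

```python
import numpy as np

my_data = np.array([[2., 1.,   2., 0.],
                    [2., 2., 100., 0.],
                    [2., 3., 100., 0.],
                    [3., 1.,   2., 0.],
                    [3., 2.,   4., 0.],
                    [3., 3.,   6., 0.],
                    [4., 1.,   2., 0.],
                    [4., 2.,   4., 0.],
                    [4., 3.,   6., 0.]])

# Stable sort on column 1 keeps tied rows in their existing (column 0) order
stable_ind = my_data[:, 1].argsort(kind="stable")

# Two-key sort: column 1 is the primary key, column 0 breaks ties
lex_ind = np.lexsort((my_data[:, 0], my_data[:, 1]))

print(stable_ind)                           # [0 3 6 1 4 7 2 5 8]
print(np.array_equal(stable_ind, lex_ind))  # True
```

If the first column were not already sorted, only the lexsort version would give the requested order.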

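【Discussion】:

Is the my_data you use the same as in the other examples here? If so, please paste it as input to complete your answer. Thanks.

Yes. I have added the input.

【Answer 4】:

This method works for any numpy array: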

import numpy as np

my_data = [[   2.,    1.,    2.,    0.],
           [   2.,    2.,  100.,    0.],
           [   2.,    3.,  100.,    0.],
           [   3.,    1.,    2.,    0.],
           [   3.,    2.,    4.,    0.],
           [   3.,    3.,    6.,    0.],
           [   4.,    1.,    2.,    0.],
           [   4.,    2.,    4.,    0.],
           [   4.,    3.,    6.,    0.]]
my_data = np.array(my_data)
# np.rec is the public alias of np.core.records (np.core is deprecated in NumPy 2.x)
r = np.rec.fromarrays([my_data[:,1],my_data[:,0]],names='a,b')
my_data = my_data[r.argsort()]
print(my_data)

Result:

[[  2.   1.   2.   0.]
 [  3.   1.   2.   0.]
 [  4.   1.   2.   0.]
 [  2.   2. 100.   0.]
 [  3.   2.   4.   0.]
 [  4.   2.   4.   0.]
 [  2.   3. 100.   0.]
 [  3.   3.   6.   0.]
 [  4.   3.   6.   0.]]

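【Discussion】:

Your input and output look the same. What is being sorted here?

Oops, I forgot to swap [my_data[:,0],my_data[:,1]] from my code snippet to match the requested 1,0 order. Thanks, updated.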
