Naturally sorting Pandas DataFrame

Posted 2023-03-11

[Title] Naturally sorting Pandas DataFrame [Posted] 2015-06-17 07:49:56 [Question]:

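I have a pandas DataFrame with an index that I want to sort naturally. natsort doesn't seem to work. Sorting the index before building the DataFrame doesn't seem to help, because the operations I perform on the DataFrame seem to mess up the ordering along the way. Any ideas on how to sort the index naturally?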

from natsort import natsorted
import pandas as pd

# An unsorted list of strings
a = ['0hr', '128hr', '72hr', '48hr', '96hr']
# Sorted incorrectly
b = sorted(a)
# Naturally Sorted 
c = natsorted(a)

# Use a as the index for a DataFrame
df = pd.DataFrame(index=a)
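# Sorted incorrectly (lexicographic order); DataFrame.sort() was removed
# from later pandas versions, so sort_index() is used here
df2 = df.sort_index()
# Natsort doesn't seem to work: natsorted(df) returns a plain list, not a DataFrame
df3 = natsorted(df)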

print(a)
print(b)
print(c)
print(df.index)
print(df2.index)
print(df3)  # df3 is a list, so it has no .index attribute
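To make the difference between the two orderings in the question concrete, here is a minimal standard-library sketch. The helper `natural_key` is a hypothetical name and a deliberately simplified stand-in; it is not how natsort is implemented. It splits each string into digit and non-digit runs so that the digit runs compare as integers.

```python
import re


def natural_key(text):
    """Split text into digit and non-digit runs; digit runs compare as ints."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", text)]


a = ['0hr', '128hr', '72hr', '48hr', '96hr']

# Plain sorted() compares character by character, so '128hr' lands before '48hr'
lexicographic = sorted(a)
# With the key, the leading numbers are compared as integers
natural = sorted(a, key=natural_key)

print(lexicographic)
print(natural)
```

A plain `sort_index()` on string labels behaves like the first ordering, which is why `df2` in the question comes out in the "wrong" order.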

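[Comments]:

@sethMMorton I think I would expect df3.index to be the same as c, with the data sorted along with it so that it stays aligned with its index values.

It would be nice if pd.sort had a key option, but it doesn't. This answer provides a workaround that lets you pass a key generated by natsort_keygen.

I just made a formal request to the pandas developers to add key to the sort method: github.com/pydata/pandas/issues/9855

My issue above was a dupe; the active issue is github.com/pydata/pandas/issues/3942

Now that pandas has a key argument to sort_values, ***.com/a/63890954/1399279 should now be the accepted answer.

[Answer 1]: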

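If you want to sort the df, just sort the index or the data and assign the result directly to the df's index, rather than trying to pass the df as an argument, since that yields an empty list: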

In [7]:

df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')

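Note that df.index = natsorted(df.index) also works.

If you pass the df as an argument it yields an empty list, in this case because the df is empty (it has no columns); otherwise it would return the sorted column labels, which is not what you want: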

In [10]:

natsorted(df)
Out[10]:
[]
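The empty list follows from how a DataFrame iterates: looping over it yields its column labels, not its rows or its index. The sketch below uses only pandas to show this; the column names `'b'` and `'a'` and their values are made up for illustration.

```python
import pandas as pd

a = ['0hr', '128hr', '72hr', '48hr', '96hr']

# A frame with an index but no columns: iterating yields nothing
empty = pd.DataFrame(index=a)
empty_labels = list(empty)

# A frame with columns: iterating yields the column labels only
with_cols = pd.DataFrame({'b': [1] * 5, 'a': [2] * 5}, index=a)
col_labels = list(with_cols)
sorted_labels = sorted(with_cols)

print(empty_labels)
print(col_labels)
print(sorted_labels)
```

Any sorting function that simply iterates its argument, `natsorted` included, therefore only ever sees the column labels.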

EDIT

If you want to sort the index so that the data is reordered along with it, use reindex:

In [13]:

import numpy as np
df = pd.DataFrame(index=a, data=np.arange(5))
df
Out[13]:
       0
0hr    0
128hr  1
72hr   2
48hr   3
96hr   4
In [14]:

df = df*2
df
Out[14]:
       0
0hr    0
128hr  2
72hr   4
48hr   6
96hr   8
In [15]:

df.reindex(index=natsorted(df.index))
Out[15]:
       0
0hr    0
48hr   6
72hr   4
96hr   8
128hr  2

Note that you have to assign the result of reindex to a new df or back to itself; it does not accept an inplace argument.

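[Comments]:

Hi, natsort developer here. natsort currently has no explicit support for handling whole DataFrame objects. What is the expected output when passing a DataFrame object?

I think this misses the point. I realize I can naturally sort a and use it as the index, but my actual code messes up the ordering of the DataFrame's index because of the operations I perform on the DataFrame. I need the index and the associated data in the DataFrame.

So what are you asking here: do you want to natsort the index after the data manipulation? You can use reindex and call natsorted on the index: df.reindex(index=natsorted(df.index))

@EdChum Yes, that sounds like exactly what they want. I think this is ultimately the correct answer.

@SethMMorton Sorry, reindex is one of the few functions that doesn't accept an inplace argument, so yes, you have to assign it back to itself.

[Answer 2]: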

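Now that `pandas` supports `key` in both `sort_values` and `sort_index`, you should refer to this other answer and send all upvotes there, since it is now the correct answer.

I will leave my answer here for those stuck on old pandas versions, or as a historical curiosity.

The accepted answer answers the question as asked. I would also like to add how to use natsort on columns in a DataFrame, since that will be the next question asked.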

In [1]: from pandas import DataFrame

In [2]: from natsort import natsorted, index_natsorted, order_by_index

In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

In [4]: df
Out[4]: 
         a   b
0hr     a5  b1
128hr   a1  b1
72hr   a10  b2
48hr    a2  b2
96hr   a12  b1

As the accepted answer shows, sorting by the index is pretty simple:

In [5]: df.reindex(index=natsorted(df.index))
Out[5]: 
         a   b
0hr     a5  b1
48hr    a2  b2
72hr   a10  b2
96hr   a12  b1
128hr   a1  b1

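If you want to sort on a column in the same manner, you need to reorder the index by the order in which the desired column would be sorted. natsort provides the convenience functions index_natsorted and order_by_index to do just that.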

In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
Out[6]: 
         a   b
128hr   a1  b1
48hr    a2  b2
0hr     a5  b1
72hr   a10  b2
96hr   a12  b1

In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
Out[7]: 
         a   b
0hr     a5  b1
128hr   a1  b1
96hr   a12  b1
72hr   a10  b2
48hr    a2  b2

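If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip (or itertools.izip on Python 2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then the secondary, then the tertiary, and so on.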

In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
Out[8]: 
         a   b
128hr   a1  b1
0hr     a5  b1
96hr   a12  b1
48hr    a2  b2
72hr   a10  b2

In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
Out[9]: 
         a   b
0hr     a5  b1
96hr   a12  b1
128hr   a1  b1
48hr    a2  b2
72hr   a10  b2

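The pandas developers told me about an alternative method using Categorical objects, which is the "proper" way to do this. It requires (as far as I can see) pandas >= 0.16.0. Currently it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex, which will allow this method to be used on an index.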

In [1]: from pandas import DataFrame

In [2]: from natsort import natsorted

In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

In [4]: df.a = df.a.astype('category')

In [5]: df.a.cat.reorder_categories(natsorted(df.a), inplace=True, ordered=True)

In [6]: df.b = df.b.astype('category')

In [8]: df.b.cat.reorder_categories(natsorted(set(df.b)), inplace=True, ordered=True)

In [9]: df.sort('a')
Out[9]: 
         a   b
128hr   a1  b1
48hr    a2  b2
0hr     a5  b1
72hr   a10  b2
96hr   a12  b1

In [10]: df.sort('b')
Out[10]: 
         a   b
0hr     a5  b1
128hr   a1  b1
96hr   a12  b1
72hr   a10  b2
48hr    a2  b2

In [11]: df.sort(['b', 'a'])
Out[11]: 
         a   b
128hr   a1  b1
0hr     a5  b1
96hr   a12  b1
48hr    a2  b2
72hr   a10  b2

The Categorical object lets you define the sort order that the DataFrame uses. The elements given when calling reorder_categories must be unique, hence the call to set for column "b".

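I leave it to the user to decide whether this is better than the reindex method, since it requires you to sort the column data independently before sorting within the DataFrame (although I imagine the second sort is rather efficient).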

Full disclosure: I am the natsort author.

[Comments]:

[Answer 3]:

Use `sort_values` for `pandas >= 1.1.0`

With the new key argument in DataFrame.sort_values, available since pandas 1.1.0, we can sort a column directly with natsort.natsort_keygen, without having to set it as the index:

import pandas as pd

df = pd.DataFrame({
    "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
    "value": [10, 20, 30, 40, 50]
})

    time  value
0    0hr     10
1  128hr     20
2   72hr     30
3   48hr     40
4   96hr     50

from natsort import natsort_keygen

df.sort_values(
    by="time",
    key=natsort_keygen()
)

    time  value
0    0hr     10
3   48hr     40
2   72hr     30
4   96hr     50
1  128hr     20

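[Comments]:

This proposed solution is a bit of a "maximum effort" solution. Wouldn't key=natsort_keygen() be less effort?

Agreed, updated my answer accordingly. Thanks for the heads-up and for writing a nice package :)

@SethMMorton If I try to sort on 2 columns of different types, e.g. df.sort_values(['Title', 'Copies'], ascending=[False, True], key=natsort_keygen()), I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I got the DataFrame from pd.read_csv, providing column names and types. Any idea how to fix it?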
