Shuffling/permuting a DataFrame in pandas

Posted 2023-03-11


【Title】: shuffling/permutating a DataFrame in pandas 【Posted】: 2013-03-24 05:07:34 【Question】:

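What's a simple and efficient way to shuffle a data frame in pandas, by rows or by columns? I.e., how do you write a function shuffle(df, n, axis=0) that takes a data frame, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the data frame that has been shuffled n times?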

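Edit: the key is to do this without destroying the row/column labels of the data frame. If you just shuffle df.index, you lose all that information. I want the resulting df to be the same as the original, except with the rows or the columns in a different order.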

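Edit2: my question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't keep the same associations between a and b as you would if you just reordered each row as a whole. Something like: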

for 1...n:
  for each col in df: shuffle column
return new_df

But hopefully more efficient than naive looping. This does not work for me:

def shuffle(df, n, axis=0):
        shuffled_df = df.copy()
        for k in range(n):
            shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
        return shuffled_df

df = pandas.DataFrame({'A': range(10), 'B': range(10)})
shuffle(df, 5)

【Question comments】:

See this simple pandas solution below

^ Your answer does answer the question, but it doesn't seem to be the answer people are looking for

【Answer 1】:

Use numpy's random.permutation function:

In [1]: df = pd.DataFrame({'A': range(10), 'B': range(10)})

In [2]: df
Out[2]:
   A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9


In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
   A  B
0  0  0
5  5  5
6  6  6
3  3  3
8  8  8
7  7  7
9  9  9
1  1  1
2  2  2
4  4  4
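
A minimal sketch of the same idea with a seeded generator, so the result can be reproduced. The helper name shuffle_rows and the seed parameter are illustrative and not part of the answer. It permutes integer positions with iloc instead of labels, which assumes nothing about the index being unique.

```python
import numpy as np
import pandas as pd


def shuffle_rows(df, seed=None):
    """Return df with whole rows in random order; each row keeps its label."""
    rng = np.random.default_rng(seed)
    # Permute integer positions rather than labels, so duplicate labels are safe.
    order = rng.permutation(len(df))
    return df.iloc[order]


df = pd.DataFrame({'A': range(10), 'B': range(10)})
shuffled = shuffle_rows(df, seed=0)
print(shuffled)
```

Because rows move as a unit, A and B still match on every row; only the row order differs.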

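【Comments】:

+1 because this is exactly what I wanted (although it turns out it's not what the OP wanted)

If there are dupes or the like, you could also use df.iloc[np.random.permutation(np.arange(len(df)))] (it may be faster for a MultiIndex too).

Nice approach. Is there a way to do this in place?

For me (Python v3.6 and Pandas v0.20.1) I had to replace df.reindex(np.random.permutation(df.index)) with df.set_index(np.random.permutation(df.index)) to get the desired effect.

After set_index like Emanuel, I also needed df.sort_index(inplace=True)

【Answer 2】: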

In [16]: def shuffle(df, n=1, axis=0):     
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:     

In [17]: df = pd.DataFrame({'A': range(10), 'B': range(10)})

In [18]: shuffle(df)

In [19]: df
Out[19]: 
   A  B
0  8  5
1  1  7
2  7  3
3  6  2
4  3  4
5  0  1
6  9  0
7  4  6
8  2  8
9  5  9
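
Whether df.apply(np.random.shuffle, ...) changes the frame appears to depend on the pandas version. A hedged alternative sketch that does not rely on apply mutating anything: assign a permuted copy of each column back explicitly, which is the loop from the question's pseudocode. The name shuffle_each_column and the seed argument are assumptions for illustration, and only the per-column case is covered.

```python
import numpy as np
import pandas as pd


def shuffle_each_column(df, n=1, seed=None):
    """Shuffle every column independently, n times; labels stay in place."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _ in range(n):
        # Assumes unique column names, so out[col] is a single Series.
        for col in out.columns:
            # Assigning a plain array is positional, so no index alignment
            # can undo the shuffle.
            out[col] = rng.permutation(out[col].to_numpy())
    return out


df = pd.DataFrame({'A': range(10), 'B': range(10)})
res = shuffle_each_column(df, n=3, seed=1)
print(res)
```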

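【Comments】:

How do you distinguish between row and column shuffling here?

Warning: I thought df.apply(np.random.permutation) would work as the solution df.reindex(np.random.permutation(df.index)) and looked neater, but they actually behave differently. The latter keeps the association between the columns of the same row; the former does not. My misunderstanding, of course, but hopefully it helps others avoid the same mistake.

What is "np" in this context?

numpy. It is common to do: import numpy as np

I only want to shuffle once, so I just used df.apply(np.random.shuffle, index=1), but this doesn't seem to do anything; the printed df looks exactly the same as the input. If I do df = df.apply( ... ) I get a Series with NaNs. If I do df.apply( ... inplace=True) then I get an error.

【Answer 3】: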

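I tweaked @root's answer slightly and used the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.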

In [1]: import numpy

In [2]: import pandas

In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})

In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop

In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
   ...: 
10000 loops, best of 3: 22.8 µs per loop

In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop

In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
   ...: 
10000 loops, best of 3: 23.4 µs per loop

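Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions. That is, if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.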

In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)

In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)

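Your final function then uses a trick to bring the result in line with what you would expect from applying a function to an axis: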

def shuffle(df, n=1, axis=0):     
    df = df.copy()
    axis = int(not axis) # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df
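
The function above writes into df.values in place, which assumes that attribute is a writable view of the frame's data; that may not hold for mixed dtypes or when copy-on-write is enabled. A sketch under different assumptions: copy the values out, let NumPy's Generator.permuted (NumPy 1.20 or later) shuffle every slice along the axis independently, and rebuild the frame with the original labels. The name shuffle_values and the seed parameter are illustrative only.

```python
import numpy as np
import pandas as pd


def shuffle_values(df, n=1, axis=0, seed=None):
    """Shuffle values independently along an axis, keeping all labels."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy(copy=True)  # private copy, the input is untouched
    for _ in range(n):
        # axis=0: each column is shuffled on its own; axis=1: each row.
        values = rng.permuted(values, axis=axis)
    return pd.DataFrame(values, index=df.index, columns=df.columns)


df = pd.DataFrame({"A": range(10), "B": range(10, 20)})
by_col = shuffle_values(df, n=5, axis=0, seed=0)
by_row = shuffle_values(df, n=5, axis=1, seed=0)
```

The axis convention matches the function above: axis=0 moves values up and down within a column, axis=1 moves them left and right within a row.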

【Comments】:

【Answer 4】:

This might be more useful when you want your index shuffled.

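import random

def shuffle(df):
    index = list(df.index)
    random.shuffle(index)           # shuffle the row labels in place
    df = df.loc[index]              # .ix was removed in pandas 1.0; .loc selects by label
    df = df.reset_index(drop=True)  # reset_index returns a new frame, so assign it
    return df

It selects a new df using the shuffled index, then resets the index.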

【Comments】:

【Answer 5】:

From the docs, use sample():

In [79]: s = pd.Series([0,1,2,3,4,5])

# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]: 
0    0
dtype: int64

# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]: 
5    5
2    2
4    4
dtype: int64

# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]: 
5    5
4    4
1    1
dtype: int64

【Comments】:

【Answer 6】:

Sampling randomizes, so just sample the entire data frame.

df.sample(frac=1)

As @Corey Levinson noted, you have to be careful when reassigning:

df['column'] = df['column'].sample(frac=1).reset_index(drop=True)
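
A short sketch of why the reset_index(drop=True) matters, assuming a frame with the default 0..n-1 index. The variable names are illustrative; random_state is only there to make the run repeatable.

```python
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': range(5, 10)})

# Whole rows in random order; each row keeps its label.
rows = df.sample(frac=1, random_state=0)
# axis=1 samples columns instead, so the column order is shuffled.
cols = df.sample(frac=1, axis=1, random_state=0)

# Assigning a shuffled Series back aligns on the index, so nothing changes.
aligned = df.copy()
aligned['A'] = aligned['A'].sample(frac=1, random_state=0)

# Dropping the shuffled labels first makes the new order stick.
reassigned = df.copy()
reassigned['A'] = reassigned['A'].sample(frac=1, random_state=0).reset_index(drop=True)
```

With a non-default index, assigning the values as a plain array (for example via .to_numpy()) avoids alignment in the same way.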

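【Comments】:

Note that if you are trying to reassign a column with this, you have to do df['column'] = df['column'].sample(frac=1).reset_index(drop=True)

【Answer 7】:

If you only want to shuffle a subset of the DataFrame, I found a workaround: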

shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])
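
For reference, a self-contained version of the same workaround on a 30-row frame. The frame, the seeded generator, and the names head, tail, and result are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(30), 'B': range(30)})
shuffle_to_index = 20
rng = np.random.default_rng(0)

# Shuffle only the first 20 rows by position; leave the rest in order.
head = df.iloc[rng.permutation(shuffle_to_index)]
tail = df.iloc[shuffle_to_index:]
result = pd.concat([head, tail])
```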

【Comments】:

【Answer 8】:

You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support pandas data frames):

# Generate data
import pandas as pd
df = pd.DataFrame({'A': range(5), 'B': range(5)})
print('df: {0}'.format(df))

# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))

Output:

df:    A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4


df:    A  B
1  1  1
0  0  0
3  3  3
4  4  4
2  2  2

Then you can use df.reset_index() to reset the index column, if needed:

df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))

Output:

df:    A  B
0  1  1
1  0  0
2  4  4
3  2  2
4  3  3

【Comments】:

FYI, df.sample(frac=1) is slightly faster (76.9 vs. 78.9 ms for 400k rows).

【Answer 9】:

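I know the question is about a pandas df, but if the shuffle happens by row (column order changed, row order unchanged), then the column names no longer matter and it could be interesting to use an np.array instead; np.apply_along_axis() will then be what you are looking for.

If that is acceptable, then this would be helpful. Note that it is easy to switch the axis along which the data is shuffled.

If your pandas data frame is named df, maybe you can get its values as an np.array with:

values = df.values

Original array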

a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]

Keep the row order, shuffle the columns within each row

print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
 [22 21 20]
 [31 30 32]
 [40 41 42]]

Keep the column order, shuffle the rows within each column

print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
 [20 31 42]
 [10 11 12]
 [30 21 22]]

The original array is unchanged

print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]
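
A sketch of the full round trip, from a data frame to an array and back. The frame and the names values, shuffled, and new_df are made up for the example, and a seeded generator replaces np.random.permutation for repeatability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40],
                   'y': [11, 21, 31, 41],
                   'z': [12, 22, 32, 42]})
rng = np.random.default_rng(0)

values = df.to_numpy()
# axis=1: permute within each row; use axis=0 to permute within each column.
shuffled = np.apply_along_axis(rng.permutation, 1, values)
# Rebuild a frame; after a within-row shuffle the column names are just slots.
new_df = pd.DataFrame(shuffled, index=df.index, columns=df.columns)
```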

【Comments】:

【Answer 10】:

A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [1, 2, 3, 4, 5, 6]})
df

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6

df.apply(lambda x: x.sample(frac=1).values)

   a  b
0  4  2
1  1  6
2  6  5
3  5  3
4  2  4
5  3  1

You must use .values so that a numpy array is returned and not a Series; otherwise the returned Series will align to the original DataFrame and nothing changes:

df.apply(lambda x: x.sample(frac=1))

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6

【Comments】:
