Easy multidimensional numpy ndarray to pandas DataFrame method?

Posted 2023-03-12


Asked: 2016-08-19 14:42:17

Question:

I have a 4-D numpy.ndarray, e.g.

myarr = np.random.rand(10,4,3,2)
dims = {'time': range(1,11), 'sub': range(1,5), 'cond': ['A','B','C'], 'measure': ['meas1','meas2']}

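but it could have more dimensions. How can I create a pandas.DataFrame with a MultiIndex just by passing the dimensions as indexes, without further manual adjustment (reshaping the ndarray into a 2-D shape)?

I can't fully wrap my head around reshaping, not even really in 3 dimensions, so I'm looking for an "automatic" method if possible.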

What would be a function to which I pass the column/row indexes and which creates the DataFrame? Something like:

df=nd2df(myarr,dim2row=[0,1],dim2col=[2,3],rowlab=['time','sub'],collab=['cond','measure'])

and get something like:

              meas1             meas2
              A     B     C     A    B    C
sub   time
  1      1
         2
         3
         .
         .
  2      1
         2
 ...

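If it is not possible or feasible to automate this, an explanation that is less terse than the MultiIndexing manual would be appreciated.

I can't even get it right when I don't care about the order of the dimensions, e.g. I expected this to work: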

import numpy as np
import pandas as pd

a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index=pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])

pd.DataFrame(a.reshape(2*3*1,2*2),index)

which gives:

ValueError: Shape of passed values is (4, 6), indices imply (4, 24)


Answer 1:

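You get the error because you reshaped the ndarray to 6x4 and then applied an index that is meant to capture all dimensions in a single series (24 entries, one per element). Here is a setup that gets your pet example working: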

a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index = pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])

pd.DataFrame(a.reshape(24, 1),index=index)
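
The rule behind the error is that the row index needs one entry per row and the column index one entry per column. As a minimal sketch (the split of the four dimensions into two row levels and two column levels is an assumption chosen to match the 6x4 reshape from the question, not something the answer prescribes), the original attempt can be made to work by dividing the iterables between the two axes:

```python
import numpy as np
import pandas as pd

a = np.arange(24).reshape((3, 2, 2, 2))
iterables = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]
names = ['time', 'sub', 'meas', 'cond']

# The first two dimensions label the 3*2 = 6 rows,
# the last two dimensions label the 2*2 = 4 columns.
row_index = pd.MultiIndex.from_product(iterables[:2], names=names[:2])
col_index = pd.MultiIndex.from_product(iterables[2:], names=names[2:])

# reshape uses C order, which matches the ordering of from_product
# (the first level varies slowest).
df = pd.DataFrame(a.reshape(6, 4), index=row_index, columns=col_index)
print(df.shape)  # (6, 4)
```

This only works without further reordering because the row dimensions come before the column dimensions in the array.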

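Solution

Here is a generic DataFrame creator that should get the job done (note that it builds an empty frame with the requested indexes):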

def produce_df(rows, columns, row_names=None, column_names=None):
    """rows is a list of lists that will be used to build a MultiIndex
    columns is a list of lists that will be used to build a MultiIndex"""
    row_index = pd.MultiIndex.from_product(rows, names=row_names)
    col_index = pd.MultiIndex.from_product(columns, names=column_names)
    return pd.DataFrame(index=row_index, columns=col_index)

Demo

Without named index levels

produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']])

       1         2     
       3    4    3    4
a c  NaN  NaN  NaN  NaN
  d  NaN  NaN  NaN  NaN
b c  NaN  NaN  NaN  NaN
  d  NaN  NaN  NaN  NaN

With named index levels

produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']],
           row_names=['alpha1', 'alpha2'], column_names=['number1', 'number2'])

number1          1         2     
number2          3    4    3    4
alpha1 alpha2                    
a      c       NaN  NaN  NaN  NaN
       d       NaN  NaN  NaN  NaN
b      c       NaN  NaN  NaN  NaN
       d       NaN  NaN  NaN  NaN
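
The frames above are empty. To fill them from an ndarray, the array axes have to be reordered so that the row dimensions come first, and then flattened to 2-D. The following is a rough sketch of the `nd2df` function the question asks for; the name comes from the question, but the signature (`labels`, `names`, `row_dims`, `col_dims`) is hypothetical and it assumes at least one row dimension and one column dimension:

```python
import numpy as np
import pandas as pd

def nd2df(arr, labels, names, row_dims, col_dims):
    """Hypothetical helper: put row_dims of arr on the rows, col_dims on the columns.

    labels[d] and names[d] describe axis d of arr.
    """
    # Reorder the axes: row dimensions first, then column dimensions.
    moved = np.transpose(arr, list(row_dims) + list(col_dims))
    n_rows = int(np.prod([arr.shape[d] for d in row_dims]))
    row_index = pd.MultiIndex.from_product(
        [labels[d] for d in row_dims], names=[names[d] for d in row_dims])
    col_index = pd.MultiIndex.from_product(
        [labels[d] for d in col_dims], names=[names[d] for d in col_dims])
    # C-order reshape lines up with the from_product ordering on both axes.
    return pd.DataFrame(moved.reshape(n_rows, -1), index=row_index, columns=col_index)

myarr = np.random.rand(10, 4, 3, 2)
labels = [list(range(1, 11)), list(range(1, 5)), ['A', 'B', 'C'], ['meas1', 'meas2']]
names = ['time', 'sub', 'cond', 'measure']

# Rows: (sub, time); columns: (measure, cond), as in the question's target layout.
df = nd2df(myarr, labels, names, row_dims=[1, 0], col_dims=[3, 2])
print(df.shape)  # (40, 6)
```

A quick spot check is to compare one cell with the array, e.g. `df.loc[(2, 5), ('meas2', 'B')]` against `myarr[4, 1, 1, 1]`.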

Comments:

Now I see the error about the extra dimension, thanks. Nice little function!

Answer 2:

Starting from your data structure,

names=['sub','time','measure','cond']  #ind1,ind2,col1,col2
labels=[[1,2,3],[1,2],['meas1','meas2'],list('ABC')]

a direct way to reach the goal:

index = pd.MultiIndex.from_product(labels,names=names)
data=np.arange(index.size) # or myarr.flatten(), provided its axes are in the same order as names

df=pd.DataFrame(data,index=index)
df22=df.reset_index().pivot_table(values=0,index=names[:2],columns=names[2:])


"""
measure  meas1         meas2        
cond         A   B   C     A   B   C
sub time                            
1   1        0   1   2     3   4   5
    2        6   7   8     9  10  11
2   1       12  13  14    15  16  17
    2       18  19  20    21  22  23
3   1       24  25  26    27  28  29
    2       30  31  32    33  34  35

"""
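
One thing to keep in mind is that `pivot_table` aggregates (by default with the mean), which is harmless here because every cell holds exactly one value. As an alternative sketch, not part of the original answer, the same layout can be obtained with a pure reshape by building a Series and calling `unstack` on the levels that should become columns:

```python
import numpy as np
import pandas as pd

names = ['sub', 'time', 'measure', 'cond']
labels = [[1, 2, 3], [1, 2], ['meas1', 'meas2'], list('ABC')]

index = pd.MultiIndex.from_product(labels, names=names)
s = pd.Series(np.arange(index.size), index=index)

# Move the last two index levels into the columns; no aggregation is involved.
df22 = s.unstack(names[2:])
print(df22.shape)  # (6, 6)
```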

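Comments:

Still a bit terse and detached from the concrete problem, but also very helpful, thanks. I have adapted it into a more useful, clearer (?) approach.

Cool, I didn't know the pivot_table method!

Answer 3: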

I still don't know how to do it directly, but here is an easy-to-follow step-by-step approach:

# Create 4D-array
a=np.arange(24).reshape((3,2,2,2))
# Set only one row index
rowiter=[[1,2,3]]
row_ind=pd.MultiIndex.from_product(rowiter, names=[u'time'])
# put the rest of the dimensions into columns
coliter=[[1,2],['m1','m2'],['A','B']]
col_ind=pd.MultiIndex.from_product(coliter, names=[u'sub',u'meas',u'cond'])
ncols=np.prod([len(coliter[x]) for x in range(len(coliter))])
b=pd.DataFrame(a.reshape(len(rowiter[0]),ncols),index=row_ind,columns=col_ind)
print(b)
# Reshape columns to rows as pleased:
b=b.stack('sub')
# switch levels and order in rows (level goes from inner to outer):
# (sort_index replaces sortlevel, which was removed in pandas 1.0)
c=b.swaplevel(0,1,axis=0).sort_index(level=0,axis=0)

Check that the dimensions are assigned correctly:

print(a[:,0,0,0])
[ 0  8 16]
print(a[0,:,0,0])
[0 4]
print(a[0,0,:,0])
[0 2]

print(b)
meas      m1      m2    
cond       A   B   A   B
time sub                
1    1     0   1   2   3
     2     4   5   6   7
2    1     8   9  10  11
     2    12  13  14  15
3    1    16  17  18  19
     2    20  21  22  23

print(c)
meas      m1      m2    
cond       A   B   A   B
sub time                
1   1      0   1   2   3
    2      8   9  10  11
    3     16  17  18  19
2   1      4   5   6   7
    2     12  13  14  15
    3     20  21  22  23
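
The spot checks above can also be automated. The sketch below rebuilds `b` and `c` in a slightly condensed form (using `sort_index` and a plain `pd.Index` for the single row level, both of which are choices made here) and compares every cell of `c` with the corresponding array element:

```python
import numpy as np
import pandas as pd

a = np.arange(24).reshape((3, 2, 2, 2))
labels = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]

col_ind = pd.MultiIndex.from_product(labels[1:], names=['sub', 'meas', 'cond'])
b = pd.DataFrame(a.reshape(3, -1),
                 index=pd.Index(labels[0], name='time'), columns=col_ind)
c = b.stack('sub').swaplevel(0, 1, axis=0).sort_index(level=0, axis=0)

# Walk over every position of the array and look up the matching cell by label.
mismatches = 0
for i, j, k, l in np.ndindex(a.shape):
    row = (labels[1][j], labels[0][i])  # (sub, time)
    col = (labels[2][k], labels[3][l])  # (meas, cond)
    if c.loc[row, col] != a[i, j, k, l]:
        mismatches += 1
print(mismatches)
```

A count of zero means that every label combination points at the element it is supposed to.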
