Two pandas MultiIndex frames multiply every row with every row

【Title】: Two pandas MultiIndex frames multiply every row with every row 【Posted】: 2017-07-18 14:54:58 【Question】:

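I need to multiply two MultiIndexed frames (say df1 and df2) that share the same top-level index, such that for each value of that top-level index, every row of df1 is multiplied element-wise with every row of df2. I have implemented the following example, which does what I want, but it looks quite ugly: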

a = ['alpha', 'beta']
b = ['A', 'B', 'C']
c = ['foo', 'bar']
df1 = pd.DataFrame(np.random.randn(6, 4),
                   index=pd.MultiIndex.from_product(
                       [a, b], 
                       names=['greek', 'latin']),
                   columns=['C1', 'C2', 'C3', 'C4'])
df2 = pd.DataFrame(
    np.array([[1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 2, 0, 4]]),
    index=pd.MultiIndex.from_product([a, c], names=['greek', 'foobar']),
    columns=['C1', 'C2', 'C3', 'C4'])

df3 = pd.DataFrame(
    columns=['greek', 'latin', 'foobar', 'C1', 'C2', 'C3', 'C4'])

for i in df1.index.get_level_values('greek').unique():
    for j in df1.loc[i].index.get_level_values('latin').unique():
        for k in df2.loc[i].index.get_level_values('foobar').unique():
            df3 = df3.append(pd.Series([i, j, k], 
                                       index=['greek', 'latin', 'foobar']
                                       ).append(
                df1.loc[i, j] * df2.loc[i, k]), ignore_index=True)

df3.set_index(['greek', 'latin', 'foobar'], inplace=True)
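
Note that `DataFrame.append` and `Series.append` were removed in pandas 2.0, so the loop above only runs on older versions. The following is a minimal sketch of the same row-by-row idea that avoids `append`; the names `cols`, `keys` and `values` are illustrative and not part of the original post, and it assumes every `(greek, latin)` and `(greek, foobar)` pair is unique in its frame.

```python
import numpy as np
import pandas as pd

a = ['alpha', 'beta']
b = ['A', 'B', 'C']
c = ['foo', 'bar']
cols = ['C1', 'C2', 'C3', 'C4']  # illustrative name for the data columns

df1 = pd.DataFrame(np.random.randn(6, 4),
                   index=pd.MultiIndex.from_product(
                       [a, b], names=['greek', 'latin']),
                   columns=cols)
df2 = pd.DataFrame(
    np.array([[1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 2, 0, 4]]),
    index=pd.MultiIndex.from_product([a, c], names=['greek', 'foobar']),
    columns=cols)

keys, values = [], []
for greek, latin in df1.index:
    # all 'foobar' labels that exist for this 'greek' value in df2
    for foobar in df2.loc[greek].index:
        keys.append((greek, latin, foobar))
        # element-wise product of one df1 row with one df2 row
        values.append(
            (df1.loc[(greek, latin)] * df2.loc[(greek, foobar)]).to_numpy())

# build the frame once at the end instead of growing it row by row
df3 = pd.DataFrame(
    np.vstack(values),
    index=pd.MultiIndex.from_tuples(
        keys, names=['greek', 'latin', 'foobar']),
    columns=cols)
```

Collecting the rows in lists and constructing the frame once also avoids copying the whole frame on every iteration, but it is still a Python-level loop and will be slow for large inputs.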

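As you can see, the code is very manual: the columns are defined by hand several times, and the index is only set at the very end. Here are the inputs and the output. They are correct and exactly what I want: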

df1:

                   C1        C2        C3        C4
 greek latin                                        
 alpha A      0.208380  0.856373 -1.041598  1.219707
       B      1.547903 -0.001023  0.918973  1.153554
       C      0.195868  2.772840  0.060960  0.311247
 beta  A      0.690405 -1.258012  0.118000 -0.346677
       B      0.488327 -1.206428  0.967658  1.198287
       C      0.420098 -0.165721  0.626893 -0.377909

df2:

               C1  C2  C3  C4
 greek foobar                
 alpha foo      1   0   1   0
       bar      1   1   1   1
 beta  foo      0   0   0   0
       bar      0   2   0   4

Result:

                           C1        C2        C3        C4
 greek latin foobar                                        
 alpha A     foo     0.208380  0.000000 -1.041598  0.000000
             bar     0.208380  0.856373 -1.041598  1.219707
       B     foo     1.547903 -0.000000  0.918973  0.000000
             bar     1.547903 -0.001023  0.918973  1.153554
       C     foo     0.195868  0.000000  0.060960  0.000000
             bar     0.195868  2.772840  0.060960  0.311247
 beta  A     foo     0.000000 -0.000000  0.000000 -0.000000
             bar     0.000000 -2.516025  0.000000 -1.386708
       B     foo     0.000000 -0.000000  0.000000  0.000000
             bar     0.000000 -2.412855  0.000000  4.793149
       C     foo     0.000000 -0.000000  0.000000 -0.000000
             bar     0.000000 -0.331443  0.000000 -1.511638

Thanks in advance!

【Question comments】:

【Answer 1】:

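Here is code that does this without for loops. The basic idea is to expand both matrices so that they have the same shape and can be multiplied element-wise, and then multiply them.

Code: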

# assumes df1, df2 and `columns` as defined in the test data below
# build an index from the three index columns
idx = [df1.index.get_level_values(col).unique() for col in columns[:2]
       ] + [df2.index.get_level_values(columns[2]).unique()]
size = [len(x) for x in idx]
index = pd.MultiIndex.from_product(idx, names=columns[:3])

# get the indices needed for df1 and df2
idx_a = np.indices((size[0] * size[1], size[2])).reshape(2, -1)
idx_b = np.indices((size[0], size[1] * size[2])).reshape(2, -1)
idx_1 = idx_a[0]
idx_2 = idx_a[1] + idx_b[0] * size[2]

# map the two frames into a multiply-able form
y1 = df1.values[idx_1, :]
y2 = df2.values[idx_2, :]

# multiply the two frames
df = pd.DataFrame(y1 * y2, index=index, columns=columns[3:])
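
To make the index arithmetic concrete, here is a small sketch that reproduces only the index arrays for the sizes used in this example (2 `greek` values, 3 `latin` values, 2 `foobar` values). The hard-coded `size` list is an assumption standing in for the lengths computed above.

```python
import numpy as np

size = [2, 3, 2]  # number of greek, latin and foobar labels

# row/column positions of a (6, 2) grid and a (2, 6) grid, flattened
idx_a = np.indices((size[0] * size[1], size[2])).reshape(2, -1)
idx_b = np.indices((size[0], size[1] * size[2])).reshape(2, -1)

# each df1 row is repeated once per foobar label
idx_1 = idx_a[0]
# df2 row = foobar position + greek position * number of foobar labels
idx_2 = idx_a[1] + idx_b[0] * size[2]

print(idx_1)  # [0 0 1 1 2 2 3 3 4 4 5 5]
print(idx_2)  # [0 1 0 1 0 1 2 3 2 3 2 3]
```

Row `n` of the result is therefore `df1.values[idx_1[n]] * df2.values[idx_2[n]]`. This only lines up when both indexes are full products stored in product order, which is the assumption this answer relies on.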

Test data:

import pandas as pd
import numpy as np

a = ['alpha', 'beta']
b = ['A', 'B', 'C']
c = ['foo', 'bar']
data_columns = ['C1', 'C2', 'C3', 'C4']
columns = ['greek', 'latin', 'foobar'] + data_columns

df1 = pd.DataFrame(np.random.randn(len(a) * len(b), len(data_columns)),
                   index=pd.MultiIndex.from_product(
                       [a,b], names=columns[0:2]),
                   columns=data_columns
                   )
df2 = pd.DataFrame(np.array([[1, 0, 1, 0],
                             [1, 1, 1, 1],
                             [0, 0, 0, 0],
                             [0, 2, 0, 4],
                             ]),
                   index=pd.MultiIndex.from_product(
                       [a, c],
                       names=[columns[0], columns[2]]),
                   columns=data_columns
                   )

Timing code:

def method1():
    df3 = pd.DataFrame(columns=columns)

    for i in df1.index.get_level_values('greek').unique():
        for j in df1.loc[i].index.get_level_values('latin').unique():
            for k in df2.loc[i].index.get_level_values('foobar').unique():
                df3 = df3.append(pd.Series(
                    [i, j, k],
                    index=columns[:3]).append(
                    df1.loc[i, j] * df2.loc[i, k]), ignore_index=True)
    df3.set_index(columns[:3], inplace=True)
    return df3

def method2():
    # build an index from the three index columns
    idx = [df1.index.get_level_values(col).unique() for col in columns[:2]
           ] + [df2.index.get_level_values(columns[2]).unique()]
    size = [len(x) for x in idx]
    index = pd.MultiIndex.from_product(idx, names=columns[:3])

    # get the indices needed for df1 and df2
    idx_a = np.indices((size[0] * size[1], size[2])).reshape(2, -1)
    idx_b = np.indices((size[0], size[1] * size[2])).reshape(2, -1)
    idx_1 = idx_a[0]
    idx_2 = idx_a[1] + idx_b[0] * size[2]

    # map the two frames into a multiply-able form
    y1 = df1.values[idx_1, :]
    y2 = df2.values[idx_2, :]

    # multiply the two frames
    df4 = pd.DataFrame(y1 * y2, index=index, columns=columns[3:])
    return df4

from timeit import timeit
print(timeit(method1, number=50))
print(timeit(method2, number=50))

Results (seconds for 50 runs of method1 and method2, respectively):

7.96668368373
0.149504332128

【Comments】:

【Answer 2】:

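I created the following solution, which seems to work and gives the correct result. Stephen's answer is still the fastest solution, but this one is close enough and has one big advantage: it works for arbitrary MultiIndexed frames, not only for frames whose index is a product of lists. That is the case I actually needed to solve, although the example I provided did not reflect it. Thanks to Stephen for an excellent and fast solution for that case; I certainly learned something from that code!

Code: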

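# move 'foobar' to the outer level so each label selects a frame indexed by 'greek'
dft = df2.swaplevel()
# sort_index replaces DataFrame.sortlevel (deprecated in pandas 0.20, later removed)
dft.sort_index(level=0, inplace=True)
foobar_keys = dft.index.get_level_values('foobar').unique()
df5 = pd.concat([df1 * dft.loc[i, :] for i in foobar_keys],
                keys=foobar_keys.tolist(), names=['foobar'])
df5 = df5.reorder_levels(['greek', 'latin', 'foobar'], axis=0)
df5.sort_index(level=0, inplace=True)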
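
To illustrate the "arbitrary MultiIndex" case, here is a minimal sketch on two small hypothetical frames whose indexes are not full products. It uses a different technique from the code above, a `merge` on the shared `greek` level, so it should be read as a cross-check under stated assumptions rather than as this answer's method. The frames, the `_1`/`_2` suffixes and the variable names are all made up for the example.

```python
import numpy as np
import pandas as pd

cols = ['C1', 'C2']

# hypothetical inputs: neither index is a full product of its levels
df1 = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    index=pd.MultiIndex.from_tuples(
        [('alpha', 'A'), ('alpha', 'B'), ('beta', 'A')],
        names=['greek', 'latin']),
    columns=cols)
df2 = pd.DataFrame(
    [[10, 0], [1, 1], [0, 2]],
    index=pd.MultiIndex.from_tuples(
        [('alpha', 'foo'), ('beta', 'foo'), ('beta', 'bar')],
        names=['greek', 'foobar']),
    columns=cols)

# pair every df1 row with every df2 row that has the same 'greek' value
merged = df1.reset_index().merge(
    df2.reset_index(), on='greek', suffixes=('_1', '_2'))

# multiply the matching data columns as plain arrays
left = merged[[name + '_1' for name in cols]].to_numpy()
right = merged[[name + '_2' for name in cols]].to_numpy()

result = pd.DataFrame(
    left * right,
    index=pd.MultiIndex.from_frame(merged[['greek', 'latin', 'foobar']]),
    columns=cols).sort_index()

print(result)
```

With these inputs the result has four rows, one per existing `(greek, latin, foobar)` combination: `alpha` only pairs with `foo`, while `beta` pairs with both `foo` and `bar`. The merge materializes all data columns twice, so it may use noticeably more memory than the `concat` approach on wide frames.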

Test data:

import pandas as pd
import numpy as np

a = ['alpha', 'beta']
b = ['A', 'B', 'C']
c = ['foo', 'bar']
data_columns = ['C1', 'C2', 'C3', 'C4']
columns = ['greek', 'latin', 'foobar'] + data_columns

df1 = pd.DataFrame(np.random.randn(len(a) * len(b), len(data_columns)),
                   index=pd.MultiIndex.from_product(
                       [a,b], names=columns[0:2]),
                   columns=data_columns
                   )
df2 = pd.DataFrame(np.array([[1, 0, 1, 0],
                             [1, 1, 1, 1],
                             [0, 0, 0, 0],
                             [0, 2, 0, 4],
                             ]),
                   index=pd.MultiIndex.from_product(
                       [a, c],
                       names=[columns[0], columns[2]]),
                   columns=data_columns
                   )

Timing code:

def method1():
    df3 = pd.DataFrame(columns=columns)

    for i in df1.index.get_level_values('greek').unique():
        for j in df1.loc[i].index.get_level_values('latin').unique():
            for k in df2.loc[i].index.get_level_values('foobar').unique():
                df3 = df3.append(pd.Series(
                    [i, j, k],
                    index=columns[:3]).append(
                    df1.loc[i, j] * df2.loc[i, k]), ignore_index=True)
    df3.set_index(columns[:3], inplace=True)
    return df3

def method2():
    # build an index from the three index columns
    idx = [df1.index.get_level_values(col).unique() for col in columns[:2]
           ] + [df2.index.get_level_values(columns[2]).unique()]
    size = [len(x) for x in idx]
    index = pd.MultiIndex.from_product(idx, names=columns[:3])

    # get the indices needed for df1 and df2
    idx_a = np.indices((size[0] * size[1], size[2])).reshape(2, -1)
    idx_b = np.indices((size[0], size[1] * size[2])).reshape(2, -1)
    idx_1 = idx_a[0]
    idx_2 = idx_a[1] + idx_b[0] * size[2]

    # map the two frames into a multiply-able form
    y1 = df1.values[idx_1, :]
    y2 = df2.values[idx_2, :]

    # multiply the two frames
    df4 = pd.DataFrame(y1 * y2, index=index, columns=columns[3:])
    return df4


def method3():
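    # move 'foobar' to the outer level so each label selects a frame indexed by 'greek'
    dft = df2.swaplevel()
    # sort_index replaces DataFrame.sortlevel (deprecated in pandas 0.20, later removed)
    dft.sort_index(level=0, inplace=True)
    foobar_keys = dft.index.get_level_values('foobar').unique()
    df5 = pd.concat([df1 * dft.loc[i, :] for i in foobar_keys],
                    keys=foobar_keys.tolist(), names=['foobar'])
    df5 = df5.reorder_levels(['greek', 'latin', 'foobar'], axis=0)
    df5.sort_index(level=0, inplace=True)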
    return df5


from timeit import timeit
print(timeit(method1, number=50))
print(timeit(method2, number=50))
print(timeit(method3, number=50))

Results (seconds for 50 runs of method1, method2 and method3, respectively):

4.089807642158121
0.12291539693251252
0.33667341712862253

【Comments】:
