Pandas - Groupby a multiindex level, get the possible combinations, then transform the data
【Posted】: 2018-07-24 21:00:36
【Question】: I have been struggling with a problem that involves grouping, combinations and transformation. My current attempt is:
df = df.groupby(level='lvl_2').transform(lambda x: x[0]/x[1])
But this does not solve part of my problem.
Assume the following code:
import pandas as pd
import numpy as np
import datetime
today = datetime.date.today()
today_1 = datetime.date.today() - datetime.timedelta(1)
today_2 = datetime.date.today() - datetime.timedelta(2)
ticker_date = [('first', 'a',today), ('first', 'a',today_1), ('first', 'a',today_2),
('first', 'c',today), ('first', 'c',today_1), ('first', 'c',today_2),
('first', 'b',today), ('first', 'b',today_1), ('first', 'b',today_2),
('first', 'd',today), ('first', 'd',today_1), ('first', 'd',today_2)]
index_df = pd.MultiIndex.from_tuples(ticker_date,names=['lvl_1','lvl_2','lvl_3'])
df = pd.DataFrame(np.random.rand(12), index_df, ['idx'])
The output is:
                             idx
lvl_1 lvl_2 lvl_3
first a     2018-02-14  0.421075
            2018-02-13  0.278418
            2018-02-12  0.117888
      c     2018-02-14  0.716823
            2018-02-13  0.241261
            2018-02-12  0.772491
      b     2018-02-14  0.681738
            2018-02-13  0.636927
            2018-02-12  0.668964
      d     2018-02-14  0.770797
            2018-02-13  0.11469
            2018-02-12  0.877965
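I need the following:
- Get a new multiindex DataFrame built from the possible combinations of the lvl_2 elements.
- Transform my data to get the ratio for each pair of elements.
Here is an illustration:
Here I have created a "new" column.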
                                 new
lvl_1 lvl_2 lvl_3
first a/c   2018-02-14  0.587418372
            2018-02-13  1.154011631
            2018-02-12  0.152607603
      a/b   2018-02-14  0.617649302
            2018-02-13  0.437127018
            2018-02-12  0.17622473
      a/d   2018-02-14  0.546285209
            2018-02-13  2.427569971
            2018-02-12  0.134274145
      c/b   2018-02-14  1.051464052
            2018-02-13  0.378789092
            2018-02-12  1.154757207
      c/d   2018-02-14  0.929976375
            2018-02-13  2.103592292
            2018-02-12  0.87986537
      b/d   2018-02-14  0.884458554
            2018-02-13  5.553465865
            2018-02-12  0.761948369
Further explanation:
                                 new
lvl_1 lvl_2 lvl_3
first a/c   2018-02-14  0.587418372
            2018-02-13  1.154011631
            2018-02-12  0.152607603
Here I take the ratio of the elements of a to the elements of c, date by date:
0.587418 = 0.421075/0.716823
1.154012 = 0.278418/0.241261
0.152608 = 0.117888/0.772491
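The arithmetic above can be checked with a small sketch. It assumes only the a and c rows from the printed output (typed in by hand, with dates kept as plain strings), and the names `a`, `c` and `ratio_ac` are illustrative. It is one possible way to express a single pairwise ratio, not the full solution.

```python
import pandas as pd

# Values copied by hand from the printed output above (levels a and c only).
dates = ["2018-02-14", "2018-02-13", "2018-02-12"]
index = pd.MultiIndex.from_product(
    [["first"], ["a", "c"], dates], names=["lvl_1", "lvl_2", "lvl_3"]
)
df = pd.DataFrame(
    {"idx": [0.421075, 0.278418, 0.117888, 0.716823, 0.241261, 0.772491]},
    index=index,
)

# xs() drops the selected level, so both slices are indexed by
# (lvl_1, lvl_3) and the division lines up date by date.
a = df["idx"].xs("a", level="lvl_2")
c = df["idx"].xs("c", level="lvl_2")
ratio_ac = a / c
print(ratio_ac.round(6))
```

Because both slices share the same remaining index, no explicit loop over dates is needed.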
I tried the groupby and transform methods, for example:
df = df.groupby(level='lvl_2').transform(lambda x: x[0]/x[1])
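But obviously this only divides the first value by the second value within each lvl_2 group, rather than dividing one group by another. Also, I don't know how to build the new multiindex from the combinations (a/c, a/b, a/d, c/b, c/d, b/d).
I feel I am on the right track, but I am stuck.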
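The list of pair labels itself can be sketched with `itertools.combinations`. The index below is a shortened stand-in for the one in the question (two dates instead of three), and `labels` and `pairs` are illustrative names. Note the assumption about ordering: taking `unique()` of the level values keeps the order of appearance, whereas `index.levels[1]` is stored sorted.

```python
from itertools import combinations

import pandas as pd

# Shortened stand-in for the question's index (same lvl_2 order: a, c, b, d).
index = pd.MultiIndex.from_product(
    [["first"], ["a", "c", "b", "d"], ["2018-02-14", "2018-02-13"]],
    names=["lvl_1", "lvl_2", "lvl_3"],
)

# unique() keeps the order in which labels first appear in the index.
labels = index.get_level_values("lvl_2").unique()
pairs = [f"{i}/{j}" for i, j in combinations(labels, 2)]
print(pairs)

# The stored level itself is sorted, which gives a different pairing order.
print(list(index.levels[1]))
```

With the order of appearance the pairs come out as in the question (c/b rather than b/c); with the sorted level they come out as a/b, a/c, a/d, b/c, b/d, c/d.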
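【Comments】:
P.S. Next time please provide a fixed random seed (np.random.seed = x), so that we can all work with the same data and compare results.
Thanks @jezrael for pointing out that it should be np.random.seed(x).
Indeed! I had not thought of that. Thanks for pointing it out.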
【Answer 1】:
If the first level has the same combinations with the other levels as in the sample, you can reindex the columns by a MultiIndex and then use div:
#same seed as in Maarten Fabré's answer (set it before building df)
np.random.seed(42)
from itertools import combinations
#get combination of second level values
c = pd.MultiIndex.from_tuples(list(combinations(df.index.levels[1], 2)))
#reshape to unique columns of second level
print (df['idx'].unstack(1))
lvl_2                    a         b         c         d
lvl_1 lvl_3
first 2018-02-12  0.731994  0.601115  0.155995  0.969910
      2018-02-13  0.950714  0.866176  0.156019  0.020584
      2018-02-14  0.374540  0.058084  0.598658  0.708073
#reindex by both levels
df1 = df['idx'].unstack(1).reindex(columns=c, level=0)
print (df1)
                         a                             b                   c
                         b         c         d         c         d         d
lvl_1 lvl_3
first 2018-02-12  0.731994  0.731994  0.731994  0.601115  0.601115  0.155995
      2018-02-13  0.950714  0.950714  0.950714  0.866176  0.866176  0.156019
      2018-02-14  0.374540  0.374540  0.374540  0.058084  0.058084  0.598658
df2 = df['idx'].unstack(1).reindex(columns=c, level=1)
print (df2)
                         a                             b                   c
                         b         c         d         c         d         d
lvl_1 lvl_3
first 2018-02-12  0.601115  0.155995  0.969910  0.155995  0.969910  0.969910
      2018-02-13  0.866176  0.156019  0.020584  0.156019  0.020584  0.020584
      2018-02-14  0.058084  0.598658  0.708073  0.598658  0.708073  0.708073
#divide, then flatten the MultiIndex columns into 'x/y' labels
df3 = df1.div(df2)
df3.columns = df3.columns.map('/'.join)
#reshape back and change order of levels, sorting indices
df3 = df3.stack().reorder_levels([0,2,1]).sort_index()
print (df3)
lvl_1 lvl_3
first a/b 2018-02-12 1.217727
2018-02-13 1.097599
2018-02-14 6.448292
a/c 2018-02-12 4.692434
2018-02-13 6.093594
2018-02-14 0.625632
a/d 2018-02-12 0.754703
2018-02-13 46.185944
2018-02-14 0.528957
b/c 2018-02-12 3.853437
2018-02-13 5.551748
2018-02-14 0.097023
b/d 2018-02-12 0.619764
2018-02-13 42.079059
2018-02-14 0.082031
c/d 2018-02-12 0.160834
2018-02-13 7.579425
2018-02-14 0.845476
dtype: float64
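A variation on the same wide-table idea, sketched under stated assumptions: the frame is rebuilt here with `MultiIndex.from_product` and fixed dates rather than `datetime.date.today()`, and `wide`, `num`, `den`, `ratios` and `long` are illustrative names. Instead of `reindex(level=)`, this version selects the numerator and denominator columns by label lists and divides the underlying arrays, so it is an alternative to the approach above rather than a restatement of it.

```python
from itertools import combinations

import numpy as np
import pandas as pd

np.random.seed(42)  # seed before generating the data
dates = pd.to_datetime(["2018-02-14", "2018-02-13", "2018-02-12"])
index = pd.MultiIndex.from_product(
    [["first"], ["a", "c", "b", "d"], dates],
    names=["lvl_1", "lvl_2", "lvl_3"],
)
df = pd.DataFrame(np.random.rand(12), index=index, columns=["idx"])

# One column per lvl_2 label, rows indexed by (lvl_1, lvl_3).
wide = df["idx"].unstack("lvl_2")

# All unordered pairs of column labels.
pairs = list(combinations(wide.columns, 2))
num = [i for i, _ in pairs]  # numerator label of each pair
den = [j for _, j in pairs]  # denominator label of each pair

# Divide positionally: k-th numerator column by k-th denominator column.
ratios = pd.DataFrame(
    wide[num].to_numpy() / wide[den].to_numpy(),
    index=wide.index,
    columns=[f"{i}/{j}" for i, j in pairs],
)
ratios.columns.name = "lvl_2"

# Back to long form with the level order used in the question.
long = ratios.stack().reorder_levels(["lvl_1", "lvl_2", "lvl_3"]).sort_index()
print(long)
```

Dividing NumPy arrays sidesteps label alignment entirely, which is why the column labels are attached afterwards; the trade-off is that the row order of `wide` has to be trusted on both sides, which holds here because both operands come from the same frame.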
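【Comments】:
I like both your answer and Maarten's. I mixed the two, and I got it working. Love the unstack/stack approach! Thanks.
Brilliant trick with reindex(level=). I had thought of stack/unstack,
@MaartenFabré - Thank you. Also for np.random.seed(42).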
【Answer 2】:
from itertools import combinations

def calc_ratios(data):
    # all pairs of lvl_2 labels, in order of appearance
    comb = combinations(data.index.get_level_values('lvl_2').unique(), 2)
    # map 'i/j' -> element-wise ratio of group i to group j
    ratios = {
        f'{i}/{j}':
            data.xs(i, level='lvl_2') /
            data.xs(j, level='lvl_2')
        for i, j in comb
    }
    # print(ratios)
    if ratios:
        return pd.concat(ratios)

result = pd.concat(calc_ratios(data) for group, data in df.groupby('lvl_1'))
     lvl_1  lvl_3       idx
a/b  first  2018-02-14  6.448292467809392
a/b  first  2018-02-13  1.0975992712883451
a/b  first  2018-02-12  1.2177269366284045
a/c  first  2018-02-14  0.6256323575698127
a/c  first  2018-02-13  6.093594353302192
a/c  first  2018-02-12  4.692433684425558
a/d  first  2018-02-14  0.5289572433565499
a/d  first  2018-02-13  46.185944271838835
a/d  first  2018-02-12  0.7547030687230791
b/d  first  2018-02-14  0.08203059119870332
b/d  first  2018-02-13  42.07905879677424
b/d  first  2018-02-12  0.6197637959891664
c/b  first  2018-02-14  10.306839775450461
c/b  first  2018-02-13  0.18012345549282302
c/b  first  2018-02-12  0.25950860865015657
c/d  first  2018-02-14  0.8454761601705119
c/d  first  2018-02-13  7.579425474360648
c/d  first  2018-02-12  0.16083404038888807
(data generated with np.random.seed(42))
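In the result above the pair labels sit in an unnamed outer level, ahead of lvl_1. A possible follow-up, sketched here with a rebuilt frame (fixed dates instead of `datetime.date.today()`), names that level and moves it into the lvl_2 position the question asked for. The `names=` argument to `pd.concat`, the rename of the column to `new`, and the final sort are additions for illustration, not part of the answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd

np.random.seed(42)  # seed before generating the data
dates = pd.to_datetime(["2018-02-14", "2018-02-13", "2018-02-12"])
index = pd.MultiIndex.from_product(
    [["first"], ["a", "c", "b", "d"], dates],
    names=["lvl_1", "lvl_2", "lvl_3"],
)
df = pd.DataFrame(np.random.rand(12), index=index, columns=["idx"])


def calc_ratios(data):
    labels = data.index.get_level_values("lvl_2").unique()
    ratios = {
        f"{i}/{j}": data.xs(i, level="lvl_2") / data.xs(j, level="lvl_2")
        for i, j in combinations(labels, 2)
    }
    # names= labels the new outer level that concat builds from the dict keys.
    return pd.concat(ratios, names=["lvl_2"]) if ratios else None


result = pd.concat([calc_ratios(g) for _, g in df.groupby(level="lvl_1")])

# Put the pair label back in the middle and rename the value column.
result = (
    result.rename(columns={"idx": "new"})
    .reorder_levels(["lvl_1", "lvl_2", "lvl_3"])
    .sort_index()
)
print(result)
```

Sorting the index afterwards is optional; it makes later `.loc` lookups on the MultiIndex cheaper but reorders the pairs alphabetically.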
【Comments】:
So how about using np.random.seed() in your answer? ;)
Thanks Maarten! Very useful!