Python Pandas: wide to long format, but different - similar to reversing dummy columns
download the data from this link
   Product | Price | CS_Medium | CS_Small | SC_A | SC_B | SC_C
0  R123    | 1.18  | 0.15      |          |      |      | 0.38
1  R234    | 0.23  |           | (0.03)   | 0.04 |      | 0.05
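SUM_values is the sum of all values for a given combination of CS and SC.
I have spent 1.5 days on this and cannot get the conversion to work. I tried stack, transpose and groupby, but nothing worked. I started coding 10 days ago and am new to it, so please help. Please see the picture; I could not paste the table properly into the text area.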
Product CS SC Price SUM_values
0 R123 Medium A 1.18 0.15
1 R123 Medium B 1.18 0.15
2 R123 Medium C 1.18 0.54
3 R123 Small A 1.18 -
4 R123 Small B 1.18 -
5 R123 Small C 1.18 0.38
6 R234 Medium A 0.23 0.04
7 R234 Medium B 0.23 -
8 R234 Medium C 0.23 0.05
9 R234 Small A 0.23 0.01
10 R234 Small B 0.23 (0.03)
11 R234 Small C 0.23 0.05
Answer
Option 1
Less obvious, but with no hard-coded values.
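import pandas as pd
from itertools import product

# Index by Product and set Price aside so only the CS_*/SC_* columns remain.
d_ = df.set_index('Product')
prc = d_.pop('Price')
# 'CS_Medium' -> ('CS', 'Medium'), and so on.
d_.columns = d_.columns.str.split('_', expand=True)
c = d_.columns
l0 = c.levels[0]
l1 = c.levels[1]
b0 = c.codes[0]  # named .labels before pandas 0.24
b1 = c.codes[1]
r0 = range(len(l0))
# Every (CS, SC) combination, e.g. ('Medium', 'A').
ptups = list(product(*(l1[b1][b0 == i] for i in r0)))
midx = pd.MultiIndex.from_tuples(
    [(x,) + t for x in l0 for t in ptups],
    names=['key'] + l0.tolist()
)
n = midx.nlevels
# Repeat each source column once per combination it takes part in.
_d = d_[[(x0, x1) for x0, y1 in zip(l0, zip(*ptups)) for x1 in y1]]
_d.columns = midx
_d = _d.stack(list(range(1, n)), dropna=False)
# Sum the CS and SC parts, keeping NaN where both are missing.
_d.fillna(0).sum(axis=1).where(_d.notna().any(axis=1)).reset_index(name='SUM_values')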
Product CS SC SUM_values
0 R123 Medium A 0.15
1 R123 Medium B 0.15
2 R123 Medium C 0.53
3 R123 Small A NaN
4 R123 Small B NaN
5 R123 Small C 0.38
6 R234 Medium A 0.04
7 R234 Medium B NaN
8 R234 Medium C 0.05
9 R234 Small A 0.01
10 R234 Small B -0.03
11 R234 Small C 0.02
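To make the first step of Option 1 easier to follow, here is a minimal sketch of what splitting the column names does. The sample frame is rebuilt from the question's table (blank cells are assumed to be NaN, and "(0.03)" is assumed to mean -0.03); the `groups` dictionary is only an illustration and is not part of the answer's code.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question's data (assumption: blanks are NaN).
df = pd.DataFrame({
    'Product': ['R123', 'R234'],
    'Price': [1.18, 0.23],
    'CS_Medium': [0.15, np.nan],
    'CS_Small': [np.nan, -0.03],
    'SC_A': [np.nan, 0.04],
    'SC_B': [np.nan, np.nan],
    'SC_C': [0.38, 0.05],
})

d_ = df.set_index('Product').drop(columns='Price')
# Splitting on '_' with expand=True turns the flat names into a
# two-level MultiIndex: ('CS', 'Medium'), ('CS', 'Small'), ('SC', 'A'), ...
d_.columns = d_.columns.str.split('_', expand=True)

# Collect the second-level names under each prefix, in column order.
groups = {}
for prefix, suffix in d_.columns:
    groups.setdefault(prefix, []).append(suffix)

print(groups)
```

The rest of Option 1 builds the Cartesian product of these two lists, which is where the six CS/SC combinations per product come from.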
Option 2
Using defaultdict and a for loop
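import pandas as pd
from collections import defaultdict

# Map each prefix to its suffixes: {'CS': ['Medium', 'Small'], 'SC': ['A', 'B', 'C']}
d = defaultdict(list)
for c in df.columns:
    k, *v = c.split('_')
    if v:
        d[k].append(v[0])

# One output row per (Product, CS, SC) combination.
pd.DataFrame([
    [row.Product, c, s, row.Price, row[f'CS_{c}'], row[f'SC_{s}']]
    for i, row in df.iterrows()
    for c in d['CS'] for s in d['SC']
], columns='Product CS SC Price CS_v SC_v'.split()).assign(
    # fill_value=0 treats a single missing side as 0; both missing stays NaN.
    SUM_values=lambda d: d.CS_v.add(d.SC_v, fill_value=0)
).drop(columns=['CS_v', 'SC_v'])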
Product CS SC Price SUM_values
0 R123 Medium A 1.18 0.15
1 R123 Medium B 1.18 0.15
2 R123 Medium C 1.18 0.53
3 R123 Small A 1.18 NaN
4 R123 Small B 1.18 NaN
5 R123 Small C 1.18 0.38
6 R234 Medium A 0.23 0.04
7 R234 Medium B 0.23 NaN
8 R234 Medium C 0.23 0.05
9 R234 Small A 0.23 0.01
10 R234 Small B 0.23 -0.03
11 R234 Small C 0.23 0.02
Option 3
Using defaultdict, itertools.product and lookup
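import pandas as pd
from itertools import product
from collections import defaultdict

# Map each prefix to its suffixes: {'CS': [...], 'SC': [...]}
d = defaultdict(list)
for c in df.columns:
    k, *v = c.split('_')
    if v:
        d[k].append(v[0])

# Put the product names first so they vary slowest in the Cartesian product.
d = {**df[['Product']].to_dict('list'), **d}
d_ = df.set_index('Product')
ndf = pd.DataFrame(dict(zip(d.keys(), zip(*product(*d.values())))))
# DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0,
# so these two lines need an older pandas.
cs = pd.Series(d_.lookup(ndf.Product, ndf.CS.radd('CS_')), ndf.index)
sc = pd.Series(d_.lookup(ndf.Product, ndf.SC.radd('SC_')), ndf.index)
ndf['SUM_values'] = cs.add(sc, fill_value=0)
ndf[['Product', 'CS', 'SC', 'SUM_values']]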
Product CS SC SUM_values
0 R123 Medium A 0.15
1 R123 Medium B 0.15
2 R123 Medium C 0.53
3 R123 Small A NaN
4 R123 Small B NaN
5 R123 Small C 0.38
6 R234 Medium A 0.04
7 R234 Medium B NaN
8 R234 Medium C 0.05
9 R234 Small A 0.01
10 R234 Small B -0.03
11 R234 Small C 0.02
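`DataFrame.lookup` is no longer available in pandas 2.x. One possible stand-in, sketched here under that assumption, translates the labels to positions and indexes the underlying array. The helper name `pairwise_lookup` is hypothetical and the small frame is illustrative only.

```python
import numpy as np
import pandas as pd

def pairwise_lookup(frame, row_labels, col_labels):
    """Hypothetical helper: return frame.loc[r, c] for each (r, c) pair.

    Assumes every label exists and that index and columns are unique;
    get_indexer returns -1 for unknown labels, which would silently
    pick the last row or column.
    """
    rows = frame.index.get_indexer(row_labels)
    cols = frame.columns.get_indexer(col_labels)
    return frame.to_numpy()[rows, cols]

d_ = pd.DataFrame(
    {'CS_Medium': [0.15, np.nan], 'CS_Small': [np.nan, -0.03]},
    index=pd.Index(['R123', 'R234'], name='Product'),
)

values = pairwise_lookup(
    d_,
    ['R123', 'R234', 'R234'],
    ['CS_Medium', 'CS_Small', 'CS_Medium'],
)
print(values)
```

In Option 3 the two `d_.lookup(...)` calls could be swapped for calls to a helper like this, passing the same arguments.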
Another answer
OK, you can do it like this:
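import numpy as np
import pandas as pd

df = pd.DataFrame({'Product': ['R123', 'R234'],
                   'Price': [1.18, 0.23],
                   'CS_Medium': [.15, np.nan],
                   'CS_Small': [np.nan, -0.03],
                   'SC_A': [np.nan, 0.04],
                   'SC_B': [np.nan, np.nan],
                   'SC_C': [0.38, 0.05]})

# Keep only the part after the underscore: Medium, Small, A, B, C.
df.columns = df.columns.str.split('_').str[-1]

# Melt the SC columns first, then the CS columns, and sum the two value columns.
(df.melt(['Product', 'Medium', 'Small', 'Price'], value_name='Values_1', var_name='SC')
   .melt(['Product', 'SC', 'Price', 'Values_1'], value_name='Values_2', var_name='CS')
   .set_index(['Product', 'CS', 'SC', 'Price'])
   .sum(axis=1, min_count=1)  # min_count=1 keeps NaN when both values are missing
   .reset_index(name='SUM_values')
   .sort_values(by=['Product', 'CS', 'SC']))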
Output:
Product CS SC Price SUM_values
0 R123 Medium A 1.18 0.15
2 R123 Medium B 1.18 0.15
4 R123 Medium C 1.18 0.53
6 R123 Small A 1.18 NaN
8 R123 Small B 1.18 NaN
10 R123 Small C 1.18 0.38
1 R234 Medium A 0.23 0.04
3 R234 Medium B 0.23 NaN
5 R234 Medium C 0.23 0.05
7 R234 Small A 0.23 0.01
9 R234 Small B 0.23 -0.03
11 R234 Small C 0.23 0.02
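One detail worth checking in the melt approach is how a row-wise sum treats rows where both values are missing. A short sketch with illustrative values (the column names follow the answer, the data is made up):

```python
import numpy as np
import pandas as pd

# Illustrative values only: first row is missing on both sides.
vals = pd.DataFrame({
    'Values_1': [np.nan, 0.38, 0.04],
    'Values_2': [np.nan, 0.15, np.nan],
})

# By default NaN is skipped, so an all-NaN row sums to 0.0.
default_sum = vals.sum(axis=1)

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN.
strict_sum = vals.sum(axis=1, min_count=1)

print(default_sum.tolist())
print(strict_sum.tolist())
```

So if the NaN rows shown in the output are wanted, rather than zeros, the sum needs `min_count=1`.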
Another answer
I am using wide_to_long
l = ['Product', 'Price']
s1 = l + df.columns[df.columns.str.startswith('SC')].tolist()
s2 = l + df.columns[df.columns.str.startswith('CS')].tolist()
# Raw strings avoid the invalid-escape warning for '\w'.
v1 = pd.wide_to_long(df[s1], ['SC'], i=['Product', 'Price'], j='SCKey', sep='_', suffix=r'\w+').reset_index(level=2)
v2 = pd.wide_to_long(df[s2], ['CS'], i=['Product', 'Price'], j='CSKey', sep='_', suffix=r'\w+').reset_index(level=2)
# Joining on the shared (Product, Price) index pairs every SC row with every CS row.
v = v1.join(v2, how='outer').reset_index()
v.assign(SUM_values=v.SC.add(v.CS, fill_value=0))
Out[66]:
Product Price SCKey SC CSKey CS SUM_values
0 R123 1.18 A NaN Medium 0.15 0.15
1 R123 1.18 A NaN Small NaN NaN
2 R123 1.18 B NaN Medium 0.15 0.15
3 R123 1.18 B NaN Small NaN NaN
4 R123 1.18 C 0.38 Medium 0.15 0.53
5 R123 1.18 C 0.38 Small NaN 0.38
6 R234 0.23 A 0.04 Medium NaN 0.04
7 R234 0.23 A 0.04 Small -0.03 0.01
8 R234 0.23 B NaN Medium NaN NaN
9 R234 0.23 B NaN Small -0.03 -0.03
10 R234 0.23 C 0.05 Medium NaN 0.05
11 R234 0.23 C 0.05 Small -0.03 0.02
Details:
v1
Out[38]:
SCKey SC
Product Price
R123 1.18 A NaN
1.18 B NaN
1.18 C 0.38
R234 0.23 A 0.04
0.23 B NaN
0.23 C 0.05
v2
Out[39]:
CSKey CS
Product Price
R123 1.18 Medium 0.15
1.18 Small NaN
R234 0.23 Medium NaN
0.23 Small -0.03