Python Pandas: wide format to long format, but different - similar to reversing dummy columns

download the data from this link

input table

   Product  | Price | CS_Medium | CS_Small | SC_A | SC_B |   SC_C
0   R123    |  1.18 |   0.15    |          |      |      |   0.38
1   R234    |  0.23 |           |  (0.03)  | 0.04 |      |    0.05 

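SUM_values is the sum of the values for a given combination of CS and SC.

I have spent 1.5 days on this and could not convert it. I tried stack, transpose and groupby, but nothing worked. I started coding 10 days ago and am new to this, so please help. Please see the picture; I could not paste the table properly into the text area.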

output table

   Product      CS SC  Price SUM_values
0     R123  Medium  A   1.18       0.15
1     R123  Medium  B   1.18       0.15
2     R123  Medium  C   1.18       0.54
3     R123   Small  A   1.18          -
4     R123   Small  B   1.18          -
5     R123   Small  C   1.18       0.38
6     R234  Medium  A   0.23       0.04
7     R234  Medium  B   0.23          -
8     R234  Medium  C   0.23       0.05
9     R234   Small  A   0.23       0.01
10    R234   Small  B   0.23     (0.03)
11    R234   Small  C   0.23       0.05
Answer

Option 1

Less obvious, but with no hard-coded values.

import pandas as pd
from itertools import product

d_ = df.set_index('Product')
prc = d_.pop('Price')

d_.columns = d_.columns.str.split('_', expand=True)

c = d_.columns
l0 = c.levels[0]
l1 = c.levels[1]
# MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24
b0 = c.codes[0]
b1 = c.codes[1]

r0 = range(len(l0))
ptups = list(product(*(l1[b1][b0 == i] for i in r0)))

midx = pd.MultiIndex.from_tuples(
    [(x,) + t for x in l0 for t in ptups],
    names=['key'] + l0.tolist()
)
n = midx.nlevels

_d = d_[[(x0, x1) for x0, y1 in zip(l0, zip(*ptups)) for x1 in y1]]
_d.columns = midx
_d = _d.stack(list(range(1, n)), dropna=False)

# Sum the CS and SC values per row, keeping NaN where both are missing
_d.fillna(0).sum(axis=1).where(_d.notna().any(axis=1)).reset_index(name='SUM_values')

   Product      CS SC  SUM_values
0     R123  Medium  A        0.15
1     R123  Medium  B        0.15
2     R123  Medium  C        0.53
3     R123   Small  A         NaN
4     R123   Small  B         NaN
5     R123   Small  C        0.38
6     R234  Medium  A        0.04
7     R234  Medium  B         NaN
8     R234  Medium  C        0.05
9     R234   Small  A        0.01
10    R234   Small  B       -0.03
11    R234   Small  C        0.02
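
Note that the result above has no Price column, even though the prices were set aside in prc. The following is a minimal sketch of one way to attach them again, assuming the long result is held in a DataFrame. The name res and its three sample rows are hypothetical stand-ins, not part of the answer.

```python
import pandas as pd

# Price per product, shaped like the Series that d_.pop('Price') returns above.
prc = pd.Series([1.18, 0.23],
                index=pd.Index(['R123', 'R234'], name='Product'),
                name='Price')

# Hypothetical stand-in for the long result: three of the twelve rows.
res = pd.DataFrame({
    'Product': ['R123', 'R123', 'R234'],
    'CS': ['Medium', 'Medium', 'Small'],
    'SC': ['A', 'C', 'B'],
    'SUM_values': [0.15, 0.53, -0.03],
})

# Look up each row's price by its Product label and place it before SUM_values.
res.insert(3, 'Price', res['Product'].map(prc))
print(res)
```

Using map keeps the row order of the long result untouched, which a merge would not guarantee in every case.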

Option 2

Using a defaultdict and a for loop

import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for c in df.columns:
    k, *v = c.split('_')
    if v:
        d[k].append(v[0])

pd.DataFrame([
    [row.Product, c, s, row.Price, row[f'CS_{c}'], row[f'SC_{s}']]
    for i, row in df.iterrows()
    for c in d['CS'] for s in d['SC']
], columns='Product CS SC Price CS_v SC_v'.split()).assign(
    SUM_values=lambda d: d.CS_v.add(d.SC_v, fill_value=0)
).drop(columns=['CS_v', 'SC_v'])

   Product      CS SC  Price  SUM_values
0     R123  Medium  A   1.18        0.15
1     R123  Medium  B   1.18        0.15
2     R123  Medium  C   1.18        0.53
3     R123   Small  A   1.18         NaN
4     R123   Small  B   1.18         NaN
5     R123   Small  C   1.18        0.38
6     R234  Medium  A   0.23        0.04
7     R234  Medium  B   0.23         NaN
8     R234  Medium  C   0.23        0.05
9     R234   Small  A   0.23        0.01
10    R234   Small  B   0.23       -0.03
11    R234   Small  C   0.23        0.02

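Option 3

Using defaultdict, itertools.product and lookup. Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so this option only runs on older versions.

import pandas as pd
from itertools import product
from collections import defaultdict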

d = defaultdict(list)
for c in df.columns:
    k, *v = c.split('_')
    if v:
        d[k].append(v[0])

d = {**df[['Product']].to_dict('list'), **d}

d_ = df.set_index('Product')

ndf = pd.DataFrame(dict(zip(d.keys(), zip(*product(*d.values())))))

cs = pd.Series(d_.lookup(ndf.Product, ndf.CS.radd('CS_')), ndf.index)
sc = pd.Series(d_.lookup(ndf.Product, ndf.SC.radd('SC_')), ndf.index)

ndf['SUM_values'] = cs.add(sc, fill_value=0)
ndf[['Product', 'CS', 'SC', 'SUM_values']]

   Product      CS SC  SUM_values
0     R123  Medium  A        0.15
1     R123  Medium  B        0.15
2     R123  Medium  C        0.53
3     R123   Small  A         NaN
4     R123   Small  B         NaN
5     R123   Small  C        0.38
6     R234  Medium  A        0.04
7     R234  Medium  B         NaN
8     R234  Medium  C        0.05
9     R234   Small  A        0.01
10    R234   Small  B       -0.03
11    R234   Small  C        0.02
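
For pandas versions without DataFrame.lookup, the same label-based lookup can be approximated with NumPy integer indexing. This is a sketch under stated assumptions, not the answerer's code: the helper name lookup is hypothetical, the CS and SC labels are hard-coded for brevity, and all value columns are assumed to be numeric.

```python
import numpy as np
import pandas as pd
from itertools import product

df = pd.DataFrame({
    'Product': ['R123', 'R234'],
    'Price': [1.18, 0.23],
    'CS_Medium': [0.15, np.nan],
    'CS_Small': [np.nan, -0.03],
    'SC_A': [np.nan, 0.04],
    'SC_B': [np.nan, np.nan],
    'SC_C': [0.38, 0.05],
})

d_ = df.set_index('Product')

# Every (Product, CS, SC) combination; the labels are hard-coded in this sketch.
ndf = pd.DataFrame(
    list(product(df['Product'], ['Medium', 'Small'], ['A', 'B', 'C'])),
    columns=['Product', 'CS', 'SC'],
)

def lookup(frame, row_labels, col_labels):
    """Hypothetical stand-in for the removed DataFrame.lookup."""
    rows = frame.index.get_indexer(pd.Index(row_labels))
    cols = frame.columns.get_indexer(pd.Index(col_labels))
    # Pairwise positional indexing: one value per (row, column) pair.
    return frame.to_numpy()[rows, cols]

cs = pd.Series(lookup(d_, ndf['Product'], 'CS_' + ndf['CS']), index=ndf.index)
sc = pd.Series(lookup(d_, ndf['Product'], 'SC_' + ndf['SC']), index=ndf.index)

# fill_value=0 treats a single missing value as 0; two missing values stay NaN.
ndf['SUM_values'] = cs.add(sc, fill_value=0)
print(ndf)
```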
Another answer

OK, you can do it like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Product':['R123','R234'],
                        'Price':[1.18,0.23],
                        'CS_Medium':[.15, np.nan],
                        'CS_Small':[np.nan, -0.03],
                        'SC_A':[np.nan,0.04],
                        'SC_B':[np.nan,np.nan],
                        'SC_C':[0.38,0.05]})

df.columns = df.columns.str.split('_').str[-1]

(df.melt(['Product','Medium','Small','Price'],value_name='Values_1', var_name='SC')
  .melt(['Product','SC','Price','Values_1'],value_name='Values_2',var_name='CS')
  .set_index(['Product','CS','SC','Price'])
  .sum(axis=1, min_count=1)  # min_count=1 keeps NaN when both values are missing
  .reset_index(name='SUM_values')
  .sort_values(by=['Product','CS','SC']))

Output:

   Product      CS SC  Price  SUM_values
0     R123  Medium  A   1.18        0.15
2     R123  Medium  B   1.18        0.15
4     R123  Medium  C   1.18        0.53
6     R123   Small  A   1.18         NaN
8     R123   Small  B   1.18         NaN
10    R123   Small  C   1.18        0.38
1     R234  Medium  A   0.23        0.04
3     R234  Medium  B   0.23         NaN
5     R234  Medium  C   0.23        0.05
7     R234   Small  A   0.23        0.01
9     R234   Small  B   0.23       -0.03
11    R234   Small  C   0.23        0.02
Another answer

I am using wide_to_long

l=['Product','Price']

s1=l+df.columns[df.columns.str.startswith('SC')].tolist()
s2=l+df.columns[df.columns.str.startswith('CS')].tolist()


v1=pd.wide_to_long(df[s1],['SC'],i=['Product','Price'],j='SCKey',sep='_',suffix=r'\w+').reset_index(level=2)

v2=pd.wide_to_long(df[s2],['CS'],i=['Product','Price'],j='CSKey',sep='_',suffix=r'\w+').reset_index(level=2)

v=v1.join(v2,how='outer').reset_index()
v.assign(SUM_values=v.SC.add(v.CS,fill_value=0))
Out[66]: 
   Product  Price SCKey    SC   CSKey    CS  SUM_values
0     R123   1.18     A   NaN  Medium  0.15        0.15
1     R123   1.18     A   NaN   Small   NaN         NaN
2     R123   1.18     B   NaN  Medium  0.15        0.15
3     R123   1.18     B   NaN   Small   NaN         NaN
4     R123   1.18     C  0.38  Medium  0.15        0.53
5     R123   1.18     C  0.38   Small   NaN        0.38
6     R234   0.23     A  0.04  Medium   NaN        0.04
7     R234   0.23     A  0.04   Small -0.03        0.01
8     R234   0.23     B   NaN  Medium   NaN         NaN
9     R234   0.23     B   NaN   Small -0.03       -0.03
10    R234   0.23     C  0.05  Medium   NaN        0.05
11    R234   0.23     C  0.05   Small -0.03        0.02

Details:

v1
Out[38]: 
              SCKey    SC
Product Price            
R123    1.18      A   NaN
        1.18      B   NaN
        1.18      C  0.38
R234    0.23      A  0.04
        0.23      B   NaN
        0.23      C  0.05
v2
Out[39]: 
                CSKey    CS
Product Price              
R123    1.18   Medium  0.15
        1.18    Small   NaN
R234    0.23   Medium   NaN
        0.23    Small -0.03
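
The same pairing of every CS value with every SC value per product can also be sketched with two melt calls and a merge on Product. This is an illustrative variant rather than part of the answer above; the intermediate names cs, sc and out, and the helper columns CS_v and SC_v, are chosen here for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product': ['R123', 'R234'],
    'Price': [1.18, 0.23],
    'CS_Medium': [0.15, np.nan],
    'CS_Small': [np.nan, -0.03],
    'SC_A': [np.nan, 0.04],
    'SC_B': [np.nan, np.nan],
    'SC_C': [0.38, 0.05],
})

cs_cols = [c for c in df.columns if c.startswith('CS_')]
sc_cols = [c for c in df.columns if c.startswith('SC_')]

# One long table per prefix; Price travels with the CS table only.
cs = df.melt(id_vars=['Product', 'Price'], value_vars=cs_cols,
             var_name='CS', value_name='CS_v')
sc = df.melt(id_vars=['Product'], value_vars=sc_cols,
             var_name='SC', value_name='SC_v')

# Merging on Product pairs each CS row with each SC row of the same product.
out = cs.merge(sc, on='Product')

# Strip the 'CS_' / 'SC_' prefixes from the key columns.
out['CS'] = out['CS'].str.split('_').str[-1]
out['SC'] = out['SC'].str.split('_').str[-1]

# A single missing value counts as 0; if both are missing the sum stays NaN.
out['SUM_values'] = out['CS_v'].add(out['SC_v'], fill_value=0)

out = (out.drop(columns=['CS_v', 'SC_v'])
          .sort_values(['Product', 'CS', 'SC'])
          .reset_index(drop=True))
print(out)
```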
