Group by multiple columns and get median of dict elements as a new column in pandas

【Title】: Group by multiple columns and get median of dict elements as a new column in pandas 【Posted】: 2022-01-01 15:18:09 【Question】:

I have a DataFrame that looks like this:

+-------+----------+-------------+-----------------------------------------------------+
| item  | category | subcategory |                     sales_count                     |
+-------+----------+-------------+-----------------------------------------------------+
| ItemA |        0 | p           | store1:50,store2:70,store3:90,store4:44,store5:76 |
| ItemB |        0 | p           | store2:22,store3:15,store4:77,store5:0            |
| ItemC |        0 | p           | store1:46,store2:13,store3:9,store4:87,store5:45  |
| ItemD |        0 | q           | store1:88,store2:16,store4:5,store5:2             |
| ItemE |        0 | q           | store1:7,store2:55                                |
| ItemF |        1 | t           | store3:25,store4:75,store5:87                     |
| ItemG |        1 | t           | store1:32,store3:66,store4:87,store5:0            |
| ItemH |        1 | t           | store1:54,store2:33,store3:12,store4:67,store5:8  |
+-------+----------+-------------+-----------------------------------------------------+

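I want to generate a new column that contains the median of the sales counts across each category and subcategory combination.

That is, the 'median_across_group' value for ItemA should be the median of all sales_count values in the group where category = 0 and subcategory = p.

How can I do a groupby and compute the median over dict elements?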


+-------+----------+-------------+-----------------------------------------------------+---------------------------------------------+
| item  | category | subcategory |                     sales_count                     |             median_across_group             |
+-------+----------+-------------+-----------------------------------------------------+---------------------------------------------+
| ItemA |        0 | p           | store1:50,store2:70,store3:90,store4:44,store5:76 | <median of category 0, subcategory p items> |
| ItemB |        0 | p           | store2:22,store3:15,store4:77,store5:0            | <median of category 0, subcategory p items> |
| ItemC |        0 | p           | store1:46,store2:13,store3:9,store4:87,store5:45  | <median of category 0, subcategory p items> |
| ItemD |        0 | q           | store1:88,store2:16,store4:5,store5:2             | <median of category 0, subcategory q items> |
| ItemE |        0 | q           | store1:7,store2:55                                | <median of category 0, subcategory q items> |
| ItemF |        1 | t           | store3:25,store4:75,store5:87                     | <median of category 1, subcategory t items> |
| ItemG |        1 | t           | store1:32,store3:66,store4:87,store5:0            | <median of category 1, subcategory t items> |
| ItemH |        1 | t           | store1:54,store2:33,store3:12,store4:67,store5:8  | <median of category 1, subcategory t items> |
+-------+----------+-------------+-----------------------------------------------------+---------------------------------------------+
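The answers below treat sales_count as a column of Python dicts. If the column instead holds plain strings in the `store1:50,store2:70` form shown in the tables above (an assumption, the question does not say which), a minimal sketch for converting one cell could look like this. The helper name `parse_sales` is hypothetical.

```python
def parse_sales(text):
    """Turn 'store1:50,store2:70' into {'store1': 50, 'store2': 70}."""
    result = {}
    for pair in text.split(','):
        # Each pair looks like 'store1:50'
        store, count = pair.split(':')
        result[store.strip()] = int(count)
    return result


parsed = parse_sales('store1:7,store2:55')
print(parsed)  # {'store1': 7, 'store2': 55}
```

With a DataFrame, this could be applied as `df['sales_count'] = df['sales_count'].apply(parse_sales)` before using either answer.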

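【Comments】:

What have you tried? Where is your code?
You could provide the data as a DataFrame so that we can simply copy it and use it in a solution.
You could add the expected result for the sample data. It helps to check whether a solution works.
I have added my code below as an answer. Thanks for the suggestions about sample data and expected results.

【Answer 1】: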

You might want to try this:

Code

import pandas as pd


df = pd.DataFrame({
    'item': ['ItemA', 'ItemB', 'ItemC', 'ItemD', 'ItemE', 'ItemF', 'ItemG', 'ItemH', ],
    'category': [0, 0, 0, 0, 0, 1, 1, 1],
    'subcategory': ['p', 'p', 'p', 'q', 'q', 't', 't', 't'],
    'sales_count': [
        {'store1':50,'store2':70,'store3':90,'store4':44,'store5':76},
        {'store2':22,'store3':15,'store4':77,'store5':0},
        {'store1':46,'store2':13,'store3':9,'store4':87,'store5':45},
        {'store1':88,'store2':16,'store4':5,'store5':2},
        {'store1':7,'store2':55},
        {'store3':25,'store4':75,'store5':87},
        {'store1':32,'store3':66,'store4':87,'store5':0},
        {'store1':54,'store2':33,'store3':12,'store4':67,'store5':8}
    ]
})

median = {}  # Maps a category+subcategory key to that group's sales values
for idx, row in df.iterrows():

    key_combo   = str(row['category']) + str(row['subcategory'])
    values_list = list(row['sales_count'].values())

    median[key_combo] = (
        values_list                         # Add the list if key not present
        if key_combo not in median else
        median[key_combo] + (values_list)   # Append the new list if key present
    )

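for key, values in median.items():
    # Take the middle element of the sorted values. For an even number of
    # values this is the upper of the two middle elements, not their average.
    median[key] = sorted(values)[len(values) // 2]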

def apply_median(x):
    return median[str(x.category) + str(x.subcategory)]

df['Median'] = df[['category', 'subcategory']].apply(apply_median, axis=1)

print(df)

Output

    item  category subcategory                                        sales_count  Median
0  ItemA         0           p  {'store1': 50, 'store2': 70, 'store3': 90, 'st...      46
1  ItemB         0           p  {'store2': 22, 'store3': 15, 'store4': 77, 'st...      46
2  ItemC         0           p  {'store1': 46, 'store2': 13, 'store3': 9, 'sto...      46
3  ItemD         0           q  {'store1': 88, 'store2': 16, 'store4': 5, 'sto...      16
4  ItemE         0           q                        {'store1': 7, 'store2': 55}      16
5  ItemF         1           t         {'store3': 25, 'store4': 75, 'store5': 87}      54
6  ItemG         1           t  {'store1': 32, 'store3': 66, 'store4': 87, 'st...      54
7  ItemH         1           t  {'store1': 54, 'store2': 33, 'store3': 12, 'st...      54
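Every group in the sample data has an even number of sales values, so picking the element at index `len(values) // 2` returns the upper of the two middle values and not the conventional median, which averages them. A small sketch of the difference, using the values of the category 0, subcategory q group and the standard library's `statistics.median`:

```python
import statistics

# All sales values for category 0, subcategory q (ItemD and ItemE)
values = [88, 16, 5, 2, 7, 55]

# What the loop above stores: the upper of the two middle elements
upper_middle = sorted(values)[len(values) // 2]

# Conventional median: the average of the two middle elements (7 and 16)
true_median = statistics.median(values)

print(upper_middle, true_median)  # 16 11.5
```

If the averaged median is what you need, replacing the indexing expression with `statistics.median(values)` should be enough; the Median column would then change accordingly.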

【Answer 2】:

I found a simpler way:

import numpy as np

def get_dict_median(x):
    flat_list = [i for k in list(x) for i in k]  # Flatten the group's lists into one single list
    return np.median(flat_list)

df['sales_count_list'] = df['sales_count'].apply(lambda x: list(x.values()))
df['group_median'] = df.groupby(['category', 'subcategory'])['sales_count_list'].transform(get_dict_median)
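As an alternative sketch (not the approach of either answer), the dict values can be reshaped into a long format with `explode`, so that the grouping and median are done on a plain numeric column. The names `long`, `sales`, and `result` are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['ItemA', 'ItemB', 'ItemC', 'ItemD', 'ItemE', 'ItemF', 'ItemG', 'ItemH'],
    'category': [0, 0, 0, 0, 0, 1, 1, 1],
    'subcategory': ['p', 'p', 'p', 'q', 'q', 't', 't', 't'],
    'sales_count': [
        {'store1': 50, 'store2': 70, 'store3': 90, 'store4': 44, 'store5': 76},
        {'store2': 22, 'store3': 15, 'store4': 77, 'store5': 0},
        {'store1': 46, 'store2': 13, 'store3': 9, 'store4': 87, 'store5': 45},
        {'store1': 88, 'store2': 16, 'store4': 5, 'store5': 2},
        {'store1': 7, 'store2': 55},
        {'store3': 25, 'store4': 75, 'store5': 87},
        {'store1': 32, 'store3': 66, 'store4': 87, 'store5': 0},
        {'store1': 54, 'store2': 33, 'store3': 12, 'store4': 67, 'store5': 8},
    ],
})

keys = ['category', 'subcategory']

# One row per individual sales value, keeping the group keys alongside it
long = df[keys].assign(sales=df['sales_count'].apply(lambda d: list(d.values())))
long = long.explode('sales')
long['sales'] = long['sales'].astype(float)

# Median per (category, subcategory) group
group_median = (
    long.groupby(keys)['sales'].median().rename('group_median').reset_index()
)

# Attach each group's median to every item in that group
result = df.merge(group_median, on=keys, how='left')
print(result[['item', 'category', 'subcategory', 'group_median']])
```

For this sample data the medians come out as 45.5 for (0, p), 11.5 for (0, q), and 43.5 for (1, t), since each group has an even number of values and the two middle values are averaged.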
