Group by multiple columns and get median of dict elements as a new column in pandas
[Posted]: 2022-01-01 15:18:09
[Question]: I have a DataFrame that looks like this:
+-------+----------+-------------+-----------------------------------------------------+
| item | category | subcategory | sales_count |
+-------+----------+-------------+-----------------------------------------------------+
| ItemA | 0 | p | store1:50,store2:70,store3:90,store4:44,store5:76 |
| ItemB | 0 | p | store2:22,store3:15,store4:77,store5:0 |
| ItemC | 0 | p | store1:46,store2:13,store3:9,store4:87,store5:45 |
| ItemD | 0 | q | store1:88,store2:16,store4:5,store5:2 |
| ItemE | 0 | q | store1:7,store2:55 |
| ItemF | 1 | t | store3:25,store4:75,store5:87 |
| ItemG | 1 | t | store1:32,store3:66,store4:87,store5:0 |
| ItemH | 1 | t | store1:54,store2:33,store3:12,store4:67,store5:8 |
+-------+----------+-------------+-----------------------------------------------------+
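I want to generate a new column containing the median of the sales counts within each category and subcategory group.
That is, the 'median_across_group' value for ItemA should be the median of all sales_count values where category = 0 & subcategory = p.
How can I group by these columns and take the median of the dict elements?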
+-------+----------+-------------+-----------------------------------------------------+---------------------------------------------+
| item | category | subcategory | sales_count | median_across_group |
+-------+----------+-------------+-----------------------------------------------------+---------------------------------------------+
| ItemA | 0 | p | store1:50,store2:70,store3:90,store4:44,store5:76 | <median of category 0, subcategory p items> |
| ItemB | 0 | p | store2:22,store3:15,store4:77,store5:0 | <median of category 0, subcategory p items> |
| ItemC | 0 | p | store1:46,store2:13,store3:9,store4:87,store5:45 | <median of category 0, subcategory p items> |
| ItemD | 0 | q | store1:88,store2:16,store4:5,store5:2 | <median of category 0, subcategory q items> |
| ItemE | 0 | q | store1:7,store2:55 | <median of category 0, subcategory q items> |
| ItemF | 1 | t | store3:25,store4:75,store5:87 | <median of category 1, subcategory t items> |
| ItemG | 1 | t | store1:32,store3:66,store4:87,store5:0 | <median of category 1, subcategory t items> |
| ItemH | 1 | t | store1:54,store2:33,store3:12,store4:67,store5:8 | <median of category 1, subcategory t items> |
+-------+----------+-------------+-----------------------------------------------------+---------------------------------------------+
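The table shows sales_count as text such as store1:50,store2:70. If the column really is stored as strings rather than dicts (an assumption; the question does not say), a minimal sketch for converting it might look like this. The helper name parse_sales is hypothetical, and only two rows of the sample data are used.

```python
import pandas as pd

def parse_sales(text):
    """Turn 'store1:50,store2:70' into {'store1': 50, 'store2': 70}."""
    pairs = (part.split(':') for part in text.split(',') if part.strip())
    return {store.strip(): int(count) for store, count in pairs}

# Two rows from the question, with sales_count held as plain strings (assumed).
raw = pd.DataFrame({
    'item': ['ItemD', 'ItemE'],
    'category': [0, 0],
    'subcategory': ['q', 'q'],
    'sales_count': ['store1:88,store2:16,store4:5,store5:2', 'store1:7,store2:55'],
})

# Replace each string with the parsed dict.
raw['sales_count'] = raw['sales_count'].apply(parse_sales)
print(raw['sales_count'].tolist())
```

After this step each cell holds a dict, which is the form the answers below work with.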
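[Comments]:
What have you tried? Where is your code?
You could provide the data as a DataFrame that we can simply copy and use in a solution.
You could add the expected result for the sample data. It helps to check whether a solution works.
I have added my code below as an answer. Thanks for the suggestions about sample data and expected results.
[Answer 1]: You may want to try this: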
Code
import pandas as pd

df = pd.DataFrame({
    'item': ['ItemA', 'ItemB', 'ItemC', 'ItemD', 'ItemE', 'ItemF', 'ItemG', 'ItemH'],
    'category': [0, 0, 0, 0, 0, 1, 1, 1],
    'subcategory': ['p', 'p', 'p', 'q', 'q', 't', 't', 't'],
    'sales_count': [
        {'store1': 50, 'store2': 70, 'store3': 90, 'store4': 44, 'store5': 76},
        {'store2': 22, 'store3': 15, 'store4': 77, 'store5': 0},
        {'store1': 46, 'store2': 13, 'store3': 9, 'store4': 87, 'store5': 45},
        {'store1': 88, 'store2': 16, 'store4': 5, 'store5': 2},
        {'store1': 7, 'store2': 55},
        {'store3': 25, 'store4': 75, 'store5': 87},
        {'store1': 32, 'store3': 66, 'store4': 87, 'store5': 0},
        {'store1': 54, 'store2': 33, 'store3': 12, 'store4': 67, 'store5': 8},
    ]
})

# Collect all sales values per (category, subcategory) group
median = {}
for idx, row in df.iterrows():
    key_combo = str(row['category']) + str(row['subcategory'])
    values_list = list(row['sales_count'].values())
    median[key_combo] = (
        values_list  # Add the list if key not present
        if key_combo not in median else
        median[key_combo] + values_list  # Append the new list if key present
    )

# Replace each list with its middle element. Note: for an even number of
# values this takes the upper of the two middle elements, not their mean.
for key, values in median.items():
    median[key] = sorted(values)[len(values) // 2]

def apply_median(x):
    return median[str(x.category) + str(x.subcategory)]

df['Median'] = df[['category', 'subcategory']].apply(apply_median, axis=1)
print(df)
Output
    item  category subcategory                                        sales_count  Median
0  ItemA         0           p  {'store1': 50, 'store2': 70, 'store3': 90, 'st...      46
1  ItemB         0           p  {'store2': 22, 'store3': 15, 'store4': 77, 'st...      46
2  ItemC         0           p  {'store1': 46, 'store2': 13, 'store3': 9, 'sto...      46
3  ItemD         0           q  {'store1': 88, 'store2': 16, 'store4': 5, 'sto...      16
4  ItemE         0           q                        {'store1': 7, 'store2': 55}      16
5  ItemF         1           t         {'store3': 25, 'store4': 75, 'store5': 87}      54
6  ItemG         1           t  {'store1': 32, 'store3': 66, 'store4': 87, 'st...      54
7  ItemH         1           t  {'store1': 54, 'store2': 33, 'store3': 12, 'st...      54
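One detail worth checking in this answer: sorted(values)[len(values) // 2] returns the upper of the two middle elements when a group holds an even number of values, which is the case for all three groups in the sample data. The sketch below contrasts that with the conventional median (the mean of the two middle values) for the category 0, subcategory q group, using the standard library's statistics.median. The variable names are illustrative only.

```python
from statistics import median

# Pooled sales counts for category 0, subcategory q (ItemD and ItemE).
values = [88, 16, 5, 2, 7, 55]

# Upper-middle element, as computed by sorted(values)[len(values) // 2].
upper_middle = sorted(values)[len(values) // 2]

# Conventional median: mean of the two middle values for an even count.
true_median = median(values)

print(upper_middle, true_median)  # 16 11.5
```

Whether 16 or 11.5 is the wanted figure depends on the use case; swapping in statistics.median (or numpy.median) inside the loop would give the conventional definition.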
[Discussion]:
[Answer 2]: I found a simpler way:
import numpy as np

def get_dict_median(x):
    flat_list = [i for k in list(x) for i in k]  # Flatten all lists into one single list
    return np.median(flat_list)

df['sales_count_list'] = df['sales_count'].apply(lambda x: list(x.values()))
df['group_median'] = df.groupby(['category', 'subcategory'])['sales_count_list'].transform(get_dict_median)
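As a possible alternative to flattening inside a custom function, the same grouping can be sketched with DataFrame.explode, which turns each list element into its own row so that the built-in groupby median applies directly. This is a sketch under the assumption that sales_count holds dicts, as in Answer 1; the names long_df, sales and result are illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['ItemA', 'ItemB', 'ItemC', 'ItemD', 'ItemE', 'ItemF', 'ItemG', 'ItemH'],
    'category': [0, 0, 0, 0, 0, 1, 1, 1],
    'subcategory': ['p', 'p', 'p', 'q', 'q', 't', 't', 't'],
    'sales_count': [
        {'store1': 50, 'store2': 70, 'store3': 90, 'store4': 44, 'store5': 76},
        {'store2': 22, 'store3': 15, 'store4': 77, 'store5': 0},
        {'store1': 46, 'store2': 13, 'store3': 9, 'store4': 87, 'store5': 45},
        {'store1': 88, 'store2': 16, 'store4': 5, 'store5': 2},
        {'store1': 7, 'store2': 55},
        {'store3': 25, 'store4': 75, 'store5': 87},
        {'store1': 32, 'store3': 66, 'store4': 87, 'store5': 0},
        {'store1': 54, 'store2': 33, 'store3': 12, 'store4': 67, 'store5': 8},
    ],
})

# One row per individual sales value; explode repeats the other columns.
long_df = df.assign(
    sales=df['sales_count'].map(lambda d: list(d.values()))
).explode('sales')
long_df['sales'] = long_df['sales'].astype(float)

# Median per (category, subcategory) group.
group_median = (
    long_df.groupby(['category', 'subcategory'])['sales']
    .median()
    .rename('group_median')
    .reset_index()
)

# Attach the group median back to every item row.
result = df.merge(group_median, on=['category', 'subcategory'], how='left')
print(result[['item', 'category', 'subcategory', 'group_median']])
```

For the sample data this gives 45.5 for group (0, p), 11.5 for (0, q) and 43.5 for (1, t), since each group has an even number of values and the median is the mean of the two middle ones.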
[Discussion]: