Pandas - Count number of purchases for each customer for each specific product
【Posted】: 2022-01-11 02:00:22
【Question】:
Input data in a JSON file (transaction history):
{"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-09-01 11:09:00"}
{"customer_id": "C27", "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349}, {"product_id": "P47", "price": 180}], "date_of_purchase": "2021-09-06 04:52:08.505909"}
{"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-10-01 11:09:00"}
Dataframe:
  customer_id                                             basket            date_of_purchase
0          C4               [{'product_id': 'P31', 'price': 26}]  2021-09-06 05:47:08.505909
1         C13              [{'product_id': 'P36', 'price': 566}]  2021-09-06 03:52:08.505909
2         C15              [{'product_id': 'P02', 'price': 839}]  2021-09-06 05:48:08.505909
3         C22             [{'product_id': 'P37', 'price': 1235}]  2021-09-05 20:52:08.505909
4         C27  [{'product_id': 'P57', 'price': 154}, {'produc...  2021-09-06 04:52:08.505909
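My code for reading the JSON into a dataframe:
import glob
import pandas

def read_json_folder(json_folder: str):
    # Collect every .json file one directory level below json_folder
    transactions_files = glob.glob("{}*/*.json".format(json_folder))
    return pandas.concat(pandas.read_json(tf, lines=True) for tf in transactions_files)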
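A variant of the loader above, as a minimal sketch only. It assumes the files live in immediate subfolders of `json_folder` (that layout is a guess based on the glob pattern) and that each file is in JSON Lines format. Passing `ignore_index=True` to `pandas.concat` gives the combined frame a fresh 0..n-1 index instead of repeating each file's own row numbers.

```python
from pathlib import Path

import pandas as pd


def read_json_folder(json_folder: str) -> pd.DataFrame:
    """Read every *.json file found in the immediate subfolders of json_folder."""
    # Assumed layout: json_folder/<subfolder>/<file>.json, one JSON object per line
    files = sorted(Path(json_folder).glob("*/*.json"))
    frames = [pd.read_json(path, lines=True) for path in files]
    # ignore_index=True renumbers the rows across all files
    return pd.concat(frames, ignore_index=True)
```

Note that `pandas.concat` raises a `ValueError` when no files match, so an empty folder needs separate handling.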
For each transaction, I need the customer ID and the number of times that customer has bought a specific product.
Expected output:
customer_id product_id purchase_count
C1 P2 11
C1 P3 5
C2 P9 7
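【Comments】:
Do you already have the JSON in your dataframe?
@user17242583 Yes, it's already in the dataframe.
How did you get it in there? Like this? pd.json_normalize(j, record_path='basket', meta='customer_id') (where j is a list of JSON objects)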
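The `pd.json_normalize` call suggested in the comment can be tried on the sample records from the question. This is a minimal sketch, assuming `j` is a plain Python list of dicts shaped like the JSON lines above; `meta` is passed as a list here, which is the documented form.

```python
import pandas as pd

# Sample records copied from the question; j is a list of JSON objects (dicts)
j = [
    {"customer_id": "C1",
     "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}],
     "date_of_purchase": "2018-09-01 11:09:00"},
    {"customer_id": "C27",
     "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349},
                {"product_id": "P47", "price": 180}],
     "date_of_purchase": "2021-09-06 04:52:08.505909"},
    {"customer_id": "C1",
     "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}],
     "date_of_purchase": "2018-10-01 11:09:00"},
]

# One row per basket item; customer_id is carried along as metadata
flat = pd.json_normalize(j, record_path="basket", meta=["customer_id"])
print(flat)
```

The flattened frame has the columns product_id, price and customer_id, which is the shape a plain groupby on customer_id and product_id expects.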
【Answer 1】:
Build a dataframe from the data:
- read_json with the lines argument
- explode the basket list so that each basket item gets its own row
- split the product information into product_id and price
- drop the columns that are not needed
Build the result dataframe from df:
- group and count
- rename the count column
>>> from io import StringIO
>>> import pandas as pd
>>> TESTDATA = """
... {"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-09-01 11:09:00"}
... {"customer_id": "C27", "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349}, {"product_id": "P47", "price": 180}], "date_of_purchase": "2021-09-06 04:52:08.505909"}
... {"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-10-01 11:09:00"}
... """
>>> # StringIO wrapper: recent pandas versions deprecate passing a literal JSON string to read_json
>>> df = pd.read_json(StringIO(TESTDATA), lines=True)
>>> df = df.explode('basket')
>>> df[['product_id', 'price']] = df['basket'].apply(pd.Series)
>>> df.drop(['basket', 'price'], axis=1, inplace=True)
>>> df2 = df.groupby(['customer_id', 'product_id'], as_index=False).count()
>>> df2.rename(columns={'date_of_purchase': 'purchase_count'}, inplace=True)
>>> df2
customer_id product_id purchase_count
0 C1 P3 2
1 C1 P4 2
2 C27 P42 1
3 C27 P47 1
4 C27 P57 1
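The steps in this answer can also be wrapped in a small helper. The sketch below is one possible arrangement, not the answerer's code: `count_purchases` is a hypothetical name, it assumes every basket is a non-empty list of dicts with a product_id key, and it swaps `count()` plus `rename` for `groupby(...).size()`, which counts rows directly.

```python
import pandas as pd


def count_purchases(transactions: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (customer_id, product_id) with a purchase_count column."""
    # One row per basket item; ignore_index avoids duplicated index labels
    items = transactions.explode("basket", ignore_index=True)
    # Assumption: each basket entry is a dict that has a "product_id" key
    items["product_id"] = items["basket"].map(lambda entry: entry["product_id"])
    return (
        items.groupby(["customer_id", "product_id"])
        .size()
        .reset_index(name="purchase_count")
    )


# Sample transactions copied from the question
sample = pd.DataFrame([
    {"customer_id": "C1",
     "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}],
     "date_of_purchase": "2018-09-01 11:09:00"},
    {"customer_id": "C27",
     "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349},
                {"product_id": "P47", "price": 180}],
     "date_of_purchase": "2021-09-06 04:52:08.505909"},
    {"customer_id": "C1",
     "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}],
     "date_of_purchase": "2018-10-01 11:09:00"},
])
result = count_purchases(sample)
print(result)
```

Using `size()` avoids depending on which leftover column happens to hold the count, so no rename step is needed.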
【Comments】:
The third column should be purchase_count rather than date_of_purchase.
@Casper2210, I added a line that renames it.
【Answer 2】:
If your dataframe looks like this:
import pandas as pd

shop_list = [
    {"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-09-01 11:09:00"},
    {"customer_id": "C27", "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349}, {"product_id": "P47", "price": 180}], "date_of_purchase": "2021-09-06 04:52:08.505909"},
    {"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-10-01 11:09:00"}
]
shop = pd.DataFrame(shop_list)
First, collect all the products bought by each customer:
customer_groupby = shop.groupby('customer_id')['basket'].apply(list).to_dict()
for k in customer_groupby.keys():
    # Flatten the customer's baskets into a single list of product IDs
    customer_groupby[k] = [item['product_id'] for sublist in customer_groupby[k] for item in sublist]
output:
# {'C1': ['P3', 'P4', 'P3', 'P4'], 'C27': ['P57', 'P42', 'P47']}
Then build the result table:
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
rows = []
for customer, value in customer_groupby.items():
    items = set(value)
    for item in items:
        rows.append({'customer_id': customer, 'product_id': item, 'purchase_count': value.count(item)})
table = pd.DataFrame(rows, columns=['customer_id', 'product_id', 'purchase_count'])
Final result:
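The result table itself is missing from this copy of the answer. As a rough way to reproduce an equivalent table, the sketch below tallies (customer, product) pairs with `collections.Counter` from the standard library. It is an alternative to the loop in the answer, not the answerer's code, and it works directly on the `shop_list` records rather than on the dataframe.

```python
from collections import Counter

import pandas as pd

# Sample records copied from the answer
shop_list = [
    {"customer_id": "C1",
     "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}],
     "date_of_purchase": "2018-09-01 11:09:00"},
    {"customer_id": "C27",
     "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349},
                {"product_id": "P47", "price": 180}],
     "date_of_purchase": "2021-09-06 04:52:08.505909"},
    {"customer_id": "C1",
     "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}],
     "date_of_purchase": "2018-10-01 11:09:00"},
]

# Count every (customer_id, product_id) pair across all baskets
counts = Counter(
    (record["customer_id"], item["product_id"])
    for record in shop_list
    for item in record["basket"]
)

# Sort the pairs so the row order is deterministic
table = pd.DataFrame(
    [(customer, product, n) for (customer, product), n in sorted(counts.items())],
    columns=["customer_id", "product_id", "purchase_count"],
)
print(table)
```

Counting in one pass avoids calling `list.count` once per distinct product, which matters when a customer has many purchases.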
【Comments】:
Does this solution answer your question? @Casper2210
【Answer 3】:
Try this:
purchase_counts = df.groupby(['customer_id', 'product_id'], as_index=False).count()
Output:
>>> purchase_counts
customer_id product_id price
0 C1 P3 2
1 C1 P4 2
2 C27 P42 1
3 C27 P47 1
4 C27 P57 1
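This answer appears to assume that df is already flat, with one row per purchased item and the columns customer_id, product_id and price, which is why the count ends up under price. A minimal sketch that names the column purchase_count directly uses `DataFrame.value_counts`; the flat frame built here is an assumption made for illustration.

```python
import pandas as pd

# Assumed flat layout: one row per purchased item
df = pd.DataFrame({
    "customer_id": ["C1", "C1", "C27", "C27", "C27", "C1", "C1"],
    "product_id": ["P3", "P4", "P57", "P42", "P47", "P3", "P4"],
    "price": [506, 121, 154, 349, 180, 506, 121],
})

# value_counts tallies each distinct (customer_id, product_id) pair
purchase_counts = (
    df[["customer_id", "product_id"]]
    .value_counts()
    .reset_index(name="purchase_count")
)
print(purchase_counts)
```

Rows come back ordered by count, highest first; chain `sort_values(["customer_id", "product_id"])` if the order shown in the answer is preferred.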
【Comments】:
If my code doesn't work for you, could you add a sample of your dataframe to the question?