Using the pandas.cut() function
Posted by 芒果去核
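The pandas.cut() function bins values into discrete intervals. In data analysis, suppose you have a set of ages and want to analyze users by age group. You can use the age ranges of each group as the bin edges, for example bins = [1,28,50,150] with labels = ["youth","middle-aged","elderly"]. Once the data is binned, it is easy to pull out the users in each age group. This applies not only to ages: it is useful for any data that needs to be split into intervals.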
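The age-group idea above can be sketched as follows. This is a minimal illustration, not the author's code: the `ages` values are made up, and the English label strings stand in for the labels mentioned in the text.

```python
import pandas as pd

# Hypothetical sample data: these ages are invented for illustration.
ages = pd.Series([5, 19, 28, 35, 50, 67, 120])

bins = [1, 28, 50, 150]                        # bin edges from the text above
labels = ["youth", "middle-aged", "elderly"]   # one label per interval

# Intervals are right-closed by default: (1, 28], (28, 50], (50, 150]
groups = pd.cut(ages, bins=bins, labels=labels)

# Select the values that fall into one age group.
middle_aged = ages[groups == "middle-aged"]

print(groups.tolist())
print(middle_aged.tolist())
```

Note that 28 lands in the first group and 50 in the second, because each interval includes its right edge by default.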
1. Syntax and parameters
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
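Parameters:
x: the input array to be binned; must be one-dimensional
bins: the criterion for binning; can be an int, a sequence of scalars, or an IntervalIndex
right: whether each bin includes its right edge; default True (right-closed intervals), False excludes the right edge
labels: the labels to return, one per bin
retbins: whether to also return the bins; especially useful when bins is given as a scalar (an int); default False
precision: the precision used for the bin labels; int
include_lowest: whether the first interval is left-inclusive (closed on the left); default False (not included), True includes it
duplicates: optional, 'raise' (default) or 'drop'; if the bin edges are not unique, either raise a ValueError or drop the non-unique edges
ordered: whether the labels are ordered; default True. If True, the resulting categorical is ordered; if False, it is unordered (labels must be provided)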
2. Parameters in detail (with examples)
import numpy as np
import pandas as pd
2.1 bins
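bins: the criterion for binning. It can be an int, a sequence of scalars, or an IntervalIndex.
When bins is an integer, it gives the number of equal-width bins.
# Split the data into 3 equal-width bins; the result shows the interval each value falls into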
pd.cut(np.array([2,6,4,8,1,5,9]),bins=3)
[(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
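As shown, the input one-dimensional array is automatically split into three equal-width intervals: (0.992, 3.667], (3.667, 6.333] and (6.333, 9.0]. Each value is mapped to the interval it falls into; for example, 2 belongs to (0.992, 3.667], so that interval is returned.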
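Where the 0.992 comes from can be reproduced by hand. The sketch below assumes that the lowest edge is pushed down by 0.1% of the data range so that the minimum value still falls inside the first right-closed interval; this matches the edges shown in the output above, but treat the exact rule as an assumption about the pandas version in use.

```python
import numpy as np
import pandas as pd

x = np.array([2, 6, 4, 8, 1, 5, 9])
n_bins = 3

# Equal-width edges between the minimum and maximum of the data.
edges = np.linspace(x.min(), x.max(), n_bins + 1)

# Assumption: lower the first edge by 0.1% of the range so that the
# minimum (1) is inside the first interval (0.992, 3.667].
edges[0] -= (x.max() - x.min()) * 0.001

# Compare with the edges pandas itself reports.
_, pandas_edges = pd.cut(x, bins=n_bins, retbins=True)

print(edges)
print(np.allclose(edges, pandas_edges))
```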
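When bins is a sequence of scalars (a list in this example), it specifies the bin edges. Any value in x that falls outside every interval is returned as NaN. Here the value 1 becomes NaN because the first interval (1, 4] is open on the left.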
pd.cut(np.array([2,6,4,8,1,5,9]),bins=[1,4,7,10])
[(1.0, 4.0], (4.0, 7.0], (1.0, 4.0], (7.0, 10.0], NaN, (4.0, 7.0], (7.0, 10.0]]
Categories (3, interval[int64]): [(1, 4] < (4, 7] < (7, 10]]
When bins is an IntervalIndex, values not covered by the IntervalIndex are set to NaN.
bins = pd.IntervalIndex.from_tuples([(0, 2), (3, 6), (7, 8)]) # create the IntervalIndex
pd.cut(np.array([2,6,4,8,1,5,9]),bins)
[(0.0, 2.0], (3.0, 6.0], (3.0, 6.0], (7.0, 8.0], (0.0, 2.0], (3.0, 6.0], NaN]
Categories (3, interval[int64]): [(0, 2] < (3, 6] < (7, 8]]
2.2 retbins
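retbins: whether to also return the bins. Especially useful when bins is given as a scalar (an int). Default False.
# retbins=True also returns the edges of the equal-width bins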
pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,retbins=True)
([(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]],
array([0.992 , 3.66666667, 6.33333333, 9. ]))
As shown, a one-dimensional array, array([0.992 , 3.66666667, 6.33333333, 9. ]), is returned as well. This array holds the bin edges used for the split: bins=[0.992 , 3.66666667, 6.33333333, 9. ]
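One practical use of the returned edges is to bin a second dataset with exactly the same intervals. The following is a small sketch; the `new` array is hypothetical data chosen for illustration.

```python
import numpy as np
import pandas as pd

train = np.array([2, 6, 4, 8, 1, 5, 9])
new = np.array([3, 7, 10])   # hypothetical new observations

# Compute the edges once from the first dataset...
_, edges = pd.cut(train, bins=3, retbins=True)

# ...then reuse them so both datasets share the same intervals.
# labels=False returns the integer position of each bin.
new_codes = pd.cut(new, bins=edges, labels=False)

print(new_codes)
```

The value 10 lies beyond the last edge (9.0), so it is not assigned to any bin and comes back as NaN.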
2.3 precision
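precision: an int that sets how many decimal places are used for the interval edges in the labels. In the output below, precision=0 rounds the inner edges to 4.0 and 6.0, while precision=1 gives 3.7 and 6.3.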
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,precision=0))
print("="*110)
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,precision=1))
print("="*110)
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,precision=2))
print("="*110)
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,precision=3))
[(1.0, 4.0], (4.0, 6.0], (4.0, 6.0], (6.0, 9.0], (1.0, 4.0], (4.0, 6.0], (6.0, 9.0]]
Categories (3, interval[float64]): [(1.0, 4.0] < (4.0, 6.0] < (6.0, 9.0]]
==============================================================================================================
[(1.0, 3.7], (3.7, 6.3], (3.7, 6.3], (6.3, 9.0], (1.0, 3.7], (3.7, 6.3], (6.3, 9.0]]
Categories (3, interval[float64]): [(1.0, 3.7] < (3.7, 6.3] < (6.3, 9.0]]
==============================================================================================================
[(0.99, 3.67], (3.67, 6.33], (3.67, 6.33], (6.33, 9.0], (0.99, 3.67], (3.67, 6.33], (6.33, 9.0]]
Categories (3, interval[float64]): [(0.99, 3.67] < (3.67, 6.33] < (6.33, 9.0]]
==============================================================================================================
[(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
2.4 labels
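labels: specifies the labels for the returned bins. Must have the same length as the resulting bins. If False, only the integer indicators of the bins are returned. This parameter is ignored when bins is an IntervalIndex. If True, an error is raised.
The equal-width intervals are replaced by the given labels. The number of labels must match the number of bins: n bins need n labels.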
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3))
print("="*110)
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,labels=["L","M","H"]))
[(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
==============================================================================================================
['L', 'M', 'M', 'H', 'L', 'M', 'H']
Categories (3, object): ['L' < 'M' < 'H']
The interval values are replaced by the values in labels. In this example "L" = (0.992, 3.667], "M" = (3.667, 6.333], and "H" = (6.333, 9.0].
pd.cut(np.array([2,6,4,8,1,5,9]),bins=[1,4,7,10],labels=["L","M","H"])
['L', 'M', 'L', 'H', NaN, 'M', 'H']
Categories (3, object): ['L' < 'M' < 'H']
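The labels=False case mentioned above is not demonstrated in the examples, so here is a minimal sketch of it. Instead of intervals or strings, each value is mapped to the integer position of its bin.

```python
import numpy as np
import pandas as pd

x = np.array([2, 6, 4, 8, 1, 5, 9])

# labels=False returns integer bin indicators: 0 for the first bin,
# 1 for the second, and so on.
codes = pd.cut(x, bins=3, labels=False)

print(codes)
```

This form is convenient when the bin number is going to be used as a numeric feature or as an array index.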
2.5 ordered
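ordered: whether the labels are ordered. Default True. If True, the resulting categorical is ordered; if False, it is unordered.
Note: ordered=False must be used together with the labels parameter, otherwise an error is raised.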
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,labels=["L","M","H"]))
print("="*110)
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,labels=["L","M","H"],ordered=False))
print("="*110)
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,labels=["L","M","H"],ordered=True))
['L', 'M', 'M', 'H', 'L', 'M', 'H']
Categories (3, object): ['L' < 'M' < 'H']
==============================================================================================================
['L', 'M', 'M', 'H', 'L', 'M', 'H']
Categories (3, object): ['L', 'M', 'H']
==============================================================================================================
['L', 'M', 'M', 'H', 'L', 'M', 'H']
Categories (3, object): ['L' < 'M' < 'H']
['L' < 'M' < 'H'] is ordered; ['L', 'M', 'H'] is unordered.
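The practical difference between the two shows up when comparing values. The sketch below illustrates that an ordered categorical supports comparisons such as "greater than", while an unordered one only supports equality checks; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

x = np.array([2, 6, 4, 8, 1, 5, 9])
ordered_cat = pd.cut(x, bins=3, labels=["L", "M", "H"])
unordered_cat = pd.cut(x, bins=3, labels=["L", "M", "H"], ordered=False)

# Ordered: "M" and "H" both rank above "L".
above_low = ordered_cat > "L"
print(x[above_low])

# Unordered: an ordering comparison is rejected with a TypeError.
try:
    unordered_cat > "L"
    comparison_failed = False
except TypeError:
    comparison_failed = True
print(comparison_failed)
```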
2.6 right
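right: whether each bin includes its right edge. Default True (right-closed); False excludes the right edge.
# whether each bin includes its right edge
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3)) # default is True: each interval is left-open, right-closed
print("="*110)
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,right=True)) # left-open, right-closed: includes each interval's right edge
print("="*110)
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,right=False)) # left-closed, right-open: excludes each interval's right edge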
[(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
==============================================================================================================
[(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
==============================================================================================================
[[1.0, 3.667), [3.667, 6.333), [3.667, 6.333), [6.333, 9.008), [1.0, 3.667), [3.667, 6.333), [6.333, 9.008)]
Categories (3, interval[float64]): [[1.0, 3.667) < [3.667, 6.333) < [6.333, 9.008)]
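The effect of right is easiest to see for values that sit exactly on a bin edge. The following sketch uses explicit edges (an illustrative choice, not taken from the example above) and labels=False so that the bin position is visible directly.

```python
import numpy as np
import pandas as pd

x = np.array([4, 7])          # both values lie exactly on an edge
edges = [1, 4, 7, 10]

# right=True (default): intervals are (1, 4], (4, 7], (7, 10]
right_closed = pd.cut(x, bins=edges, labels=False)

# right=False: intervals are [1, 4), [4, 7), [7, 10)
left_closed = pd.cut(x, bins=edges, right=False, labels=False)

print(right_closed)
print(left_closed)
```

With right=True the edge value belongs to the interval that ends there; with right=False it belongs to the interval that starts there.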
2.7 include_lowest
include_lowest: whether the first interval is left-inclusive. Default False (not included); True includes it.
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3))
print("="*110)
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,include_lowest=False))
print("="*110)
print(pd.cut(np.array([2,6,4,8,1,5,9]),bins=3,include_lowest=True))
[(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
==============================================================================================================
[(0.992, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.992, 3.667], (3.667, 6.333], (6.333, 9.0]]
Categories (3, interval[float64]): [(0.992, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
==============================================================================================================
[(0.991, 3.667], (3.667, 6.333], (3.667, 6.333], (6.333, 9.0], (0.991, 3.667], (3.667, 6.333], (6.333, 9.0]]
Categories (3, interval[float64]): [(0.991, 3.667] < (3.667, 6.333] < (6.333, 9.0]]
With include_lowest=True, the first interval changes from (0.992, 3.667] to (0.991, 3.667], so that it now includes 0.992.
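The parameter matters more when the edges are given explicitly and the smallest value sits exactly on the first edge. A minimal sketch, reusing the explicit edges [1, 4, 7, 10] from the bins section:

```python
import numpy as np
import pandas as pd

x = np.array([2, 6, 4, 8, 1, 5, 9])
edges = [1, 4, 7, 10]

# Default: the first interval is (1, 4], so the value 1 is left out (NaN).
default_codes = pd.cut(x, bins=edges, labels=False)

# include_lowest=True closes the first interval on the left,
# so the value 1 is assigned to the first bin.
with_lowest = pd.cut(x, bins=edges, labels=False, include_lowest=True)

print(default_codes)
print(with_lowest)
```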
2.8 duplicates
duplicates: 'raise' (default) or 'drop'. If the bin edges are not unique, the default raises a ValueError, as in the following statement:
pd.cut(np.array([2,6,4,8,1,9,9]),bins=[0,3,6,9,9])
The error message is as follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-81-e463bd85b4bf> in <module>
1 # duplicates: default 'raise', or 'drop'. If the bin edges are not unique, raise ValueError or drop the non-unique edges.
----> 2 print(pd.cut(np.array([2,6,4,8,1,9,9]),bins=[0,3,6,9,9]))
F:\Anaconda_all\Anaconda\lib\site-packages\pandas\core\reshape\tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates, ordered)
271 raise ValueError("bins must increase monotonically.")
272
--> 273 fac, bins = _bins_to_cuts(
274 x,
275 bins,
F:\Anaconda_all\Anaconda\lib\site-packages\pandas\core\reshape\tile.py in _bins_to_cuts(x, bins, right, labels, precision, include_lowest, dtype, duplicates, ordered)
397 if len(unique_bins) < len(bins) and len(bins) != 2:
398 if duplicates == "raise":
--> 399 raise ValueError(
400 f"Bin edges must be unique: {repr(bins)}.\n"
401 f"You can drop duplicate edges by setting the 'duplicates' kwarg"
ValueError: Bin edges must be unique: array([0, 3, 6, 9, 9]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Solution: use duplicates="drop" to remove the duplicate edges.
print(pd.cut(np.array([2,6,4,8,1,9,9]),bins=[0,3,6,9,9],duplicates="drop"))
[(0, 3], (3, 6], (3, 6], (6, 9], (0, 3], (6, 9], (6, 9]]
Categories (3, interval[int64]): [(0, 3] < (3, 6] < (6, 9]]
Multiple duplicated values can be removed as well.
pd.cut(np.array([2,6,4,8,1,9,9]),bins=[0,3,6,6,9,9],duplicates="drop")
[(0, 3], (3, 6], (3, 6], (6, 9], (0, 3], (6, 9], (6, 9]]
Categories (3, interval[int64]): [(0, 3] < (3, 6] < (6, 9]]
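Both behaviours can be checked in one short script. This is a sketch that catches the error instead of letting it propagate, then falls back to duplicates="drop"; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

x = np.array([2, 6, 4, 8, 1, 9, 9])
edges = [0, 3, 6, 9, 9]   # the edge 9 appears twice

# Default duplicates="raise": non-unique edges trigger a ValueError.
try:
    pd.cut(x, bins=edges)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# duplicates="drop" removes the repeated edge before binning.
result = pd.cut(x, bins=edges, duplicates="drop")
print(result.categories)
```

If the duplicate edges come from computed values such as quantiles, dropping them silently reduces the number of bins, so it is worth checking how many categories remain.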
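using pandas.cut() and setting it as the index of a dataframe
Posted: 2018-01-25 01:30:33
Question: I'm trying to find an easier way to run aggregate functions with my dataframe, rather than manually extracting the data and running the functions separately from the dataframe itself. I have football stats for a team that I'd like to analyze and compute statistics on by age. I want to bin the ages and then run statistics based on those age groups. More specifically, I have a df: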
df = pd.DataFrame({'Age':[20,30,22,27,35,33,22,28,29,21,28,33,29,27,31,20,25,26,31,33,29,18],
'Goals':np.random.randint(1,6,22),
'Shots on Goals':np.random.randint(5,20,22),
'Yellow Cards':np.random.randint(1,6,22),
'Assists':np.random.randint(0,16,22)})
df['Age Grps'] = pd.cut(df.Age, bins =[17,24,28,32,36])
df.set_index(['Age Grps'], inplace = True)
df.head(8)
This outputs the following dataframe, with the index set to the binned age groups:
| Age Grps | Age | Assists | Goals | Shot on Goals | Yellow Cards |
|----------|-----|---------|-------|---------------|--------------|
| (17,24] | 20 | 3 | 3 | 13 | 2 |
| (28, 32] | 30 | 2 | 3 | 11 | 3 |
| (17,24] | 22 | 10 | 3 | 14 | 5 |
| (24,28] | 27 | 3 | 1 | 16 | 3 |
| (32,36] | 35 | 1 | 4 | 5 | 1 |
| (32,36] | 33 | 5 | 4 | 17 | 1 |
| (17,24] | 22 | 14 | 5 | 13 | 3 |
| (24,28] | 28 | 14 | 2 | 7 | 4 |
Is it possible to group by the current index (Age Grps) to produce the following result:
╔══════════╦═════╦═════════╦═══════╦═══════════════╦══════════════╗
║ Age Grps ║ Age ║ Assists ║ Goals ║ Shot on Goals ║ Yellow Cards ║
╠══════════╬═════╬═════════╬═══════╬═══════════════╬══════════════╣
║ (17,24] ║ 20 ║ 3 ║ 3 ║ 13 ║ 2 ║
║ ╠═════╬═════════╬═══════╬═══════════════╬══════════════╣
║ ║ 22 ║ 14 ║ 5 ║ 13 ║ 3 ║
║ ╠═════╬═════════╬═══════╬═══════════════╬══════════════╣
║ ║ 22 ║ 10 ║ 3 ║ 14 ║ 5 ║
╠══════════╬═════╬═════════╬═══════╬═══════════════╬══════════════╣
║ (24,28] ║ 27 ║ 3 ║ 1 ║ 16 ║ 3 ║
║ ╠═════╬═════════╬═══════╬═══════════════╬══════════════╣
║ ║ 28 ║ 14 ║ 2 ║ 7 ║ 4 ║
╠══════════╬═════╬═════════╬═══════╬═══════════════╬══════════════╣
║ (28,32] ║ 28 ║ 14 ║ 2 ║ 7 ║ 4 ║
╠══════════╬═════╬═════════╬═══════╬═══════════════╬══════════════╣
║ (32,36] ║ 35 ║ 1 ║ 4 ║ 5 ║ 1 ║
║ ╠═════╬═════════╬═══════╬═══════════════╬══════════════╣
║ ║ 33 ║ 5 ║ 4 ║ 17 ║ 4 ║
╚══════════╩═════╩═════════╩═══════╩═══════════════╩══════════════╝
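What I'd like to do is run summary statistics for each age group, such as the average assists, average goals, and average shots per age group. For example: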
df['Average Goals'] = df.groupby('bucket')['Goals'].mean()
df['Average Assists'] = df.groupby('bucket')['Assists'].mean()
in order to produce a table like this:
╔══════════╦═════╦═════════╦═════════════════╦═══════╦═══════════════╦═══════════════╦══════════════╗
║ Index ║ Age ║ Assists ║ Average Assists ║ Goals ║ Average Goals ║ Shot on Goals ║ Yellow Cards ║
╠══════════╬═════╬═════════╬═════════════════╬═══════╬═══════════════╬═══════════════╬══════════════╣
║ (17,24] ║ 20 ║ 3 ║ 9 ║ 3 ║ 3.67 ║ 13 ║ 2 ║
║ ╠═════╬═════════╣ ╬═══════╬ ╬═══════════════╬══════════════╣
║ ║ 22 ║ 14 ║ ║ 5 ║ ║ 13 ║ 3 ║
║ ╠═════╬═════════╣ ╬═══════╬ ╬═══════════════╬══════════════╣
║ ║ 22 ║ 10 ║ ║ 3 ║ ║ 14 ║ 5 ║
╠══════════╬═════╬═════════╬═════════════════╬═══════╬═══════════════╬═══════════════╬══════════════╣
║ (24,28] ║ 27 ║ 3 ║ 8.5 ║ 1 ║ 1.5 ║ 16 ║ 3 ║
║ ╠═════╬═════════╣ ╬═══════╬ ╬═══════════════╬══════════════╣
║ ║ 28 ║ 14 ║ ║ 2 ║ ║ 7 ║ 4 ║
╠══════════╬═════╬═════════╬═════════════════╬═══════╬═══════════════╬═══════════════╬══════════════╣
║ (28,32] ║ 28 ║ 14 ║ 14 ║ 2 ║ 2 ║ 7 ║ 4 ║
╠══════════╬═════╬═════════╬═════════════════╬═══════╬═══════════════╬═══════════════╬══════════════╣
║ (32,36] ║ 35 ║ 1 ║ 3 ║ 4 ║ 4 ║ 5 ║ 1 ║
║ ╠═════╬═════════╣ ╬═══════╬ ╬═══════════════╬══════════════╣
║ ║ 33 ║ 5 ║ ║ 4 ║ ║ 17 ║ 4 ║
╚══════════╩═════╩═════════╩═════════════════╩═══════╩═══════════════╩═══════════════╩══════════════╝
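I know I could extract the data as lists and compute the statistics I need, but I'm trying to do things in a "pandorable" way. Also, I'll be plotting this data with matplotlib, and I'd like to use the simple pandas and matplotlib API, df.plot().
Thanks in advance for your help.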
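Answer 1: If you need new columns in the original df, I think you want transform. However, if the index has been set from the Age Grps column, it returns a lot of warnings: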
df['Age Grps'] = pd.cut(df.Age, bins =[17,24,28,32,36])
df = df.sort_values('Age Grps')
df['Average Goals'] = df.groupby('Age Grps')['Goals'].transform('mean')
df['Average Assists'] = df.groupby('Age Grps')['Assists'].transform('mean')
But if you need aggregated data, use DataFrameGroupBy.agg:
df1 = (df.groupby(pd.cut(df.Age, bins=[17,24,28,32,36]))
         .agg({'Goals':'mean', 'Assists':'mean', 'Yellow Cards':'sum'}))
print (df1)
Yellow Cards Assists Goals
Age
(17, 24] 12 8.000000 3.166667
(24, 28] 18 4.833333 1.833333
(28, 32] 21 11.333333 3.000000
(32, 36] 13 11.000000 2.250000
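To make the difference between the two approaches concrete, here is a self-contained sketch with a small, made-up dataset (the numbers are not the asker's data). Passing observed=False explicitly is a choice made here so that the result does not depend on the default of the pandas version in use.

```python
import pandas as pd

# Small invented dataset for illustration only.
df = pd.DataFrame({
    "Age":     [20, 22, 27, 28, 30, 35],
    "Goals":   [3, 5, 1, 2, 3, 4],
    "Assists": [3, 14, 3, 14, 2, 1],
})
df["Age Grps"] = pd.cut(df["Age"], bins=[17, 24, 28, 32, 36])

# agg collapses each age group into a single summary row.
summary = (df.groupby("Age Grps", observed=False)
             .agg({"Goals": "mean", "Assists": "mean"}))

# transform keeps the original row count and repeats each
# group's mean on every row of that group.
df["Average Goals"] = (df.groupby("Age Grps", observed=False)["Goals"]
                         .transform("mean"))

print(summary)
print(df)
```

Use transform when the group statistic should sit next to the original rows, as in the asker's third table, and agg when one row per age group is enough, for example as input to df.plot().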
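Comments:
This partly gives the answer I was looking for. I want to try a few more things on my own with the hint you gave me, and then come back to elaborate on where it leads. Thanks for pointing me in the right direction.
If my answer was helpful, don't forget to accept it: click the check mark (✓) next to the answer to toggle it from grey to filled in. Thanks.
Perfect, I can use what you mentioned above. I just wanted to avoid doing something like avg_goals = list(df.Goals).mean() for every column, because I have far more columns than those listed above, and that would be a very laborious way to solve this. Thanks again.
I'm not sure I understand. Do you need out = df.mean() to get the mean of all numeric columns?
No, my third dataframe shown above is exactly what I want to accomplish. I'm basically trying to analyze 3 different football teams (the league champion, a mid-table team, and the bottom team) and determine whether there is a correlation between the age of a team's players and the team's ranking in the league. That's why I'm grouping by age. Then I want to compare certain statistics for these age groups across the three teams.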