Categorical Variables in a Pandas DataFrame?

Posted 2023-02-16

【Title】: Categorical Variables In A Pandas Dataframe? 【Posted】: 2014-06-20 11:19:16 【Question】:

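I'm working through Wes's Python For Data Analysis and have run into an odd problem that the book doesn't address.

In the code below, based on page 199 of his book, I create a data frame and then use pd.cut() to create cat_obj. According to the book, cat_obj is

"a special Categorical object. You can treat it like an array of strings indicating the bin name; internally it contains a levels array indicating the distinct category names along with a labeling for the ages data in the labels attribute"

Great! However, if I use exactly the same pd.cut() code (in [5] below) to create a new column of the data frame (called df['cat']), that column is not treated as a special categorical variable but just as a regular pandas Series.

So how do I create a column in a data frame that is treated as a categorical variable?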

In [4]:

import pandas as pd

raw_data = {'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
        'score': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['name', 'score'])

bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']

In [5]:
cat_obj = pd.cut(df['score'], bins, labels=group_names)
df['cat'] = pd.cut(df['score'], bins, labels=group_names)
In [7]:

type(cat_obj)
Out[7]:
pandas.core.categorical.Categorical
In [8]:

type(df['cat'])
Out[8]:
pandas.core.series.Series

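【Comments】:

All columns of a DataFrame will be Series. What behavior are you looking for that this can't provide?

Things like df['cat'].levels don't work, but cat_obj.levels does.

Possible duplicate of "How to generate pandas DataFrame column of Categorical from string column?"

You can convert it when you need to: pd.Categorical.from_array(df['cat']).levels

This is currently being worked on in pandas: github.com/pydata/pandas/pull/7217

【Answer 1】: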

This probably happens because of how the setter behaves.

An example getter and setter:

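class a:
    x = 1

    @property
    def p(self):
        # The getter always converts the stored value to int.
        return int(self.x)

    @p.setter
    def p(self, v):
        # The setter stores whatever value it is given.
        self.x = v

t = 1.32
obj = a()
obj.p = 1.32

print(type(t))      # <class 'float'>
print(type(obj.p))  # <class 'int'>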

Currently a DataFrame only accepts Series data, so its setter converts Categorical data into a Series. Categorical support in DataFrames is due in the next pandas release.
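For readers on a later release, here is a minimal sketch of how to check whether a column is categorical. It assumes pandas 0.15 or later, which this answer predates, and it uses a shortened version of the question's data. The point is that type() will always report Series for a column, so the dtype is the thing to inspect.

```python
import pandas as pd

# Shortened version of the question's data (illustrative only).
df = pd.DataFrame({'score': [25, 94, 57, 62]})
bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']

# Assumption: pandas 0.15 or later, where a column can hold categorical data.
df['cat'] = pd.cut(df['score'], bins, labels=group_names)

# The column object itself is still a Series ...
is_series = isinstance(df['cat'], pd.Series)
# ... but its dtype records that the values are categorical.
dtype_name = df['cat'].dtype.name

print(is_series)
print(dtype_name)
```

On such a version, dtype_name should be 'category' even though the container type is Series.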

【Comments】:

That explains the odd behavior, thanks.

【Answer 2】:

Currently you cannot hold categorical data in a Series or DataFrame object, but this feature will be implemented in pandas 0.15 (due in September).

【Answer 3】:

From http://pandas-docs.github.io/pandas-docs-travis/categorical.html, as of pandas 0.15:

Specify dtype="category" when constructing a Series:

In [1]: s = pd.Series(["a","b","c","a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

You can then assign it as a column of an existing DataFrame.

Or convert an existing Series or column to the category dtype:

In [3]: df = pd.DataFrame({"A":["a","b","c","a"]})

In [4]: df["B"] = df["A"].astype('category')

In [5]: df
Out[5]: 
   A  B
0  a  a
1  b  b
2  c  c
3  a  a
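Applied to the data from the question, the same idea can be sketched as follows. This assumes pandas 0.15 or later and uses the .cat accessor; to the best of my knowledge its categories and codes attributes are the later names for what the book calls levels and labels.

```python
import pandas as pd

raw_data = {
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
             'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
    'score': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
}
df = pd.DataFrame(raw_data, columns=['name', 'score'])

bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']

# On pandas 0.15+ the result of pd.cut keeps its categorical dtype
# when stored as a DataFrame column.
df['cat'] = pd.cut(df['score'], bins, labels=group_names)

# Distinct category names, in bin order.
categories = list(df['cat'].cat.categories)
# Integer position of each row's category within `categories`.
codes = df['cat'].cat.codes.tolist()

print(categories)
print(codes)
```

With the default right-closed bins, a score of 25 falls in (0, 25] and is labeled 'Low'.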
