How to convert the data structure obtained after a groupby operation on a pandas DataFrame into a DataFrame?

Posted: 2019-01-21 03:17:31

Question:

Suppose I have the dataset from the example here:

import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

I want to make a boxplot of regiment versus preTestScore. For that, I need to find the relative distribution of these two variables. So I group regiment by preTestScore:

df1 = df['regiment'].groupby(df['preTestScore']).count()
df1

preTestScore
2     3
3     3
4     2
24    2
31    2
Name: regiment, dtype: int64

If I now try to draw the boxplot, it raises an error:

import seaborn as sns
sns.boxplot(data=df1)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-131-8296ca940a25> in <module>()
      1 df1 = df['regiment'].groupby(df['preTestScore']).count()
      2 df1
----> 3 sns.boxplot(data=df1)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in boxplot(x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, dodge, fliersize, linewidth, whis, notch, ax, **kwargs)
   2209     plotter = _BoxPlotter(x, y, hue, data, order, hue_order,
   2210                           orient, color, palette, saturation,
-> 2211                           width, dodge, fliersize, linewidth)
   2212 
   2213     if ax is None:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in __init__(self, x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, dodge, fliersize, linewidth)
    439                  width, dodge, fliersize, linewidth):
    440 
--> 441         self.establish_variables(x, y, hue, data, orient, order, hue_order)
    442         self.establish_colors(color, palette, saturation)
    443 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in establish_variables(self, x, y, hue, data, orient, order, hue_order, units)
     94                 if hasattr(data, "shape"):
     95                     if len(data.shape) == 1:
---> 96                         if np.isscalar(data[0]):
     97                             plot_data = [data]
     98                         else:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    765         key = com._apply_if_callable(key, self)
    766         try:
--> 767             result = self.index.get_value(self, key)
    768 
    769             if not is_scalar(result):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
   3116         try:
   3117             return self._engine.get_value(s, k,
-> 3118                                           tz=getattr(series.dtype, 'tz', None))
   3119         except KeyError as e1:
   3120             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 0

So I convert the groupby object to a data frame and try the boxplot again:

df1 = pd.DataFrame(df1)
df1

sns.boxplot(data=df1)

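This produces a boxplot, but the distribution shown is not that of regiment versus preTestScore (in fact, this boxplot makes no sense to me; I don't know what its y-axis values represent). To get the right plot, we need to specify the x and y parameters of the boxplot. However, since the groupby object is not a data frame, that raises the following error: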

sns.boxplot(x='regiment', y='preTestScore', data=df1)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-132-fc8036eb7d0b> in <module>()
----> 1 sns.boxplot(x='regiment', y='preTestScore', data=df1)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in boxplot(x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, dodge, fliersize, linewidth, whis, notch, ax, **kwargs)
   2209     plotter = _BoxPlotter(x, y, hue, data, order, hue_order,
   2210                           orient, color, palette, saturation,
-> 2211                           width, dodge, fliersize, linewidth)
   2212 
   2213     if ax is None:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in __init__(self, x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, dodge, fliersize, linewidth)
    439                  width, dodge, fliersize, linewidth):
    440 
--> 441         self.establish_variables(x, y, hue, data, orient, order, hue_order)
    442         self.establish_colors(color, palette, saturation)
    443 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in establish_variables(self, x, y, hue, data, orient, order, hue_order, units)
    149                 if isinstance(input, string_types):
    150                     err = "Could not interpret input '{}'".format(input)
--> 151                     raise ValueError(err)
    152 
    153             # Figure out the plotting orientation

ValueError: Could not interpret input 'regiment'

We can check the data type of df1 with:

df1.dtype
>>> dtype('int64')
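
A slightly fuller inspection can make the situation clearer. The sketch below is an illustration under assumptions, not part of the original session: it rebuilds only the two relevant columns, and the names counts and wrapped are hypothetical. It shows that the aggregated result is a Series whose index is preTestScore, so wrapping it in pd.DataFrame leaves preTestScore in the index rather than in a column.

```python
import pandas as pd

# Reduced copy of the question's data (only the two columns used here).
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Same aggregation as in the question; 'counts' is a hypothetical name.
counts = df['regiment'].groupby(df['preTestScore']).count()

print(type(counts).__name__)  # the aggregated result is a Series
print(counts.index.name)      # the grouping key lives in the index
print(counts.name)            # the Series keeps the name of the counted column

# Wrapping the Series in a DataFrame keeps the key in the index,
# so 'preTestScore' is still not addressable as a column.
wrapped = pd.DataFrame(counts)
print(list(wrapped.columns))
print('preTestScore' in wrapped.columns)
```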

When I put the values from df1 into a new data frame df2 and try the boxplot again, it works:

df2 = pd.DataFrame({'preTestScore': [2,3,4,24,31], 'regiment': [3,3,2,2,2]})
df2

sns.boxplot(x='regiment', y='preTestScore', data=df2)

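So, instead of copying and pasting the contents of the groupby object into a new dataframe, how can I directly obtain a dataframe that stores the relative distribution of two variables from a dataframe?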


Answer 1:

Use to_frame to convert the Series to a DataFrame, then reset the index before plotting:

df1 = df['regiment'].groupby(df['preTestScore']).count().to_frame().reset_index()
sns.boxplot(x='regiment', y='preTestScore', data=df1)
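
For comparison, here is a sketch of two other spellings that should give the same two-column frame as the to_frame().reset_index() chain. They are alternatives, not the answer's own method: calling reset_index directly on the Series, and passing as_index=False to groupby. The variable names are hypothetical and the data is a reduced copy of the question's dataset. Plotting is left out so the snippet depends only on pandas.

```python
import pandas as pd

# Reduced copy of the question's data (only the two columns used here).
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# The chain from the answer: Series -> DataFrame -> index moved into a column.
via_to_frame = (
    df['regiment'].groupby(df['preTestScore']).count().to_frame().reset_index()
)

# Alternative 1: reset_index on a Series already returns a DataFrame.
via_reset = df['regiment'].groupby(df['preTestScore']).count().reset_index()

# Alternative 2: ask groupby to keep the key as a regular column.
via_as_index = df.groupby('preTestScore', as_index=False)['regiment'].count()

for frame in (via_to_frame, via_reset, via_as_index):
    # Each frame should expose both names as ordinary columns.
    print(list(frame.columns))
```

Any of these frames can then be passed as data= to sns.boxplot with x='regiment' and y='preTestScore', as in the answer.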

Comments:

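Unrelated, but could you explain how to use groupby with a fixed number of bins? This would be helpful when regiment has >10 values and we want to group the dataframe into 3 regiment values.

@Kristada673: Just create a new column containing the binned values and group by that column. For an example of how to do the binning, see here: ***.com/questions/45273731/…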
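
The suggestion in the last comment could look roughly like the sketch below. It is an assumption-laden illustration and not necessarily what the linked answer shows: it bins the numeric preTestScore column with pd.cut into 3 equal-width bins, and the column name score_bin, the labels low/mid/high, and the output column n_rows are all hypothetical.

```python
import pandas as pd

# Reduced copy of the question's data (only the two columns used here).
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Step 1: create a new column holding the binned values.
# bins=3 asks pd.cut for three equal-width intervals over the score range.
df['score_bin'] = pd.cut(df['preTestScore'], bins=3, labels=['low', 'mid', 'high'])

# Step 2: group by the binned column and turn the result back into a DataFrame.
# observed=False keeps bins that contain no rows, with a count of 0.
binned = (
    df.groupby('score_bin', observed=False)['regiment']
      .count()
      .reset_index(name='n_rows')
)
print(binned)
```

Equal-width bins are only one choice; pd.qcut would give bins with roughly equal numbers of rows instead.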
