How to get an R-like summary of a Pandas DataFrame?
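Different scales of measurement allow different types of operations. I would like to specify the scale of each column in a DataFrame df. Then df.describe() should take this into account.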
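Examples
- Nominal scale: a nominal scale only allows checking for equivalence. Examples are sex, names, city names. You can basically only count how often each value occurs and report the most common one (the mode).
- Ordinal scale: you can order the values, but you cannot say how far one is from another. Clothing sizes are an example. You can compute the median / min / max on this scale.
- Quantitative scale: you can compute the mean, standard deviation and quantiles on these scales.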
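As a rough illustration of how these three scales could be declared in pandas, here is a minimal sketch. It assumes an unordered `category` dtype for nominal data, an ordered `CategoricalDtype` for ordinal data, and a plain float column for quantitative data. The DataFrame `df_scales` and its column names are hypothetical and not part of the question.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration.
df_scales = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin", "Rome"],  # nominal
    "size": ["S", "L", "M", "M"],                   # ordinal
    "weight": [61.5, 80.2, 72.0, 68.3],             # quantitative
})

# Nominal: unordered category, only counting and the mode make sense.
df_scales["city"] = df_scales["city"].astype("category")

# Ordinal: ordered category, so min and max are defined.
size_dtype = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
df_scales["size"] = df_scales["size"].astype(size_dtype)

city_mode = df_scales["city"].mode()[0]
size_min = df_scales["size"].min()
size_max = df_scales["size"].max()
weight_mean = df_scales["weight"].mean()

print(city_mode, size_min, size_max, weight_mean)
```

With the order declared explicitly, pandas compares sizes by their position in the category list rather than alphabetically.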
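Code example
import pandas as pd
# Note: pandas.rpy has been removed from later pandas releases; there the
# dataset has to be loaded another way (for example with pydataset).
import pandas.rpy.common as rcom
df = rcom.load_data('mtcars')
print(df.describe())
gives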
mpg cyl disp hp drat wt
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000
mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457
min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000
25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250
50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000
75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000
max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000
qsec vs am gear carb
count 32.000000 32.000000 32.000000 32.000000 32.0000
mean 17.848750 0.437500 0.406250 3.687500 2.8125
std 1.786943 0.504016 0.498991 0.737804 1.6152
min 14.500000 0.000000 0.000000 3.000000 1.0000
25% 16.892500 0.000000 0.000000 3.000000 2.0000
50% 17.710000 0.000000 0.000000 4.000000 2.0000
75% 18.900000 1.000000 1.000000 4.000000 4.0000
max 22.900000 1.000000 1.000000 5.000000 8.0000
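This is not good, because vs is a binary variable indicating whether the car has a V-engine or a straight engine (source). The feature is therefore on a nominal scale, so min / max / std / mean do not apply. Instead, one should count how often 0 and 1 occur.
In R, you can do the following: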
mtcars$vs = factor(mtcars$vs, levels=c(0, 1), labels=c("straight engine", "V-Engine"))
mtcars$am = factor(mtcars$am, levels=c(0, 1), labels=c("Automatic", "Manual"))
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
summary(mtcars)
which gives
mpg cyl disp hp drat
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
wt qsec vs am gear carb
Min. :1.513 Min. :14.50 straight engine:18 Automatic:19 3:15 1: 7
1st Qu.:2.581 1st Qu.:16.89 V-Engine :14 Manual :13 4:12 2:10
Median :3.325 Median :17.71 5: 5 3: 3
Mean :3.217 Mean :17.85 4:10
3rd Qu.:3.610 3rd Qu.:18.90 6: 1
Max. :5.424 Max. :22.90 8: 1
Is there something similar in pandas?
I tried
df["vs"] = df["vs"].astype('category')
but this makes "vs" disappear from the description.
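For context, on a DataFrame that mixes dtypes, describe() summarizes only the numeric columns by default, which is why a `category` column drops out. The sketch below shows the `include` parameter of `DataFrame.describe` bringing it back. The small frame `df_demo` is a hypothetical stand-in with a few illustrative rows, not the full mtcars data.

```python
import pandas as pd

# Hypothetical stand-in for a few mtcars-like rows (illustration only).
df_demo = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3],
    "vs": [0, 1, 0, 0],
})
df_demo["vs"] = df_demo["vs"].astype("category")

# Default: only numeric columns are summarized, so "vs" is dropped.
numeric_summary = df_demo.describe()

# Ask explicitly for the categorical columns (count / unique / top / freq).
category_summary = df_demo.describe(include=["category"])

# Or summarize every column; statistics that do not apply show up as NaN.
full_summary = df_demo.describe(include="all")

print(category_summary)
```

Note that this reports only the most frequent category and its frequency, not the full per-category counts that R's summary() prints.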
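A bit late, but I recently ran into some of the same issues, so I thought I would share my take on this challenge.
In my opinion, R is still better at handling categorical variables. However, there are ways to mimic this functionality in Python with pd.Categorical(), pd.get_dummies() and describe().
The challenge with this particular dataset is that the categorical variables have very different properties. For example, am is 0 or 1 for automatic or manual gears, respectively. And gear is either 3, 4, or 5, but is still most reasonably treated as categorical rather than numeric. So for am I would replace 0 and 1 with 'automatic' and 'manual', but for gear I would apply pd.get_dummies() to get a 0 or 1 for each gear category, which makes it easy to count how many models have, for example, 3 gears.
I have had a utility function lying around for a while, which I improved a bit yesterday. It is certainly not the most elegant, but it should give you the same information as the R snippet. The final output table would consist of columns with an unequal number of rows. Rather than building such a table as a single DataFrame and padding it with NaN, I split the information in two: one table for the numeric values and another for the categorical values, so you end up with this: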
count
Straight Engine 18
V engine 14
automatic 19
manual 13
cyl_4 11
cyl_6 7
cyl_8 14
gear_3 15
gear_4 12
gear_5 5
carb_1 7
carb_2 10
carb_3 3
carb_4 10
carb_6 1
carb_8 1
mpg disp hp drat wt qsec
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000
mean 20.090625 230.721875 146.687500 3.596563 3.217250 17.848750
std 6.026948 123.938694 68.562868 0.534679 0.978457 1.786943
min 10.400000 71.100000 52.000000 2.760000 1.513000 14.500000
25% 15.425000 120.825000 96.500000 3.080000 2.581250 16.892500
50% 19.200000 196.300000 123.000000 3.695000 3.325000 17.710000
75% 22.800000 326.000000 180.000000 3.920000 3.610000 18.900000
max 33.900000 472.000000 335.000000 4.930000 5.424000 22.900000
Here is the whole procedure for an easy copy and paste:
# imports
import pandas as pd

# To easily access R datasets:
# pip install pydataset
from pydataset import data

# Load dataset
df_mtcars = data('mtcars')

# The variables cat, dum, num and recoding are passed to the function
# describeCat(df, dummies, recode, categorical) below.
# Specify which variables are dummy variables [0 or 1],
# categorical [multiple categories] or numeric
cat = ['cyl', 'gear', 'carb']
dum = ['vs', 'am']
num = [c for c in list(df_mtcars) if c not in cat + dum]

# Also, define a dictionary that describes how some dummy variables should be recoded.
# For example, in the series am, 0 is recoded as automatic and 1 as manual gears
recoding = {'am': ['automatic', 'manual'], 'vs': ['Straight Engine', 'V engine']}

# The function:
def describeCat(df, dummies, recode, categorical):
    """ Retrieves specified dummy and categorical variables
    from a pandas DataFrame and describes them (just count for now).
    Dummy variables [0 or 1] can be recoded to categorical variables
    by specifying a dictionary

    Keyword arguments:
    df -- pandas DataFrame
    dummies -- list of column names to specify dummy variables [0 or 1]
    recode -- dictionary to specify which and how dummy variables should be recoded
    categorical -- list of column names to specify categorical variables
    """
    # Recode dummy variables and count each label
    recoded_counts = []
    for dummy in dummies:
        if dummy in recode:
            # Map 0 / 1 to the two labels given in the recode dictionary
            labels = df[dummy].map({0: recode[dummy][0], 1: recode[dummy][1]})
            recoded_counts.append(pd.Categorical(labels).describe()['counts'])

    # Collect the counts in one DataFrame with a single 'count' column
    if recoded_counts:
        df_recoded = pd.concat(recoded_counts).astype(int).to_frame('count')
    else:
        df_recoded = pd.DataFrame(columns=['count'])

    # Since categorical variables will be transformed into dummy variables,
    # all remaining dummy variables (after recoding) can be treated the
    # same way as the categorical variables
    unrecoded = [var for var in dummies if var not in recode]
    categorical = categorical + unrecoded

    # Categorical split into dummy variables will have the same index
    # as the original dataframe
    allCats = pd.DataFrame(index=df.index)

    # Apply pd.get_dummies on all categorical variables
    for cat in categorical:
        newCats = pd.get_dummies(pd.Categorical(df[cat]), prefix=cat)
        newCats.index = df.index
        allCats = pd.concat([allCats, newCats], axis=1)
    df_cat = allCats.sum().to_frame()
    df_cat.columns = ['count']

    # Gather output dataframes
    df_output = pd.concat([df_recoded, df_cat], axis=0)
    return df_output

# Test run: build a dataframe that describes the dummy and categorical variables
df_categorical = describeCat(df=df_mtcars, dummies=dum, recode=recoding, categorical=cat)

# Describe numerical variables
df_numerical = df_mtcars[num].describe()

print(df_categorical)
print(df_numerical)
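A side note on categorical variables and describe():
The reason I use pd.Categorical() in the function above is that the output of describe() seems a bit unstable. Sometimes df_mtcars['gear'].astype('category').describe() returns: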
count 32.000000
mean 3.687500
std 0.737804
min 3.000000
25% 3.000000
50% 4.000000
75% 4.000000
max 5.000000
Name: gear, dtype: float64
whereas, since it is treated as a categorical variable, it should return:
count 32
unique 3
top 3
freq 15
Name: gear, dtype: int64
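I may be wrong here, and I have had trouble reproducing the issue, but I could swear this happens from time to time.
Using describe() on a pd.Categorical() gives output in its own format, but at least it seems to be stable.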
counts freqs
categories
3 15 0.46875
4 12 0.37500
5 5 0.15625
A final word on pd.get_dummies()
Here is what happens when you apply that function to df_mtcars['gear']:
# code
pd.get_dummies(df_mtcars['gear'].astype('category'), prefix = 'gear')
# output
gear_3 gear_4 gear_5
Mazda RX4 0 1 0
Mazda RX4 Wag 0 1 0
Datsun 710 0 1 0
Hornet 4 Drive 1 0 0
Hornet Sportabout 1 0 0
Valiant 1 0 0
.
.
.
Ferrari Dino 0 0 1
Maserati Bora 0 0 1
Volvo 142E 0 1 0
But in this case I would simply use value_counts(), so that you get the following:
counts freqs
categories
3 15 0.46875
4 12 0.37500
5 5 0.15625
This also happens to resemble the output of calling describe() on a pd.Categorical() variable.
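A table of counts and frequencies like the one above can be assembled from two value_counts() calls. This is only a sketch: the series `gear` is a toy reconstruction with the same 15 / 12 / 5 distribution reported above, and the column labels `counts` and `freqs` are set by hand to mirror that layout.

```python
import pandas as pd

# Toy series with the same gear distribution as reported above (15 / 12 / 5).
gear = pd.Series([3] * 15 + [4] * 12 + [5] * 5, name="gear")

# Absolute counts and relative frequencies per category, sorted by category.
counts = gear.value_counts().sort_index()
freqs = gear.value_counts(normalize=True).sort_index()

# Combine both into one table; the labels are chosen manually.
gear_summary = pd.DataFrame({"counts": counts, "freqs": freqs})
gear_summary.index.name = "categories"

print(gear_summary)
```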
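I ran into the same problem. df.describe() works for numeric values.
To count the values in each category, I wrote this code:
for category in df.columns:
    print('\n', category)
    # Group once per column and reuse the result
    groups = df.groupby(category).groups
    for typ in groups:
        print(typ, ' ', len(groups[typ]))
I hope it helps :)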