Creating a summary statistics table from multiple categories in a variable
I have a data frame that looks like this:
ID  category                         Household Income  Tercile
1   unmarried couple                 100,000           Middle
2   married couple                   150,000           Bottom
3   single Female head of Household  90,000            Top
4   single Male Head of Household    80,000            Bottom
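I want to create a summary statistics table showing the standard deviation, mean, minimum, maximum, and median of household income, grouped by category and by tercile.
I was able to produce a table like this for one of the categories. Here is the code for unmarried couples:
First, I isolated the category from the full data frame and dropped the variables I don't need: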
status_unmarried <- merged_data %>%
  select(-(person_id:is_college_graduate)) %>%
  select(-(is_urban:is_owner_of_home)) %>%
  filter(category == 'unmarried couple') %>%
  group_by(hh_income, tercile_of_census_tract_income) %>%
  distinct(hh_id, .keep_all = TRUE)
Then I generated the required summary statistics:
library(dplyr)
library(arsenal)  # tableby() comes from the arsenal package, not dplyr
table_one <- tableby(tercile_of_census_tract_income ~ ., data = status_unmarried)
summary(table_one, title = "Unmarried households")
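I could repeat this process for the remaining three categories. However, I would rather generate the table in a single code block that covers all categories, instead of building each table separately by category. The table or data frame would look like this: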
               Unmarried Couple   Married Couple   Single Female Head   Single Male Head
Bottom
  Mean
  Median
  Min
  Max
  SD
  Sample Size
Middle
  Mean
  Median
  Min
  Max
  SD
  Sample Size
Top
  Mean
  Median
  Min
  Max
  SD
  Sample Size
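Sample size is the number of households in each category. So I want the columns to be the categories and each row to be a statistic, further broken down by tercile. I would like to end up with a data frame or summary table holding these results.
Thanks in advance!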
Consider nested calls to base R's by, which gives a console report with section breaks and headers:
tercile_agg_df_list <- by(random_df, random_df$Tercile, function(sub_df) {
  by_list <- by(sub_df, sub_df$category, function(core_df)
    with(core_df,
         c(mean = mean(Household_Income), median = median(Household_Income),
           min = min(Household_Income), max = max(Household_Income),
           sd = sd(Household_Income), sample_size = length(Household_Income))
    )
  )
  t(do.call(rbind, by_list))
})
tercile_agg_df_list
# random_df$Tercile: Bottom
# Married Couple Single Female Head Single Male Head Unmarried Couple
# mean 44632.894 50204.52677 58095.923 52521.3178
# median 49678.238 50042.54136 62158.775 51933.3694
# min 1989.695 95.23595 6220.779 676.9893
# max 95896.827 98471.19979 98317.740 94795.6344
# sd 29246.103 31317.47006 25728.368 28013.6172
# sample_size 35.000 56.00000 44.000 39.0000
# ----------------------------------------------------------------------------------
# random_df$Tercile: Middle
# Married Couple Single Female Head Single Male Head Unmarried Couple
# mean 56302.818 54845.140 42645.032 48222.93
# median 63245.388 51364.262 39126.608 49713.41
# min 2690.053 5286.126 3687.153 3430.90
# max 99327.726 99216.564 98645.000 98400.38
# sd 28582.935 32262.149 29996.185 28485.63
# sample_size 42.000 44.000 38.000 44.00
# ----------------------------------------------------------------------------------
# random_df$Tercile: Top
# Married Couple Single Female Head Single Male Head Unmarried Couple
# mean 51437.876 45495.1326 55150.495621 44958.808
# median 54592.978 42051.5708 56452.659052 45982.775
# min 3917.729 376.2815 1.451327 1216.967
# max 99638.078 95885.3950 99429.982156 99412.446
# sd 27627.480 26643.9194 30690.131884 29713.131
# sample_size 46.000 39.0000 31.000000 42.000
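The report above is a list-like by object rather than a data frame. If a single data frame is wanted, as the question asks, one possible sketch is to stack the per-tercile matrices and carry the tercile and statistic names along as columns. The names summary_df, Tercile, and Statistic are illustrative assumptions, and the data is regenerated here only so the block runs on its own.

```r
# Recreate the sample data from the "Data" section so this sketch is self-contained.
set.seed(4242019)
categs <- c("Unmarried Couple", "Married Couple", "Single Female Head", "Single Male Head")
random_df <- data.frame(
  category = sample(categs, 500, replace = TRUE),
  Tercile = sample(c("Bottom", "Middle", "Top"), 500, replace = TRUE),
  Household_Income = runif(500) * 10E4
)

# Same nested by() call as in the answer: one statistic-by-category matrix per tercile.
tercile_agg_df_list <- by(random_df, random_df$Tercile, function(sub_df) {
  by_list <- by(sub_df, sub_df$category, function(core_df)
    with(core_df,
         c(mean = mean(Household_Income), median = median(Household_Income),
           min = min(Household_Income), max = max(Household_Income),
           sd = sd(Household_Income), sample_size = length(Household_Income)))
  )
  t(do.call(rbind, by_list))
})

# Stack the matrices into one data frame (summary_df is a hypothetical name).
# check.names = FALSE keeps the category names with spaces as column names;
# row.names = NULL moves the statistic labels out of the row names.
summary_df <- do.call(rbind, lapply(names(tercile_agg_df_list), function(tercile) {
  m <- tercile_agg_df_list[[tercile]]
  data.frame(Tercile = tercile, Statistic = rownames(m), m,
             check.names = FALSE, row.names = NULL)
}))

head(summary_df)
```

The result has one row per tercile and statistic, with one column per category, which matches the layout sketched in the question.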
Data
set.seed(4242019)
categs <- c("Unmarried Couple", "Married Couple", "Single Female Head", "Single Male Head")
random_df <- data.frame(
category = sample(categs, 500, replace=TRUE),
Tercile = sample(c("Bottom", "Middle", "Top"), 500, replace=TRUE),
Household_Income = runif(500) * 10E4
)
head(random_df)
# category Tercile Household_Income
# 1 Unmarried Couple Bottom 70118.908
# 2 Married Couple Top 24069.175
# 3 Unmarried Couple Top 1216.967
# 4 Unmarried Couple Bottom 47936.147
# 5 Married Couple Top 80633.299
# 6 Married Couple Top 46136.093
Try this code using the data.table package. You may first need to convert your data frame to a data.table with the as.data.table function. Assuming the table is named dt:
dt[, .(Min = min(Income), First_Quartile = quantile(Income, 0.25),
       Median = quantile(Income, 0.5), Mean = mean(Income),
       Third_Quartile = quantile(Income, 0.75),
       Max = max(Income)),
   by = .(Category, Tercile)]
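This produces the table in a different layout (one row per category and tercile combination), but I think it is better organized.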
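If the layout from the question is still preferred (categories as columns, statistics as rows within each tercile), the long data.table result can be reshaped. The following is a minimal sketch, assuming the column names Category, Tercile, and Income used in the answer above; the sample data and the names long_stats, molten, and wide_stats are hypothetical. It computes the statistics the question lists (mean, median, min, max, SD, sample size) rather than the quartiles.

```r
library(data.table)

# Hypothetical sample data; column names follow the data.table answer above.
set.seed(1)
dt <- data.table(
  Category = sample(c("Unmarried Couple", "Married Couple",
                      "Single Female Head", "Single Male Head"), 500, replace = TRUE),
  Tercile  = sample(c("Bottom", "Middle", "Top"), 500, replace = TRUE),
  Income   = runif(500) * 1e5
)

# Long summary: one row per Category x Tercile combination.
# .N is converted to numeric so every statistic column has the same type for melt().
long_stats <- dt[, .(Mean = mean(Income), Median = median(Income),
                     Min = min(Income), Max = max(Income),
                     SD = sd(Income), Sample_Size = as.numeric(.N)),
                 by = .(Category, Tercile)]

# melt() gives one row per statistic; dcast() then spreads categories into columns.
molten <- melt(long_stats, id.vars = c("Category", "Tercile"),
               variable.name = "Statistic", value.name = "Value")
wide_stats <- dcast(molten, Tercile + Statistic ~ Category, value.var = "Value")

wide_stats
```

Keeping the long table as the intermediate step means either layout is available: long_stats for further grouping or plotting, wide_stats for reporting.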