Creating a summary statistics table from multiple categories in a variable


I have a data frame that looks like this:

 ID    category                          Household Income     Tercile   
  1     unmarried couple                    100,000             Middle
  2     married couple                      150,000             Bottom
  3     single Female head of Household     90,000              Top
  4     single Male Head of Household       80,000              Bottom

I want to create a summary statistics table showing the SD, mean, minimum, maximum, and median of household income, grouped by each category and tercile.

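I was able to produce a table like this for one of the categories. Here is the code for unmarried couples:

First, I separated that category out of the full data frame and dropped the variables I don't need: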

library(dplyr)

status_unmarried <- merged_data %>% 
  select(-(person_id:is_college_graduate)) %>%
  select(-(is_urban:is_owner_of_home)) %>%
  filter(category == 'unmarried couple') %>%
  group_by(hh_income, tercile_of_census_tract_income) %>% 
  distinct(hh_id, .keep_all = TRUE)

Then I generated the necessary summary statistics:

library(arsenal)  # tableby() comes from the arsenal package
table_one <- tableby(tercile_of_census_tract_income ~ ., data = status_unmarried)
summary(table_one, title = "Unmarried households")

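I could repeat this process for the remaining three categories. However, I would rather produce the table by aggregating all categories in a single code block instead of building each table separately by category. The table or data frame would look like this: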

        Unmarried Couple   Married Couple  Single Female Head Single Male Head

Bottom
Mean
Median
Min
Max
SD
Sample Size

Middle
Mean
Median
Min
Max
SD
Sample Size

Top
Mean
Median
Min
Max
SD
Sample Size

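Sample size is the number of households in each category. So I want the columns to be the categories and each row to be a statistic, further broken down by tercile. I would like to create a data frame or summary table with these results.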

Thanks in advance!!

Answer

Consider nested calls to base R's by, which produces a console report with section breaks and headers:

tercile_agg_df_list <- by(random_df, random_df$Tercile, function(sub_df) {
   by_list <- by(sub_df, sub_df$category, function(core_df)          
     with(core_df,
          c(mean = mean(Household_Income),  median = median(Household_Income), 
            min = min(Household_Income), max = max(Household_Income),
            sd = sd(Household_Income), sample_size = length(Household_Income))
         )
     )       
   t(do.call(rbind, by_list))
})

tercile_agg_df_list
# random_df$Tercile: Bottom
#             Married Couple Single Female Head Single Male Head Unmarried Couple
# mean             44632.894        50204.52677        58095.923       52521.3178
# median           49678.238        50042.54136        62158.775       51933.3694
# min               1989.695           95.23595         6220.779         676.9893
# max              95896.827        98471.19979        98317.740       94795.6344
# sd               29246.103        31317.47006        25728.368       28013.6172
# sample_size         35.000           56.00000           44.000          39.0000
# ---------------------------------------------------------------------------------- 
# random_df$Tercile: Middle
#             Married Couple Single Female Head Single Male Head Unmarried Couple
# mean             56302.818          54845.140        42645.032         48222.93
# median           63245.388          51364.262        39126.608         49713.41
# min               2690.053           5286.126         3687.153          3430.90
# max              99327.726          99216.564        98645.000         98400.38
# sd               28582.935          32262.149        29996.185         28485.63
# sample_size         42.000             44.000           38.000            44.00
# ---------------------------------------------------------------------------------- 
# random_df$Tercile: Top
#             Married Couple Single Female Head Single Male Head Unmarried Couple
# mean             51437.876         45495.1326     55150.495621        44958.808
# median           54592.978         42051.5708     56452.659052        45982.775
# min               3917.729           376.2815         1.451327         1216.967
# max              99638.078         95885.3950     99429.982156        99412.446
# sd               27627.480         26643.9194     30690.131884        29713.131
# sample_size         46.000            39.0000        31.000000           42.000
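
If a single data frame is preferred over the console report, the per-tercile matrices can be stacked with an identifying column. The following is only a sketch under some assumptions: `toy_df` is small made-up data that reuses the column names of `random_df`, `stats` is a hypothetical helper, and `split`/`lapply` is swapped in for `by` so that the intermediate result is a plain named list.

```r
# Toy data (made-up values) with the same column names as random_df
toy_df <- data.frame(
  category = rep(c("Married Couple", "Unmarried Couple"), times = 4),
  Tercile = rep(c("Bottom", "Top"), each = 4),
  Household_Income = c(10, 20, 30, 40, 50, 60, 70, 80) * 1000
)

# Hypothetical helper: the six statistics requested in the question
stats <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x),
    max = max(x), sd = sd(x), sample_size = length(x))
}

# One matrix per tercile: statistics in rows, categories in columns
by_tercile <- lapply(split(toy_df, toy_df$Tercile), function(sub_df) {
  sapply(split(sub_df$Household_Income, sub_df$category), stats)
})

# Stack the matrices, labelling each row with its tercile and statistic
summary_df <- do.call(rbind, lapply(names(by_tercile), function(tercile) {
  m <- by_tercile[[tercile]]
  data.frame(Tercile = tercile, Statistic = rownames(m), m,
             check.names = FALSE)
}))
rownames(summary_df) <- NULL

summary_df
```

The result has one row per tercile and statistic and one column per category, which matches the layout requested in the question.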

Data

set.seed(4242019)
categs <- c("Unmarried Couple", "Married Couple", "Single Female Head", "Single Male Head")

random_df <- data.frame(
  category = sample(categs, 500, replace=TRUE),
  Tercile = sample(c("Bottom", "Middle", "Top"), 500, replace=TRUE),
  Household_Income = runif(500) * 10E4
)

head(random_df)
#           category Tercile Household_Income
# 1 Unmarried Couple  Bottom        70118.908
# 2   Married Couple     Top        24069.175
# 3 Unmarried Couple     Top         1216.967
# 4 Unmarried Couple  Bottom        47936.147
# 5   Married Couple     Top        80633.299
# 6   Married Couple     Top        46136.093
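Another answer

Try this code using the data.table package. You may have to convert your data frame to a data.table with the as.data.table function. Assuming the data.table is named dt: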

dt[, .(Min=min(Income), First_quartile=quantile(Income, 0.25),
   Median=quantile(Income, 0.5), Mean=mean(Income),
   Third_Quartile=quantile(Income, 0.75),
   Max=max(Income)) ,
by=.(Category, Tercile)]

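This produces the table in a different layout (one row per Category and Tercile combination), but I think it is more organized.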
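
For completeness, here is a self-contained sketch of the same data.table approach that also includes the SD and sample size from the question, and then reshapes the result so that categories become columns. The data in `toy_dt` are made up, and the column names `Income`, `Category`, and `Tercile` are assumed to match the snippet above; adjust them to your own data.

```r
library(data.table)

# Made-up data; as.data.table() converts an ordinary data frame
toy_dt <- as.data.table(data.frame(
  Category = rep(c("Married Couple", "Unmarried Couple"), times = 4),
  Tercile = rep(c("Bottom", "Top"), each = 4),
  Income = c(10, 20, 30, 40, 50, 60, 70, 80) * 1000
))

# Long layout: one row per Category/Tercile combination
# (.N is converted to numeric so every statistic has the same type)
long_stats <- toy_dt[, .(Mean = mean(Income), Median = median(Income),
                         Min = min(Income), Max = max(Income),
                         SD = sd(Income), Sample_Size = as.numeric(.N)),
                     by = .(Category, Tercile)]

# Stack the statistic columns into Statistic/Value pairs
molten <- melt(long_stats, id.vars = c("Category", "Tercile"),
               variable.name = "Statistic", value.name = "Value")

# Wide layout: one row per Tercile and Statistic, one column per Category
wide_stats <- dcast(molten, Tercile + Statistic ~ Category,
                    value.var = "Value")

wide_stats
```

Keeping `long_stats` is convenient for further filtering or plotting, while `wide_stats` is closer to the table sketched in the question.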
