Calculating the Medians and Means of Rows (in R)
【Posted】: 2021-12-14 02:02:09
【Question】: I am using the R programming language. Suppose I have the following data ("my_data"):
student first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run
1 student1 19.70847 21.79771 16.49083 19.51691 13.97987 14.60733 13.89703 15.24651 20.75679 18.44020
2 student2 11.22369 15.36253 16.90215 20.20724 15.90227 15.14539 13.74945 18.30090 19.55124 17.24132
3 student3 15.93649 17.03599 14.20214 13.17548 14.70327 15.49697 13.08945 19.94142 22.41674 17.37958
4 student4 16.18733 15.13197 14.79481 16.75177 14.51287 17.71816 13.45054 14.25553 19.89091 18.88981
5 student5 18.71084 18.85453 17.15864 19.38880 15.68862 18.39169 15.26428 16.04526 18.92532 16.62409
6 student6 19.75246 12.74605 18.52214 17.92626 14.48501 17.20780 13.10512 12.46502 20.68583 15.87711
7 student7 14.75144 23.82376 18.51366 20.77424 14.22155 16.08186 12.95981 12.67820 20.12166 15.66006
8 student8 17.06516 15.63075 13.72026 15.02068 14.21098 15.99414 14.64818 16.15603 21.74607 17.07382
9 student9 20.27611 12.44592 12.26502 15.13456 14.61552 18.72192 15.11129 17.60746 18.83831 17.55257
10 student10 17.70736 16.21620 14.10861 17.20014 16.59376 19.50027 13.05073 15.80002 18.09781 18.34313
I would like to add 2 columns to this data:
my_mean: the mean of each row
my_median: the median of each row
I tried the following code in R:
my_data$median = apply(my_data, 1, median, na.rm=T)
my_data$mean = apply(my_data, 1, mean, na.rm=T)
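But I don't think this code is correct. For example, with this code the median of the second row is returned as "16.90215".
But when I take the median of this row manually: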
median(11.22369 , 15.36253 , 16.90215 , 20.20724, 15.90227 , 15.14539 , 13.74945 , 18.30090 , 19.55124 , 17.24132)
I get the answer
11.22
Can someone tell me what I am doing wrong?
Thanks
【Comments】:
Your first column is character or factor. You need to remove that column from the calculation, i.e. apply(my_data[-1], 1, median, na.rm=TRUE)
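【Answer 1】:
The manual calculation is incorrect. The first argument of median is 'x', which can be a vector. The second argument is na.rm, followed by the variadic argument (...). So when you write 11.22369, 15.36253, and so on as separate arguments, 'x' is taken to be 11.22369 alone (15.36253 is matched to na.rm and the rest are absorbed by the dots), and 11.22369 is the value returned. Instead, the values should be concatenated into a single vector with c: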
median(c(11.22369 , 15.36253 , 16.90215 , 20.20724, 15.90227 , 15.14539 , 13.74945 , 18.30090 , 19.55124 , 17.24132))
[1] 16.40221
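To make the argument matching concrete, here is a small sketch that runs both calls side by side. It assumes an R version in which median() accepts extra arguments through the dots, as described above; the variable names are only for illustration.

```r
# Separate arguments: only the first value is matched to x,
# the second is matched to na.rm, and the rest go into the dots.
separate_args <- median(11.22369, 15.36253, 16.90215, 20.20724, 15.90227,
                        15.14539, 13.74945, 18.30090, 19.55124, 17.24132)

# One vector built with c(): all ten values are matched to x.
one_vector <- median(c(11.22369, 15.36253, 16.90215, 20.20724, 15.90227,
                       15.14539, 13.74945, 18.30090, 19.55124, 17.24132))

print(separate_args)  # the first value only
print(one_vector)     # the median of all ten values
```

The first call does not fail or warn, which is why the mistake is easy to miss.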
In addition, given the OP's data, the first column, which is character or factor, should be dropped:
apply(my_data[-1], 1, median, na.rm=TRUE)
1 2 3 4 5 6 7 8 9 10
17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695
The second value (16.40221) is the median of row 2, the row used in the manual calculation.
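The reason the original apply() call misbehaves can be seen in a short sketch. It uses a cut-down toy version of my_data (two students, three runs) rather than the full table, and the helper names as_char and runs are made up for illustration. apply() converts a data frame to a matrix first, and a matrix holds a single type, so one character column turns every value into a string.

```r
# Toy version of my_data: two students, three runs (not the full data set)
my_data <- data.frame(
  student    = c("student1", "student2"),
  first_run  = c(19.70847, 11.22369),
  second_run = c(21.79771, 15.36253),
  third_run  = c(16.49083, 16.90215)
)

# With the character column included, the whole matrix becomes character
as_char <- as.matrix(my_data)
print(typeof(as_char))

# Take the numeric columns once, before adding any new columns,
# so the new columns cannot leak into the later calculation
runs <- my_data[-1]
my_data$my_mean   <- rowMeans(runs, na.rm = TRUE)
my_data$my_median <- apply(runs, 1, median, na.rm = TRUE)

print(my_data)
```

Computing both statistics from the same runs object also avoids a second pitfall in the question's code, where the mean is taken after the median column has already been appended to my_data.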
【Answer 2】:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(median = median(c_across(where(is.numeric))),
         mean = mean(c_across(where(is.numeric))))
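c_across and rowwise were created for exactly this situation. Most dplyr verbs work column-wise, so first change this behaviour by piping into rowwise.
c_across then combines all the numeric values in a row (hence where(is.numeric)) into a single numeric vector, to which mean or median can be applied.
Note: since rowwise creates a data frame grouped by row, you may want to pipe the output into ungroup.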
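A variant of the same idea, sketched on a toy data frame with only three runs, adds the ungroup step from the note. It also selects the run columns by name instead of by type. That choice is an assumption on my part: I believe where(is.numeric) is evaluated against the current columns, so it could pick up a numeric column created earlier in the same mutate call, and selecting by name sidesteps the question. The name result is made up for illustration.

```r
library(dplyr)

# Toy data with the same layout as the question (three runs only)
df <- data.frame(
  student    = c("student1", "student2"),
  first_run  = c(19.70847, 11.22369),
  second_run = c(21.79771, 15.36253),
  third_run  = c(16.49083, 16.90215)
)

result <- df %>%
  rowwise() %>%
  mutate(
    # first_run:third_run names the run columns explicitly
    median = median(c_across(first_run:third_run)),
    mean   = mean(c_across(first_run:third_run))
  ) %>%
  ungroup()  # drop the row-wise grouping again

print(result)
```

For the full data set the range would be first_run:tenth_run.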
【Answer 3】:
Here is an alternative using pmap, which passes all the arguments at once, hence the ellipsis (...). The output needs to be unnested with unnest_wider from tidyr:
library(tidyr)
library(dplyr)
library(purrr)
df %>%
  mutate(res = pmap(across(where(is.numeric)),
                    ~ list(median = median(c(...)),
                           avg = mean(c(...))))) %>%
  unnest_wider(res)
Output:
student first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run median avg
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 student1 19.7 21.8 16.5 19.5 14.0 14.6 13.9 15.2 20.8 18.4 17.5 17.4
2 student2 11.2 15.4 16.9 20.2 15.9 15.1 13.7 18.3 19.6 17.2 16.4 16.4
3 student3 15.9 17.0 14.2 13.2 14.7 15.5 13.1 19.9 22.4 17.4 15.7 16.3
4 student4 16.2 15.1 14.8 16.8 14.5 17.7 13.5 14.3 19.9 18.9 15.7 16.2
5 student5 18.7 18.9 17.2 19.4 15.7 18.4 15.3 16.0 18.9 16.6 17.8 17.5
6 student6 19.8 12.7 18.5 17.9 14.5 17.2 13.1 12.5 20.7 15.9 16.5 16.3
7 student7 14.8 23.8 18.5 20.8 14.2 16.1 13.0 12.7 20.1 15.7 15.9 17.0
8 student8 17.1 15.6 13.7 15.0 14.2 16.0 14.6 16.2 21.7 17.1 15.8 16.1
9 student9 20.3 12.4 12.3 15.1 14.6 18.7 15.1 17.6 18.8 17.6 16.3 16.3
10 student10 17.7 16.2 14.1 17.2 16.6 19.5 13.1 15.8 18.1 18.3 16.9 16.7
【Answer 4】:
You will definitely benefit from the speed of the matrixStats package.
matrixStats::rowMedians(as.matrix(d[-1]))
# [1] 17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695
matrixStats::rowMeans2(as.matrix(d[-1]))
# [1] 17.44417 16.35862 16.33775 16.15837 17.50521 16.27728 16.95862 16.12661 16.25687 16.66180
# Check that both approaches agree with the apply() versions
stopifnot(all.equal(matrixStats::rowMedians(as.matrix(d[-1])),
                    as.numeric(apply(d[-1], 1, median, na.rm=TRUE))))
stopifnot(all.equal(matrixStats::rowMeans2(as.matrix(d[-1])),
                    as.numeric(apply(d[-1], 1, mean, na.rm=TRUE))))
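To get a feel for the speed difference on your own machine, a rough sketch along these lines may help. The matrix size and seed are arbitrary choices of mine, the timings will vary between machines, and the matrixStats part only runs if that package is installed.

```r
set.seed(1)
# 10,000 rows by 10 columns of random values (arbitrary size for illustration)
m <- matrix(rnorm(1e5), nrow = 1e4)

# Base R: rowMeans() is built in; row medians go through apply()
base_means   <- rowMeans(m)
base_medians <- apply(m, 1, median)

if (requireNamespace("matrixStats", quietly = TRUE)) {
  fast_medians <- matrixStats::rowMedians(m)
  # Both approaches should agree before the timings are compared
  stopifnot(isTRUE(all.equal(fast_medians, base_medians)))
  print(system.time(apply(m, 1, median)))
  print(system.time(matrixStats::rowMedians(m)))
}
```

For a table as small as the one in the question the difference is unlikely to matter; it becomes more relevant with many rows.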
Data:
d <- structure(list(student = c("student1", "student2", "student3",
"student4", "student5", "student6", "student7", "student8", "student9",
"student10"), first_run = c(19.70847, 11.22369, 15.93649, 16.18733,
18.71084, 19.75246, 14.75144, 17.06516, 20.27611, 17.70736),
second_run = c(21.79771, 15.36253, 17.03599, 15.13197, 18.85453,
12.74605, 23.82376, 15.63075, 12.44592, 16.2162), third_run = c(16.49083,
16.90215, 14.20214, 14.79481, 17.15864, 18.52214, 18.51366,
13.72026, 12.26502, 14.10861), fourth_run = c(19.51691, 20.20724,
13.17548, 16.75177, 19.3888, 17.92626, 20.77424, 15.02068,
15.13456, 17.20014), fifth_run = c(13.97987, 15.90227, 14.70327,
14.51287, 15.68862, 14.48501, 14.22155, 14.21098, 14.61552,
16.59376), sixth_run = c(14.60733, 15.14539, 15.49697, 17.71816,
18.39169, 17.2078, 16.08186, 15.99414, 18.72192, 19.50027
), seventh_run = c(13.89703, 13.74945, 13.08945, 13.45054,
15.26428, 13.10512, 12.95981, 14.64818, 15.11129, 13.05073
), eight_run = c(15.24651, 18.3009, 19.94142, 14.25553, 16.04526,
12.46502, 12.6782, 16.15603, 17.60746, 15.80002), ninth_run = c(20.75679,
19.55124, 22.41674, 19.89091, 18.92532, 20.68583, 20.12166,
21.74607, 18.83831, 18.09781), tenth_run = c(18.4402, 17.24132,
17.37958, 18.88981, 16.62409, 15.87711, 15.66006, 17.07382,
17.55257, 18.34313)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))