Calculating the Medians and Means of Rows (in R)

Posted 2023-03-24


【Title】: Calculating the Medians and Means of Rows (in R) 【Posted】: 2021-12-14 02:02:09 【Question】:

I am working with the R programming language. Suppose I have the following data ("my_data"):

   student first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run
1   student1  19.70847   21.79771  16.49083   19.51691  13.97987  14.60733    13.89703  15.24651  20.75679  18.44020
2   student2  11.22369   15.36253  16.90215   20.20724  15.90227  15.14539    13.74945  18.30090  19.55124  17.24132
3   student3  15.93649   17.03599  14.20214   13.17548  14.70327  15.49697    13.08945  19.94142  22.41674  17.37958
4   student4  16.18733   15.13197  14.79481   16.75177  14.51287  17.71816    13.45054  14.25553  19.89091  18.88981
5   student5  18.71084   18.85453  17.15864   19.38880  15.68862  18.39169    15.26428  16.04526  18.92532  16.62409
6   student6  19.75246   12.74605  18.52214   17.92626  14.48501  17.20780    13.10512  12.46502  20.68583  15.87711
7   student7  14.75144   23.82376  18.51366   20.77424  14.22155  16.08186    12.95981  12.67820  20.12166  15.66006
8   student8  17.06516   15.63075  13.72026   15.02068  14.21098  15.99414    14.64818  16.15603  21.74607  17.07382
9   student9  20.27611   12.44592  12.26502   15.13456  14.61552  18.72192    15.11129  17.60746  18.83831  17.55257
10 student10  17.70736   16.21620  14.10861   17.20014  16.59376  19.50027    13.05073  15.80002  18.09781  18.34313

I would like to add 2 columns to this data:

my_mean: the mean of each row
my_median: the median of each row

I tried the following code in R:

my_data$median = apply(my_data, 1, median, na.rm=T)

my_data$mean = apply(my_data, 1, mean, na.rm=T)

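But I don't think this code is correct. For example, with this code the median of the second row is returned as "16.90215".

But when I take the median of this row manually: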

median(11.22369  , 15.36253 , 16.90215 ,  20.20724,  15.90227 , 15.14539   , 13.74945 , 18.30090 , 19.55124 , 17.24132)

I get the answer

11.22

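Can someone tell me what I am doing wrong?

Thanks

【Comments】:

Your first column is character or factor. You need to drop that column from the calculation, i.e. apply(my_data[-1], 1, median, na.rm=TRUE)

【Answer 1】: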

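It is the manual calculation that is incorrect: the first argument of median is x, which can be a vector; the second argument is na.rm, followed by the variadic arguments (...). So when you write 11.22369, 15.36253, ..., only 11.22369 is taken as x, and that is the value returned. Instead, the values should be concatenated into a single vector with c: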

median(c(11.22369  , 15.36253 , 16.90215 ,  20.20724,  15.90227 , 15.14539   , 13.74945 , 18.30090 , 19.55124 , 17.24132))
[1] 16.40221
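To make the argument matching concrete, here is a minimal sketch using just the first three values of that row (the shortened input is my own choice for illustration, not something from the question):

```r
# median() is declared as median(x, na.rm = FALSE, ...), so bare numbers are
# matched positionally: the first becomes x, the second becomes na.rm, and
# anything after that falls into ... and is ignored.
wrong <- median(11.22369, 15.36253, 16.90215)

# Wrapping the values in c() passes them all as the single vector x.
right <- median(c(11.22369, 15.36253, 16.90215))

print(wrong)  # the first value only
print(right)  # the middle value of the three
```

The same applies to mean(), whose first argument is also a single vector x.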

Also, based on the OP's data, the first column, which is character or factor, should be removed:

 apply(my_data[-1], 1, median, na.rm=TRUE)
       1        2        3        4        5        6        7        8        9       10 
17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695

The second value corresponds to the row used in the manual calculation.
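Putting this together for the two columns the question asks for, a base R sketch might look like the following. It uses only the first two rows of the question's data to stay short, and the helper name run_cols is my own. It also shows why the original call misbehaved: apply() converts the data frame to a matrix first, and with the student column included that matrix is character rather than numeric.

```r
# First two rows of the question's data (values copied from the post).
my_data <- data.frame(
  student     = c("student1", "student2"),
  first_run   = c(19.70847, 11.22369),
  second_run  = c(21.79771, 15.36253),
  third_run   = c(16.49083, 16.90215),
  fourth_run  = c(19.51691, 20.20724),
  fifth_run   = c(13.97987, 15.90227),
  sixth_run   = c(14.60733, 15.14539),
  seventh_run = c(13.89703, 13.74945),
  eight_run   = c(15.24651, 18.30090),
  ninth_run   = c(20.75679, 19.55124),
  tenth_run   = c(18.44020, 17.24132)
)

# Names of the numeric columns, fixed before any new columns are added
# (run_cols is a hypothetical helper name).
run_cols <- names(my_data)[vapply(my_data, is.numeric, logical(1))]

# With the student column included, the matrix that apply() works on is character.
is.character(as.matrix(my_data))            # TRUE
is.numeric(as.matrix(my_data[run_cols]))    # TRUE

# Row-wise median and mean over the numeric columns only.
my_data$my_median <- apply(my_data[run_cols], 1, median, na.rm = TRUE)
my_data$my_mean   <- rowMeans(my_data[run_cols], na.rm = TRUE)

print(my_data[c("student", "my_median", "my_mean")])
```

Selecting columns by name up front means the new my_median column is not accidentally included when my_mean is computed.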

【Answer 2】:

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(median = median(c_across(where(is.numeric))),
         mean = mean(c_across(where(is.numeric))))

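c_across and rowwise were created for exactly this situation. Most dplyr verbs work column-wise; piping into rowwise first changes this behaviour to row-wise.

c_across then combines all numeric values in a row (hence where(is.numeric)) into a single numeric vector, to which mean or median can be applied.

Note: because rowwise creates a data frame grouped by row, you may want to pipe the output into ungroup.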
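As a sketch of that last note, the pipeline can end with ungroup(). The example below is a variation on the code above, not the same code: it assumes the run columns all end in "_run" (as they do in the question) and selects them by name with ends_with(). Selecting by name rather than by type may also help keep a column created earlier in the same mutate() call, such as the new median, out of the later mean. The data are two rows and three columns taken from the question, trimmed for brevity.

```r
library(dplyr)

# Two rows and three run columns copied from the question's data.
my_data <- data.frame(
  student    = c("student1", "student2"),
  first_run  = c(19.70847, 11.22369),
  second_run = c(21.79771, 15.36253),
  third_run  = c(16.49083, 16.90215)
)

result <- my_data %>%
  rowwise() %>%
  mutate(my_median = median(c_across(ends_with("_run")), na.rm = TRUE),
         my_mean   = mean(c_across(ends_with("_run")), na.rm = TRUE)) %>%
  ungroup()  # drop the row-wise grouping so later verbs work column-wise again

print(result)
```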

【Answer 3】:

Here is an alternative using pmap, which passes all arguments at once, hence the ellipsis (...). The output needs to be unnested with unnest_wider from tidyr:

library(tidyr)
library(dplyr)
library(purrr)
df %>% 
  mutate(res = pmap(across(where(is.numeric)),
                    ~ list(median = median(c(...)),
                           avg = mean(c(...))))) %>%
  unnest_wider(res)

Output:

  student   first_run second_run third_run fourth_run fifth_run sixth_run seventh_run eight_run ninth_run tenth_run median   avg
   <chr>         <dbl>      <dbl>     <dbl>      <dbl>     <dbl>     <dbl>       <dbl>     <dbl>     <dbl>     <dbl>  <dbl> <dbl>
 1 student1       19.7       21.8      16.5       19.5      14.0      14.6        13.9      15.2      20.8      18.4   17.5  17.4
 2 student2       11.2       15.4      16.9       20.2      15.9      15.1        13.7      18.3      19.6      17.2   16.4  16.4
 3 student3       15.9       17.0      14.2       13.2      14.7      15.5        13.1      19.9      22.4      17.4   15.7  16.3
 4 student4       16.2       15.1      14.8       16.8      14.5      17.7        13.5      14.3      19.9      18.9   15.7  16.2
 5 student5       18.7       18.9      17.2       19.4      15.7      18.4        15.3      16.0      18.9      16.6   17.8  17.5
 6 student6       19.8       12.7      18.5       17.9      14.5      17.2        13.1      12.5      20.7      15.9   16.5  16.3
 7 student7       14.8       23.8      18.5       20.8      14.2      16.1        13.0      12.7      20.1      15.7   15.9  17.0
 8 student8       17.1       15.6      13.7       15.0      14.2      16.0        14.6      16.2      21.7      17.1   15.8  16.1
 9 student9       20.3       12.4      12.3       15.1      14.6      18.7        15.1      17.6      18.8      17.6   16.3  16.3
10 student10      17.7       16.2      14.1       17.2      16.6      19.5        13.1      15.8      18.1      18.3   16.9  16.7

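【Answer 4】:

You would definitely benefit from the speed of the matrixStats package, whose rowMedians and rowMeans2 work directly on a matrix.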

matrixStats::rowMedians(as.matrix(d[-1]))
# [1] 17.46551 16.40221 15.71673 15.65965 17.77517 16.54246 15.87096 15.81245 16.34356 16.89695
matrixStats::rowMeans2(as.matrix(d[-1]))
# [1] 17.44417 16.35862 16.33775 16.15837 17.50521 16.27728 16.95862 16.12661 16.25687 16.66180

stopifnot(all.equal(matrixStats::rowMedians(as.matrix(d[-1])),
                    as.numeric(apply(d[-1], 1, median, na.rm = TRUE))))
stopifnot(all.equal(matrixStats::rowMeans2(as.matrix(d[-1])),
                    as.numeric(apply(d[-1], 1, mean, na.rm = TRUE))))

Data:

d <- structure(list(student = c("student1", "student2", "student3", 
"student4", "student5", "student6", "student7", "student8", "student9", 
"student10"), first_run = c(19.70847, 11.22369, 15.93649, 16.18733, 
18.71084, 19.75246, 14.75144, 17.06516, 20.27611, 17.70736), 
    second_run = c(21.79771, 15.36253, 17.03599, 15.13197, 18.85453, 
    12.74605, 23.82376, 15.63075, 12.44592, 16.2162), third_run = c(16.49083, 
    16.90215, 14.20214, 14.79481, 17.15864, 18.52214, 18.51366, 
    13.72026, 12.26502, 14.10861), fourth_run = c(19.51691, 20.20724, 
    13.17548, 16.75177, 19.3888, 17.92626, 20.77424, 15.02068, 
    15.13456, 17.20014), fifth_run = c(13.97987, 15.90227, 14.70327, 
    14.51287, 15.68862, 14.48501, 14.22155, 14.21098, 14.61552, 
    16.59376), sixth_run = c(14.60733, 15.14539, 15.49697, 17.71816, 
    18.39169, 17.2078, 16.08186, 15.99414, 18.72192, 19.50027
    ), seventh_run = c(13.89703, 13.74945, 13.08945, 13.45054, 
    15.26428, 13.10512, 12.95981, 14.64818, 15.11129, 13.05073
    ), eight_run = c(15.24651, 18.3009, 19.94142, 14.25553, 16.04526, 
    12.46502, 12.6782, 16.15603, 17.60746, 15.80002), ninth_run = c(20.75679, 
    19.55124, 22.41674, 19.89091, 18.92532, 20.68583, 20.12166, 
    21.74607, 18.83831, 18.09781), tenth_run = c(18.4402, 17.24132, 
    17.37958, 18.88981, 16.62409, 15.87711, 15.66006, 17.07382, 
    17.55257, 18.34313)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
