Calculate the average of rows with similar values in R
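【Posted】: 2021-01-12 21:54:23
【Question】: I'm working with a dataset on unemployment rates in UK regions over 22 months.
I split the original dataset in two: one with the regions that have higher unemployment (df1) and another with the regions that have lower unemployment (df2).
The desired output is the same for both, so I'm only posting the structure of df1.
df1 currently contains monthly unemployment rates for five regions:
North East
London
Yorkshire and The Humber
East Midlands
West Midlands
I want to calculate the average unemployment rate across these regions for each month (i.e. the average of North East, London, etc. for Jan-19, Feb-19, all the way to Oct-20).
The point is that once the regional rates are aggregated into a single average, I can have one plot instead of five different ones.
Expected output: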
Date | Region | Unemployment rate
01-2019 | ABC | (A_Jan19 + B_Jan19 + C_Jan19) / 3
02-2019 | ABC | (A_Feb19 + B_Feb19 + C_Feb19) / 3
03-2019 | ABC | (A_Mar19 + B_Mar19 + C_Mar19) / 3
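and so on.
So instead of having 5 values per month (one per region), I want the regions' values added together and then divided by the number of regions, for each month.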
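The calculation described above can be sketched on a toy data frame. The regions A, B and C and all rates below are invented for illustration and are not taken from df1; the names toy and monthly are likewise assumptions.

```r
# Toy data: three hypothetical regions (A, B, C) over two months.
# All values are invented for illustration.
toy <- data.frame(
  Date   = as.Date(rep(c("2019-01-01", "2019-02-01"), each = 3)),
  Region = rep(c("A", "B", "C"), times = 2),
  Unemployment.rate = c(4, 5, 6, 3, 5, 10)
)

# For each month: add up the regional rates and divide by the number of regions
monthly <- tapply(toy$Unemployment.rate, toy$Date,
                  function(x) sum(x) / length(x))
print(monthly)
```

Summing and dividing by the count is exactly what mean() does, so any grouped mean over the date column alone should give the same one-value-per-month result.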
This is the structure of df1:
structure(list(
Date = structure(c(17897, 17897, 17897, 17897,
17897, 17928, 17928, 17928, 17928, 17928, 17956, 17956, 17956,
17956, 17956, 17987, 17987, 17987, 17987, 17987, 18017, 18017,
18017, 18017, 18017, 18048, 18048, 18048, 18048, 18048, 18078,
18078, 18078, 18078, 18078, 18109, 18109, 18109, 18109, 18109,
18140, 18140, 18140, 18140, 18140, 18170, 18170, 18170, 18170,
18170, 18201, 18201, 18201, 18201, 18201, 18231, 18231, 18231,
18231, 18231, 18262, 18262, 18262, 18262, 18262, 18293, 18293,
18293, 18293, 18293, 18322, 18322, 18322, 18322, 18322, 18353,
18353, 18353, 18353, 18353, 18383, 18383, 18383, 18383, 18383,
18414, 18414, 18414, 18414, 18414, 18444, 18444, 18444, 18444,
18444, 18475, 18475, 18475, 18475, 18475, 18506, 18506, 18506,
18506, 18506, 18536, 18536, 18536, 18536, 18536), class = "Date"),
Region = structure(c(4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L,
9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L,
9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L,
9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L,
9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L,
9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L,
9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L,
9L, 4L, 6L, 7L, 8L, 9L, 4L, 6L, 7L, 8L, 9L),
.Label = c("England",
"South East", "South West", "London", "East of England",
"East Midlands", "West Midlands", "Yorkshire and The Humber",
"North East", "North West"), class = "factor"),
Unemployment.rate = c(4.2102766429572,
4.68247349426148, 5.0708122696351, 5.23113585152962, 5.05625777763551,
4.45850956493638, 4.24086209425895, 5.20425572086481, 4.90649662696461,
5.58119346747183, 4.36960549219723, 4.02517515965457, 5.07463979478007,
4.74861899849302, 5.41295614949722, 4.2765275404374, 4.29397104451947,
4.95863831882363, 4.92741739593892, 5.69156027694963, 4.2650375361128,
4.23454968410189, 4.79139912788739, 5.02305883708418, 5.5878529496241,
4.54049887070026, 4.28118824655063, 4.56621383409869, 5.02948552097342,
5.34849310422496, 4.63523851140925, 4.63665149464923, 4.15610221124255,
4.28827168334814, 4.97071907922267, 4.63148007856079, 4.50379542173275,
3.98279027057451, 4.00981283870947, 5.80674097480643, 4.5449089097835,
4.46358064141772, 4.09111105457073, 3.90122545742185, 5.85180583091048,
4.50615604436695, 3.65653388653173, 4.4653881330391, 4.08974888999112,
6.11361138828401, 4.31177130663949, 3.86911315140672, 4.31748261760943,
4.34062792253313, 6.21086689536757, 4.28854311714984, 3.58533538113168,
4.43826006085208, 4.47398990035041, 6.11583334445995, 4.4614986334698,
3.93320874039025, 4.50210360585639, 4.58329815843159, 6.1811363458787,
4.4993016103369, 4.02503140646339, 4.81764323428107, 4.71840892982655,
5.61192961811575, 4.66797282030472, 3.76788548732822, 5.02382063022771,
4.27033347501753, 5.40098295976569, 4.63121679655635, 3.67161258712684,
4.80322174913054, 3.91339590231661, 5.20229523339659, 5.10845457998552,
3.97182605242641, 4.85515814694348, 3.78242013517353, 4.97115704468143,
4.6437916194869, 4.3194319371037, 4.41226516242903, 3.75797094178592,
5.16820059074221, 4.98077486925899, 4.38753537321373, 4.37107017836121,
3.98499236263049, 5.15965087736712, 5.2511686249283, 4.39271393019063,
4.62628095567074, 4.16298001615593, 6.62714213785116, 5.95104220347072,
4.89588411607636, 4.9378241924801, 4.65307341597827, 6.67088507450695,
6.33714099073375, 5.32040137455687, 5.402969264185, 5.15177120913334,
6.56889233919367)),
row.names = c(NA, -110L), class = c("tbl_df", "tbl", "data.frame"))
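【Answer 1】: We can convert the 'Date' column to year-month with format, or with as.yearmon from zoo, use it together with 'Region' as the grouping columns, and summarise to get the mean of 'Unemployment.rate':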
library(dplyr)
library(zoo)
df1 %>%
group_by(year_mon = as.yearmon(Date), Region) %>%
summarise(Mean_unemp = mean(Unemployment.rate, na.rm = TRUE), .groups = 'drop')
Or, if it should be based on the month only (ignoring the year):
df1 %>%
group_by(Month = format(Date, "%m"), Region) %>%
summarise(Mean_unemp = mean(Unemployment.rate, na.rm = TRUE), .groups = 'drop')
If it should be based on 'Region' only:
df1 %>%
group_by(Region) %>%
summarise(Mean_unemp = mean(Unemployment.rate, na.rm = TRUE), .groups = 'drop')
Or grouped by 'Date', which gives one mean per month across all regions:
df1 %>%
group_by(Date) %>%
summarise(Mean = mean(Unemployment.rate))
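For the goal of one value per month, the variant grouped by 'Date' is the relevant one: it collapses the five regional rows of each month into a single row and leaves the column as class Date, which suits plotting. A minimal sketch follows, using only the first two months of df1 with rates rounded to two decimals so that it runs on its own; the names sample_df and monthly_mean are assumptions, not part of the original answer.

```r
library(dplyr)

# First two months of df1, rates rounded to two decimals
sample_df <- data.frame(
  Date = as.Date(rep(c("2019-01-01", "2019-02-01"), each = 5)),
  Region = rep(c("London", "East Midlands", "West Midlands",
                 "Yorkshire and The Humber", "North East"), times = 2),
  Unemployment.rate = c(4.21, 4.68, 5.07, 5.23, 5.06,
                        4.46, 4.24, 5.20, 4.91, 5.58)
)

# Each Region/Date pair occurs exactly once, so grouping by both
# simply returns the input values unchanged
count(sample_df, Region, Date)

# Grouping by Date alone gives one mean per month across the regions
monthly_mean <- sample_df %>%
  group_by(Date) %>%
  summarise(Mean_unemp = mean(Unemployment.rate, na.rm = TRUE), .groups = "drop")

# Date is still class "Date", so it can go straight onto a time axis
plot(monthly_mean$Date, monthly_mean$Mean_unemp, type = "b",
     xlab = "Date", ylab = "Mean unemployment rate")
```

Because the dates in df1 are all first-of-month values, grouping by the raw Date column already groups by month, so no conversion to a year-month class appears to be needed here.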
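【Comments】:
Thanks for the tip! I tried your code, but it only changed the Date column to month-year format. Everything else stayed the same: I still have 5 unemployment rate values per month, but I need only one per month (which is why I want to calculate the average for each month). Also, this changes the Date column to class "yearmon". Wouldn't it be better to keep it as class "Date" for plotting?
@Lactuca I assumed your question was about getting the mean for each year-month and Region.
@Lactuca The reason is that there is only one unique row for each Region and year-month. You can check the frequency counts with table or count(df1, Region, Date).
Sorry, I wasn't clear. I want to do the following: take the unemployment rate values of each region and divide by the number of regions. If I do this for every month, I should have one unemployment rate value per month, rather than X values per month, where X = number of regions. Does that make sense?
@Lactuca In that case, you only need to group by 'Region'.
【Answer 2】: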
Using base R:
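# Work on a copy so that df1$Date keeps its Date class
df_ym <- df1
df_ym$Date <- format(df_ym$Date, '%b-%Y')
# Mean unemployment rate for each Date/Region combination
out <- aggregate(Unemployment.rate ~ ., data = df_ym, FUN = mean, na.rm = TRUE)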
Output:
head(out)
Date Region Unemployment.rate
1 Apr-2019 London 4.276528
2 Apr-2020 London 4.631217
3 Aug-2019 London 4.631480
4 Aug-2020 London 5.251169
5 Dec-2019 London 4.288543
6 Feb-2019 London 4.458510
Another option, by month only:
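# Work on a copy so that df1$Date keeps its Date class
df_m <- df1
df_m$Date <- format(df_m$Date, '%b')
# Mean unemployment rate for each month-of-year/Region combination
out <- aggregate(Unemployment.rate ~ ., data = df_m, FUN = mean, na.rm = TRUE)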
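For the request clarified in the comments (one value per month, averaged over the regions), a base R sketch would aggregate on Date alone instead of on every other column. The small data frame below stands in for df1 with invented values so the snippet runs on its own; df_demo, out_month and the Label column are assumptions made for illustration.

```r
# Stand-in for df1 (values invented for illustration)
df_demo <- data.frame(
  Date = as.Date(rep(c("2019-01-01", "2019-02-01", "2020-01-01"), each = 2)),
  Region = rep(c("London", "North East"), times = 3),
  Unemployment.rate = c(4.0, 5.0, 4.5, 5.5, 4.2, 6.2)
)

# Group on Date only, so the regions are averaged out
out_month <- aggregate(Unemployment.rate ~ Date, data = df_demo, FUN = mean)

# Add a display label without overwriting the Date column
out_month$Label <- format(out_month$Date, "%b-%Y")
out_month
```

Keeping Date as a real Date and adding the formatted label separately keeps the rows in chronological order; a character column such as 'Apr-2019' sorts alphabetically, as the head(out) output above shows.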
【Comments】:
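Duck, you're a saint. Sorry I wasn't clear; I want to do the following: take the unemployment rate values of each region and divide by the number of regions. If I do this for every month, I should have one unemployment rate value per month, rather than X values per month, where X = number of regions. Does that make sense?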