How do I calculate the right mean?
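【Posted】: 2021-08-05 07:23:27 【Question】: I have a dataset showing bilateral exports for several countries. Because the data fluctuate, I need to compute means over groups of years. Not every country covers exactly the same years: some start later and some have gaps in between, meaning some years are missing (but there are no NA entries). With the help of a wonderful community member I have already managed to cut the data into pieces: year_group.
Below I list two further problems together with my code, the incorrect output, and, at the bottom, some sample input data from the dataset total_trade.
Problem 1
The problem I am facing is that the code does not calculate the correct means. When I compute the result by hand, I get a different value than my code does (see below).
Here is my code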
library(dplyr)
# create vectors for coding 4 years average
year_group_break <- c(1999, 2003, 2007, 2011, 2015, 2019)
year_group_labels <- c("1999-2002", "2003-2006", "2007-2010", "2011-2014", "2015-2018")
years <- c(1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)
FourY_av <- total_trade %>%
  # create year_group variable for average values with above predefined labels and cuts,
  # chose right = FALSE so each interval ends just before the next year_group_break
  mutate(year_group = cut(Year, breaks = year_group_break,
                          labels = year_group_labels,
                          include.lowest = TRUE, right = FALSE)) %>%
  # add column with mean of total trade per four year period: "total_year_group"
  group_by(ReporterName, year_group) %>%
  mutate(total_year_group = mean(Total_Year)) %>%
  arrange(ReporterName, PartnerName, desc(Year))
View(FourY_av)
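Below is the incorrect output. It is wrong because total_year_group (the mean for Angola's year group '2015-2018') should be 34746013.5 (computed by hand) rather than 34907582 (as shown in the output). Where is my mistake?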
> head(FourY_av)
# A tibble: 6 x 9
# Groups: ReporterName, year_group [1]
Year ReporterName PartnerName PartnerISO3 `TradeValue in 1000 USD` Total_Year pct_by_partner_year year_group total_year_group
<int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl>
1 2018 Angola Afghanistan AFG 19.4 42096736. 0.0000460 2015-2018 34907582.
2 2017 Angola Afghanistan AFG 2.25 34904881. 0.00000644 2015-2018 34907582.
3 2016 Angola Afghanistan AFG 0.775 28057500. 0.00000276 2015-2018 34907582.
4 2015 Angola Afghanistan AFG 39.6 33924937. 0.000117 2015-2018 34907582.
5 2018 Angola Albania ALB 2.38 42096736. 0.00000565 2015-2018 34907582.
6 2017 Angola Albania ALB 39.7 34904881. 0.000114 2015-2018 34907582.
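Problem 2
Another problem is that not all countries have data for all years. Some start later and some have gaps. To ensure comparability I still need means over the same year groups. The dataset contains no NAs; the data are simply missing.
For example, Angola has no data for 2008. The dataset contains no NA for it; the row and value for Angola in 2008 are simply absent. Other countries do have 2008 data. I still need the mean over Angola's available years in the total_year_group column (that is, the mean of 2007, 2009 and 2010). That shouldn't be a problem for the mean function, right? Or is there something special I need to consider in this case?
Here is some sample input data from total_trade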
dput(head(total_trade, n = 100))
structure(list(Year = c(2015L, 2018L, 2017L, 2016L, 2017L, 2015L,
2018L, 2016L, 2015L, 2017L, 2018L, 2018L, 2017L, 2018L, 2018L,
2015L, 2016L, 2017L, 2016L, 2015L, 2017L, 2018L, 2018L, 2017L,
2016L, 2015L, 2018L, 2014L, 2015L, 2016L, 2017L, 2017L, 2018L,
2016L, 2015L, 2016L, 2018L, 2017L, 2015L, 2010L, 2009L, 2016L,
2013L, 2014L, 2018L, 2017L, 2015L, 2016L, 2017L, 2018L, 2017L,
2018L, 2016L, 2016L, 2018L, 2007L, 2013L, 2009L, 2018L, 2015L,
2016L, 2014L, 2010L, 2017L, 2012L, 2011L, 2018L, 2016L, 2015L,
2016L, 2011L, 2018L, 2017L, 2015L, 2015L, 2016L, 2018L, 2017L,
2015L, 2015L, 2016L, 2018L, 2017L, 2007L, 2014L, 2010L, 2013L,
2011L, 2009L, 2012L, 2017L, 2018L, 2016L, 2015L, 2015L, 2015L,
2017L, 2016L, 2018L, 2015L), ReporterName = c("Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola"
), PartnerName = c("Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Albania", "Albania", "Albania", "Algeria", "Algeria",
"Algeria", "Algeria", "American Samoa", "Andorra", "Andorra",
"Antigua and Barbuda", "Antigua and Barbuda", "Antigua and Barbuda",
"Antigua and Barbuda", "Argentina", "Argentina", "Argentina",
"Argentina", "Armenia", "Armenia", "Armenia", "Armenia", "Australia",
"Australia", "Australia", "Australia", "Australia", "Austria",
"Austria", "Austria", "Austria", "Azerbaijan", "Azerbaijan",
"Azerbaijan", "Azerbaijan", "Bahamas, The", "Bahamas, The", "Bahamas, The",
"Bahamas, The", "Bahamas, The", "Bahamas, The", "Bahamas, The",
"Bahamas, The", "Bahrain", "Bahrain", "Bahrain", "Bangladesh",
"Bangladesh", "Bangladesh", "Barbados", "Belarus", "Belgium",
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium",
"Belgium", "Belgium", "Belgium", "Belgium", "Belize", "Belize",
"Belize", "Benin", "Benin", "Benin", "Benin", "Benin", "Bhutan",
"Bolivia", "Bolivia", "Bolivia", "Bolivia", "Botswana", "Botswana",
"Botswana", "Botswana", "Brazil", "Brazil", "Brazil", "Brazil",
"Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil",
"British Virgin Islands", "Brunei", "Bulgaria", "Bulgaria", "Bulgaria",
"Bulgaria"), PartnerISO3 = c("AFG", "AFG", "AFG", "AFG", "ALB",
"ALB", "ALB", "DZA", "DZA", "DZA", "DZA", "ASM", "AND", "AND",
"ATG", "ATG", "ATG", "ATG", "ARG", "ARG", "ARG", "ARG", "ARM",
"ARM", "ARM", "ARM", "AUS", "AUS", "AUS", "AUS", "AUS", "AUT",
"AUT", "AUT", "AUT", "AZE", "AZE", "AZE", "AZE", "BHS", "BHS",
"BHS", "BHS", "BHS", "BHS", "BHS", "BHS", "BHR", "BHR", "BHR",
"BGD", "BGD", "BGD", "BRB", "BLR", "BEL", "BEL", "BEL", "BEL",
"BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BLZ", "BLZ",
"BLZ", "BEN", "BEN", "BEN", "BEN", "BEN", "BTN", "BOL", "BOL",
"BOL", "BOL", "BWA", "BWA", "BWA", "BWA", "BRA", "BRA", "BRA",
"BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "VGB",
"BRN", "BGR", "BGR", "BGR", "BGR"), `TradeValue in 1000 USD` = c(39.586,
19.353, 2.248, 0.775, 39.723, 2.259, 2.38, 2169.123, 2322.463,
2241.599, 245.226, 12.007, 5.975, 0.326, 422.006, 155.467, 47.018,
54.774, 483.147, 142.23, 98.7, 61.362, 60.105, 30.494, 0.99,
0.731, 40220.092, 45435.804, 16096.404, 8546.882, 1904.301, 627.179,
433.699, 23.118, 5.124, 985.67, 600.371, 143.356, 9.926, 140139.415,
108214.936, 64444.203, 100210.999, 52974.059, 7322.893, 145.791,
26.995, 4.847, 5.187, 1.958, 125.722, 55.22, 2.75, 3.366, 54.31,
107976.895, 123610.469, 66757.2, 67763.201, 50046.64, 40199.706,
52383.95, 45614.873, 28690.458, 52907.343, 39328.574, 452.078,
5.82, 0.32, 970.324, 1700.981, 804.478, 332.216, 69.342, 1.632,
1530.58, 308.752, 62.569, 19.822, 55.241, 37.029, 16.917, 0.198,
874217.786, 1032751.313, 509259.955, 428750.075, 333280.441,
192964.08, 315316.932, 119947.132, 141486.749, 66556.728, 1273.093,
5.064, 22.324, 158.252, 33.583, 8.435, 0.077), Total_Year = c(33924937.48,
42096736.31, 34904881.111, 28057499.527, 34904881.111, 33924937.48,
42096736.31, 28057499.527, 33924937.48, 34904881.111, 42096736.31,
42096736.31, 34904881.111, 42096736.31, 42096736.31, 33924937.48,
28057499.527, 34904881.111, 28057499.527, 33924937.48, 34904881.111,
42096736.31, 42096736.31, 34904881.111, 28057499.527, 33924937.48,
42096736.31, 58672369.19, 33924937.48, 28057499.527, 34904881.111,
34904881.111, 42096736.31, 28057499.527, 33924937.48, 28057499.527,
42096736.31, 34904881.111, 33924937.48, 52612114.76, 40639411.73,
28057499.527, 67712526.544, 58672369.19, 42096736.31, 34904881.111,
33924937.48, 28057499.527, 34904881.111, 42096736.31, 34904881.111,
42096736.31, 28057499.527, 28057499.527, 42096736.31, 44177783.072,
67712526.544, 40639411.73, 42096736.31, 33924937.48, 28057499.527,
58672369.19, 52612114.76, 34904881.111, 70863076.416, 66427390.221,
42096736.31, 28057499.527, 33924937.48, 28057499.527, 66427390.221,
42096736.31, 34904881.111, 33924937.48, 33924937.48, 28057499.527,
42096736.31, 34904881.111, 33924937.48, 33924937.48, 28057499.527,
42096736.31, 34904881.111, 44177783.072, 58672369.19, 52612114.76,
67712526.544, 66427390.221, 40639411.73, 70863076.416, 34904881.111,
42096736.31, 28057499.527, 33924937.48, 33924937.48, 33924937.48,
34904881.111, 28057499.527, 42096736.31, 33924937.48), pct_by_partner_year = c(0.000116687024179005,
4.59726850497024e-05, 6.44035999679013e-06, 2.7621848456389e-06,
0.000113803567683494, 6.65881846158674e-06, 5.65364493454718e-06,
0.0077309918437765, 0.00684588733986371, 0.00642202158738646,
0.000582529719629944, 2.8522401146684e-05, 1.71179497245645e-05,
7.74406827169068e-07, 0.00100246726228929, 0.000458267609458834,
0.000167577299448064, 0.000156923611416451, 0.00172198880208503,
0.000419249114560196, 0.000282768474948037, 0.00014576426910659,
0.000142778289407966, 8.73631395649993e-05, 3.5284683834613e-06,
2.15475710288619e-06, 0.0955420669759755, 0.0774398658640565,
0.0474471147057807, 0.0304620231456308, 0.00545568682484317,
0.00179682319502973, 0.00103024376238159, 8.23950829180388e-05,
1.51039335091503e-05, 0.00351303578942051, 0.00142616994243657,
0.000410704736521284, 2.92587127267419e-05, 0.2663633948935,
0.266280763902189, 0.229686194730164, 0.147994771595005, 0.0902879153020956,
0.0173953936620509, 0.000417680838208199, 7.95727332317548e-05,
1.72752386410474e-05, 1.48603858110989e-05, 4.65119192514428e-06,
0.000360184581635431, 0.000131174064405754, 9.80130106517029e-06,
1.19967925037684e-05, 0.000129012376636663, 0.244414471464133,
0.18255184868885, 0.164267141570654, 0.160970200874938, 0.147521686751831,
0.143276153177212, 0.0892821454514031, 0.0867003221749986, 0.0821961201035531,
0.0746613690455784, 0.0592053577133712, 0.00107390272887404,
2.07431171633786e-05, 9.43258923288073e-07, 0.00345834096536738,
0.0025606620918584, 0.00191102225615742, 0.000951775194258733,
0.000204398313308255, 4.81062050876917e-06, 0.00545515468521031,
0.000733434529761055, 0.000179255731601051, 5.84289949294256e-05,
0.000162833019316739, 0.000131975409869888, 4.01860131755188e-05,
5.6725590719059e-07, 1.97886296054109, 1.76020046106476, 0.967951882039117,
0.633191666126054, 0.50172141324715, 0.474820062066878, 0.444966473299775,
0.34363999584631, 0.336099093188823, 0.237215465105691, 0.00375267603882995,
1.49270724610338e-05, 6.58041006358842e-05, 0.000453380716286491,
0.00011969348860786, 2.00371827827334e-05, 2.26971678416193e-07
)), row.names = c(NA, -100L), groups = structure(list(Year = c(2007L,
2007L, 2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 2011L, 2011L,
2011L, 2012L, 2012L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L,
2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L
), ReporterName = c("Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola"), PartnerName = c("Belgium",
"Brazil", "Bahamas, The", "Belgium", "Brazil", "Bahamas, The",
"Belgium", "Brazil", "Belgium", "Benin", "Brazil", "Belgium",
"Brazil", "Bahamas, The", "Belgium", "Brazil", "Australia", "Bahamas, The",
"Belgium", "Brazil", "Afghanistan", "Albania", "Algeria", "Antigua and Barbuda",
"Argentina", "Armenia", "Australia", "Austria", "Azerbaijan",
"Bahamas, The", "Belgium", "Belize", "Benin", "Bhutan", "Bolivia",
"Botswana", "Brazil", "British Virgin Islands", "Brunei", "Bulgaria",
"Afghanistan", "Algeria", "Antigua and Barbuda", "Argentina",
"Armenia", "Australia", "Austria", "Azerbaijan", "Bahamas, The",
"Bahrain", "Bangladesh", "Barbados", "Belgium", "Belize", "Benin",
"Bolivia", "Botswana", "Brazil", "Bulgaria", "Afghanistan", "Albania",
"Algeria", "Andorra", "Antigua and Barbuda", "Argentina", "Armenia",
"Australia", "Austria", "Azerbaijan", "Bahamas, The", "Bahrain",
"Bangladesh", "Belgium", "Benin", "Bolivia", "Botswana", "Brazil",
"Bulgaria", "Afghanistan", "Albania", "Algeria", "American Samoa",
"Andorra", "Antigua and Barbuda", "Argentina", "Armenia", "Australia",
"Austria", "Azerbaijan", "Bahamas, The", "Bahrain", "Bangladesh",
"Belarus", "Belgium", "Belize", "Benin", "Bolivia", "Botswana",
"Brazil", "Bulgaria"), .rows = structure(list(56L, 84L, 41L,
58L, 89L, 40L, 63L, 86L, 66L, 71L, 88L, 65L, 90L, 43L, 57L,
87L, 28L, 44L, 62L, 85L, 1L, 6L, 9L, 16L, 20L, 26L, 29L,
35L, 39L, 47L, 60L, 69L, 74L, 75L, 79L, 80L, 94L, 95L, 96L,
100L, 4L, 8L, 17L, 19L, 25L, 30L, 34L, 36L, 42L, 48L, 53L,
54L, 61L, 68L, 70L, 76L, 81L, 93L, 98L, 3L, 5L, 10L, 13L,
18L, 21L, 24L, 31L, 32L, 38L, 46L, 49L, 51L, 64L, 73L, 78L,
83L, 91L, 97L, 2L, 7L, 11L, 12L, 14L, 15L, 22L, 23L, 27L,
33L, 37L, 45L, 50L, 52L, 55L, 59L, 67L, 72L, 77L, 82L, 92L,
99L), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr",
"list"))), row.names = c(NA, 100L), class = c("tbl_df", "tbl",
"data.frame"), .drop = TRUE), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"))
【Question comments】:
You have not included year_group_break, so the code throws an error.
Thanks! I have just updated the post.
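【Answer 1】:
The problem with mean is that the rows for any ReporterName in the data are duplicated: each yearly Total_Year value is repeated once per partner row, so years with more partners weigh more heavily in the mean.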
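To make the effect concrete, here is a minimal base R sketch. The data frame `toy` and its values are invented for illustration and are not taken from the question's data; it only shows how a row-wise mean differs from a mean taken over one value per year.

```r
# Toy data (made up): the yearly total is repeated on every partner row.
toy <- data.frame(
  Year       = c(2015, 2015, 2015, 2016),
  Partner    = c("A", "B", "C", "A"),
  Total_Year = c(100, 100, 100, 200)
)

# Mean over all rows: 2015 is counted three times, 2016 only once.
row_mean <- mean(toy$Total_Year)

# Mean over one row per year: every year is counted exactly once.
per_year  <- toy[!duplicated(toy$Year), ]
year_mean <- mean(per_year$Total_Year)

print(c(row_mean = row_mean, year_mean = year_mean))
```

The row-wise mean is pulled toward the year with the most partner rows, which matches the kind of discrepancy described in the question.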
Problem 1
total_trade %>%
  # create year_group variable for average values with above predefined labels and cuts,
  # chose right = FALSE so each interval ends just before the next year_group_break
  mutate(year_group = cut(Year, breaks = year_group_break,
                          labels = year_group_labels,
                          include.lowest = TRUE, right = FALSE)) %>%
  # add column with mean of total trade per four year period: "total_year_group"
  group_by(ReporterName, year_group) %>%
  # dup is TRUE only for the first row carrying a given Total_Year within the group,
  # so each yearly total enters the mean exactly once
  mutate(dup = !duplicated(paste0(ReporterName, year_group, Total_Year)),
         total_year_group = sum(Total_Year * dup) / sum(dup)) %>%
  arrange(ReporterName, PartnerName, desc(Year))
# A tibble: 100 x 10
# Groups: ReporterName, year_group [3]
Year ReporterName PartnerName PartnerISO3 `TradeValue in 1000 USD` Total_Year pct_by_partner_year year_group dup total_year_group
<int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <fct> <lgl> <dbl>
1 2018 Angola Afghanistan AFG 19.4 42096736. 0.0000460 2015-2018 TRUE 34746014.
2 2017 Angola Afghanistan AFG 2.25 34904881. 0.00000644 2015-2018 TRUE 34746014.
3 2016 Angola Afghanistan AFG 0.775 28057500. 0.00000276 2015-2018 TRUE 34746014.
4 2015 Angola Afghanistan AFG 39.6 33924937. 0.000117 2015-2018 TRUE 34746014.
5 2018 Angola Albania ALB 2.38 42096736. 0.00000565 2015-2018 FALSE 34746014.
6 2017 Angola Albania ALB 39.7 34904881. 0.000114 2015-2018 FALSE 34746014.
7 2015 Angola Albania ALB 2.26 33924937. 0.00000666 2015-2018 FALSE 34746014.
8 2018 Angola Algeria DZA 245. 42096736. 0.000583 2015-2018 FALSE 34746014.
9 2017 Angola Algeria DZA 2242. 34904881. 0.00642 2015-2018 FALSE 34746014.
10 2016 Angola Algeria DZA 2169. 28057500. 0.00773 2015-2018 FALSE 34746014.
# ... with 90 more rows
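A possible alternative to the `dup` column, sketched here under the assumption that each (ReporterName, Year) pair carries a single Total_Year value, as in the sample data. The tibble `toy` and the names `group_means` and `result` are hypothetical and stand in for total_trade after the cut() step. The sketch de-duplicates on Year rather than on the Total_Year value, so two different years that happen to share the same total would still be counted separately.

```r
library(dplyr)

# Toy stand-in for total_trade after the cut() step (values are made up).
toy <- tibble(
  ReporterName = "Angola",
  Year         = c(2015L, 2015L, 2016L, 2017L, 2017L, 2017L, 2018L),
  PartnerName  = c("A", "B", "A", "A", "B", "C", "A"),
  Total_Year   = c(10, 10, 20, 30, 30, 30, 40),
  year_group   = "2015-2018"
)

# Keep one row per reporter and year, then average the yearly totals per group.
group_means <- toy %>%
  distinct(ReporterName, year_group, Year, Total_Year) %>%
  group_by(ReporterName, year_group) %>%
  summarise(total_year_group = mean(Total_Year), .groups = "drop")

# Attach the group mean back to every partner row.
result <- toy %>%
  left_join(group_means, by = c("ReporterName", "year_group"))

print(result)
```

Because the sample data from dput() is already a grouped tibble, calling ungroup() before distinct() may be needed there so that the old grouping columns do not interfere.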
Problem 2
Use complete from tidyr. If you can show the desired output, I may be able to tell you how to do it.
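As a rough sketch of what complete could look like here (the tibble `yearly`, the reporter "Other" and all values are invented for illustration, and the desired output is not known from the question): complete() adds the missing reporter/year combinations as explicit rows with NA, and mean(..., na.rm = TRUE) then averages only the years that are actually available.

```r
library(dplyr)
library(tidyr)

# Toy yearly totals (made up): Angola has no row for 2008.
yearly <- tibble(
  ReporterName = c("Angola", "Angola", "Angola", "Other", "Other", "Other", "Other"),
  Year         = c(2007L, 2009L, 2010L, 2007L, 2008L, 2009L, 2010L),
  Total_Year   = c(10, 20, 30, 1, 2, 3, 4)
)

# Make the missing year explicit: Angola/2008 appears with Total_Year = NA.
completed <- yearly %>%
  complete(ReporterName, Year = 2007:2010)

# Average only the available years and record how many there were.
means <- completed %>%
  group_by(ReporterName) %>%
  summarise(avg     = mean(Total_Year, na.rm = TRUE),
            n_years = sum(!is.na(Total_Year)))

print(means)
```

For the mean itself, the absent row makes no difference: mean() simply averages the rows that exist. Making the gap explicit is mainly useful when the number of contributing years should be visible or checked.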
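【Comments】:
Dear @AnilGoyal, thank you very much! This looks really interesting. I couldn't find much information on !duplicated(). Do you have a reference where I can learn more? Is it predefined, or do I first need to define a negated version of duplicated myself? Something like `!dublicated` = Negate(´dublicate´). As for the desired final output, I will edit my post. Right now, after applying your suggestion, my final output looks even stranger, which probably means I have more mistakes in my code :-( Or would you suggest opening a new post?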
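It is not dublicated, it is duplicated. Search in R itself using ?duplicated. I suggest opening a new question with minimal data, because too many rows and columns complicate things unnecessarily.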
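For reference, a small base R illustration of the point above (the vector `x` is made up): duplicated() is a base R function that flags every repeat of an earlier element, and `!` is R's ordinary logical NOT operator, so nothing has to be defined beforehand.

```r
x <- c("a", "b", "a", "c", "b")

# TRUE for every element that has already appeared earlier in x.
is_repeat <- duplicated(x)

# Negating it marks the first occurrence of each value.
first_seen <- !is_repeat

# Keep only the first occurrences.
unique_values <- x[first_seen]

print(is_repeat)
print(first_seen)
print(unique_values)
```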
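Thank you very much! I think the problem in this question is solved. Your answer was really helpful. I have just opened a larger post describing the overall output I want and my problem. Hopefully the post does not contain too much information: ***.com/questions/67555981/…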