How do I calculate the right mean?

【Title】How do I calculate the right mean? 【Posted】2021-08-05 07:23:27 【Question】:

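I have a dataset showing bilateral exports for several countries. Because the data fluctuates, I need to calculate averages over groups of years. Not all countries cover the years exactly: some start later and some have gaps in between, meaning that some years are missing (but there are no NA entries). With the help of an amazing community member, I have already managed to cut the data into pieces: year_group.

Below I list two further problems together with my code, the wrong output and, at the bottom, some sample input data from the dataset total_trade.

Problem 1

The problem I'm facing is that the code does not calculate the right means. When I calculate the result by hand, I get a different result than my code does (see below).

Here is my code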

library(dplyr)

# create vectors for coding 4 years average
year_group_break <- c(1999, 2003, 2007, 2011, 2015, 2019)
year_group_labels <- c("1999-2002", "2003-2006", "2007-2010", "2011-2014", "2015-2018")
years <- c(1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)


FourY_av <- total_trade %>%
  # create year_group variable for average values with above predefined labels and cuts, 
  # chose right = FALSE to take cut before year_group_break
  mutate(year_group = cut(Year, breaks = year_group_break,
                          labels  = year_group_labels,
                          include.lowest = TRUE, right = FALSE)) %>%
  # add column with mean of total trade per four year period: "avg_year_group_total"
  group_by(ReporterName, year_group) %>%
  mutate(total_year_group = mean(Total_Year)) %>%
  arrange(ReporterName,PartnerName, desc(Year))
View(FourY_av)
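To see which years end up in which year_group, the cut() call can be tried on a handful of years. This is only a small sketch: the probe vector is made up for illustration, while the breaks and labels are the ones from the code above.

```r
# Same breaks and labels as in the question
year_group_break  <- c(1999, 2003, 2007, 2011, 2015, 2019)
year_group_labels <- c("1999-2002", "2003-2006", "2007-2010", "2011-2014", "2015-2018")

# A few probe years (hypothetical, chosen only for illustration)
probe <- c(1999, 2002, 2003, 2018, 2019)

# right = FALSE makes the intervals left-closed: [1999, 2003), [2003, 2007), ...
groups <- cut(probe, breaks = year_group_break,
              labels = year_group_labels,
              include.lowest = TRUE, right = FALSE)

data.frame(Year = probe, year_group = groups)
```

One thing worth knowing: with right = FALSE, include.lowest = TRUE closes the last interval on the right, so a 2019 row would be placed in the group labelled "2015-2018".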

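Below is the wrong output. It is wrong because total_year_group (the mean for Angola's year group '2015-2018') should be 34746013.5 (when calculated by hand) rather than 34907582 (as shown in the output). Where is my mistake?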

> head(FourY_av)
# A tibble: 6 x 9
# Groups:   ReporterName, year_group [1]
   Year ReporterName PartnerName PartnerISO3 `TradeValue in 1000 USD` Total_Year pct_by_partner_year year_group total_year_group
  <int> <chr>        <chr>       <chr>                          <dbl>      <dbl>               <dbl> <fct>                 <dbl>
1  2018 Angola       Afghanistan AFG                           19.4    42096736.          0.0000460  2015-2018         34907582.
2  2017 Angola       Afghanistan AFG                            2.25   34904881.          0.00000644 2015-2018         34907582.
3  2016 Angola       Afghanistan AFG                            0.775  28057500.          0.00000276 2015-2018         34907582.
4  2015 Angola       Afghanistan AFG                           39.6    33924937.          0.000117   2015-2018         34907582.
5  2018 Angola       Albania     ALB                            2.38   42096736.          0.00000565 2015-2018         34907582.
6  2017 Angola       Albania     ALB                           39.7    34904881.          0.000114   2015-2018         34907582.

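Problem 2

Another problem is that not all countries show data for all years. Some start late and some have gaps. To ensure comparability, I still need the means for the same year groups. The dataset has no NAs; the data is simply missing.

For example, Angola does not include 2008. The dataset contains no NA for it; the rows and values for Angola in 2008 just do not exist. Other countries do show data for 2008. I still need the mean of Angola's available years in the total_year_group column (the mean of 2007, 2009 and 2010). That shouldn't be a problem for the mean function, right? Or is there something special I need to consider in this case?

Here is some sample input data from total_trade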

dput(head(total_trade, n = 100))
structure(list(Year = c(2015L, 2018L, 2017L, 2016L, 2017L, 2015L, 
2018L, 2016L, 2015L, 2017L, 2018L, 2018L, 2017L, 2018L, 2018L, 
2015L, 2016L, 2017L, 2016L, 2015L, 2017L, 2018L, 2018L, 2017L, 
2016L, 2015L, 2018L, 2014L, 2015L, 2016L, 2017L, 2017L, 2018L, 
2016L, 2015L, 2016L, 2018L, 2017L, 2015L, 2010L, 2009L, 2016L, 
2013L, 2014L, 2018L, 2017L, 2015L, 2016L, 2017L, 2018L, 2017L, 
2018L, 2016L, 2016L, 2018L, 2007L, 2013L, 2009L, 2018L, 2015L, 
2016L, 2014L, 2010L, 2017L, 2012L, 2011L, 2018L, 2016L, 2015L, 
2016L, 2011L, 2018L, 2017L, 2015L, 2015L, 2016L, 2018L, 2017L, 
2015L, 2015L, 2016L, 2018L, 2017L, 2007L, 2014L, 2010L, 2013L, 
2011L, 2009L, 2012L, 2017L, 2018L, 2016L, 2015L, 2015L, 2015L, 
2017L, 2016L, 2018L, 2015L), ReporterName = c("Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola"
), PartnerName = c("Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Albania", "Albania", "Albania", "Algeria", "Algeria", 
"Algeria", "Algeria", "American Samoa", "Andorra", "Andorra", 
"Antigua and Barbuda", "Antigua and Barbuda", "Antigua and Barbuda", 
"Antigua and Barbuda", "Argentina", "Argentina", "Argentina", 
"Argentina", "Armenia", "Armenia", "Armenia", "Armenia", "Australia", 
"Australia", "Australia", "Australia", "Australia", "Austria", 
"Austria", "Austria", "Austria", "Azerbaijan", "Azerbaijan", 
"Azerbaijan", "Azerbaijan", "Bahamas, The", "Bahamas, The", "Bahamas, The", 
"Bahamas, The", "Bahamas, The", "Bahamas, The", "Bahamas, The", 
"Bahamas, The", "Bahrain", "Bahrain", "Bahrain", "Bangladesh", 
"Bangladesh", "Bangladesh", "Barbados", "Belarus", "Belgium", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belize", "Belize", 
"Belize", "Benin", "Benin", "Benin", "Benin", "Benin", "Bhutan", 
"Bolivia", "Bolivia", "Bolivia", "Bolivia", "Botswana", "Botswana", 
"Botswana", "Botswana", "Brazil", "Brazil", "Brazil", "Brazil", 
"Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", 
"British Virgin Islands", "Brunei", "Bulgaria", "Bulgaria", "Bulgaria", 
"Bulgaria"), PartnerISO3 = c("AFG", "AFG", "AFG", "AFG", "ALB", 
"ALB", "ALB", "DZA", "DZA", "DZA", "DZA", "ASM", "AND", "AND", 
"ATG", "ATG", "ATG", "ATG", "ARG", "ARG", "ARG", "ARG", "ARM", 
"ARM", "ARM", "ARM", "AUS", "AUS", "AUS", "AUS", "AUS", "AUT", 
"AUT", "AUT", "AUT", "AZE", "AZE", "AZE", "AZE", "BHS", "BHS", 
"BHS", "BHS", "BHS", "BHS", "BHS", "BHS", "BHR", "BHR", "BHR", 
"BGD", "BGD", "BGD", "BRB", "BLR", "BEL", "BEL", "BEL", "BEL", 
"BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BLZ", "BLZ", 
"BLZ", "BEN", "BEN", "BEN", "BEN", "BEN", "BTN", "BOL", "BOL", 
"BOL", "BOL", "BWA", "BWA", "BWA", "BWA", "BRA", "BRA", "BRA", 
"BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "VGB", 
"BRN", "BGR", "BGR", "BGR", "BGR"), `TradeValue in 1000 USD` = c(39.586, 
19.353, 2.248, 0.775, 39.723, 2.259, 2.38, 2169.123, 2322.463, 
2241.599, 245.226, 12.007, 5.975, 0.326, 422.006, 155.467, 47.018, 
54.774, 483.147, 142.23, 98.7, 61.362, 60.105, 30.494, 0.99, 
0.731, 40220.092, 45435.804, 16096.404, 8546.882, 1904.301, 627.179, 
433.699, 23.118, 5.124, 985.67, 600.371, 143.356, 9.926, 140139.415, 
108214.936, 64444.203, 100210.999, 52974.059, 7322.893, 145.791, 
26.995, 4.847, 5.187, 1.958, 125.722, 55.22, 2.75, 3.366, 54.31, 
107976.895, 123610.469, 66757.2, 67763.201, 50046.64, 40199.706, 
52383.95, 45614.873, 28690.458, 52907.343, 39328.574, 452.078, 
5.82, 0.32, 970.324, 1700.981, 804.478, 332.216, 69.342, 1.632, 
1530.58, 308.752, 62.569, 19.822, 55.241, 37.029, 16.917, 0.198, 
874217.786, 1032751.313, 509259.955, 428750.075, 333280.441, 
192964.08, 315316.932, 119947.132, 141486.749, 66556.728, 1273.093, 
5.064, 22.324, 158.252, 33.583, 8.435, 0.077), Total_Year = c(33924937.48, 
42096736.31, 34904881.111, 28057499.527, 34904881.111, 33924937.48, 
42096736.31, 28057499.527, 33924937.48, 34904881.111, 42096736.31, 
42096736.31, 34904881.111, 42096736.31, 42096736.31, 33924937.48, 
28057499.527, 34904881.111, 28057499.527, 33924937.48, 34904881.111, 
42096736.31, 42096736.31, 34904881.111, 28057499.527, 33924937.48, 
42096736.31, 58672369.19, 33924937.48, 28057499.527, 34904881.111, 
34904881.111, 42096736.31, 28057499.527, 33924937.48, 28057499.527, 
42096736.31, 34904881.111, 33924937.48, 52612114.76, 40639411.73, 
28057499.527, 67712526.544, 58672369.19, 42096736.31, 34904881.111, 
33924937.48, 28057499.527, 34904881.111, 42096736.31, 34904881.111, 
42096736.31, 28057499.527, 28057499.527, 42096736.31, 44177783.072, 
67712526.544, 40639411.73, 42096736.31, 33924937.48, 28057499.527, 
58672369.19, 52612114.76, 34904881.111, 70863076.416, 66427390.221, 
42096736.31, 28057499.527, 33924937.48, 28057499.527, 66427390.221, 
42096736.31, 34904881.111, 33924937.48, 33924937.48, 28057499.527, 
42096736.31, 34904881.111, 33924937.48, 33924937.48, 28057499.527, 
42096736.31, 34904881.111, 44177783.072, 58672369.19, 52612114.76, 
67712526.544, 66427390.221, 40639411.73, 70863076.416, 34904881.111, 
42096736.31, 28057499.527, 33924937.48, 33924937.48, 33924937.48, 
34904881.111, 28057499.527, 42096736.31, 33924937.48), pct_by_partner_year = c(0.000116687024179005, 
4.59726850497024e-05, 6.44035999679013e-06, 2.7621848456389e-06, 
0.000113803567683494, 6.65881846158674e-06, 5.65364493454718e-06, 
0.0077309918437765, 0.00684588733986371, 0.00642202158738646, 
0.000582529719629944, 2.8522401146684e-05, 1.71179497245645e-05, 
7.74406827169068e-07, 0.00100246726228929, 0.000458267609458834, 
0.000167577299448064, 0.000156923611416451, 0.00172198880208503, 
0.000419249114560196, 0.000282768474948037, 0.00014576426910659, 
0.000142778289407966, 8.73631395649993e-05, 3.5284683834613e-06, 
2.15475710288619e-06, 0.0955420669759755, 0.0774398658640565, 
0.0474471147057807, 0.0304620231456308, 0.00545568682484317, 
0.00179682319502973, 0.00103024376238159, 8.23950829180388e-05, 
1.51039335091503e-05, 0.00351303578942051, 0.00142616994243657, 
0.000410704736521284, 2.92587127267419e-05, 0.2663633948935, 
0.266280763902189, 0.229686194730164, 0.147994771595005, 0.0902879153020956, 
0.0173953936620509, 0.000417680838208199, 7.95727332317548e-05, 
1.72752386410474e-05, 1.48603858110989e-05, 4.65119192514428e-06, 
0.000360184581635431, 0.000131174064405754, 9.80130106517029e-06, 
1.19967925037684e-05, 0.000129012376636663, 0.244414471464133, 
0.18255184868885, 0.164267141570654, 0.160970200874938, 0.147521686751831, 
0.143276153177212, 0.0892821454514031, 0.0867003221749986, 0.0821961201035531, 
0.0746613690455784, 0.0592053577133712, 0.00107390272887404, 
2.07431171633786e-05, 9.43258923288073e-07, 0.00345834096536738, 
0.0025606620918584, 0.00191102225615742, 0.000951775194258733, 
0.000204398313308255, 4.81062050876917e-06, 0.00545515468521031, 
0.000733434529761055, 0.000179255731601051, 5.84289949294256e-05, 
0.000162833019316739, 0.000131975409869888, 4.01860131755188e-05, 
5.6725590719059e-07, 1.97886296054109, 1.76020046106476, 0.967951882039117, 
0.633191666126054, 0.50172141324715, 0.474820062066878, 0.444966473299775, 
0.34363999584631, 0.336099093188823, 0.237215465105691, 0.00375267603882995, 
1.49270724610338e-05, 6.58041006358842e-05, 0.000453380716286491, 
0.00011969348860786, 2.00371827827334e-05, 2.26971678416193e-07
)), row.names = c(NA, -100L), groups = structure(list(Year = c(2007L, 
2007L, 2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 2011L, 2011L, 
2011L, 2012L, 2012L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 
2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L, 2017L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2018L, 
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L
), ReporterName = c("Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Angola", "Angola"), PartnerName = c("Belgium", 
"Brazil", "Bahamas, The", "Belgium", "Brazil", "Bahamas, The", 
"Belgium", "Brazil", "Belgium", "Benin", "Brazil", "Belgium", 
"Brazil", "Bahamas, The", "Belgium", "Brazil", "Australia", "Bahamas, The", 
"Belgium", "Brazil", "Afghanistan", "Albania", "Algeria", "Antigua and Barbuda", 
"Argentina", "Armenia", "Australia", "Austria", "Azerbaijan", 
"Bahamas, The", "Belgium", "Belize", "Benin", "Bhutan", "Bolivia", 
"Botswana", "Brazil", "British Virgin Islands", "Brunei", "Bulgaria", 
"Afghanistan", "Algeria", "Antigua and Barbuda", "Argentina", 
"Armenia", "Australia", "Austria", "Azerbaijan", "Bahamas, The", 
"Bahrain", "Bangladesh", "Barbados", "Belgium", "Belize", "Benin", 
"Bolivia", "Botswana", "Brazil", "Bulgaria", "Afghanistan", "Albania", 
"Algeria", "Andorra", "Antigua and Barbuda", "Argentina", "Armenia", 
"Australia", "Austria", "Azerbaijan", "Bahamas, The", "Bahrain", 
"Bangladesh", "Belgium", "Benin", "Bolivia", "Botswana", "Brazil", 
"Bulgaria", "Afghanistan", "Albania", "Algeria", "American Samoa", 
"Andorra", "Antigua and Barbuda", "Argentina", "Armenia", "Australia", 
"Austria", "Azerbaijan", "Bahamas, The", "Bahrain", "Bangladesh", 
"Belarus", "Belgium", "Belize", "Benin", "Bolivia", "Botswana", 
"Brazil", "Bulgaria"), .rows = structure(list(56L, 84L, 41L, 
    58L, 89L, 40L, 63L, 86L, 66L, 71L, 88L, 65L, 90L, 43L, 57L, 
    87L, 28L, 44L, 62L, 85L, 1L, 6L, 9L, 16L, 20L, 26L, 29L, 
    35L, 39L, 47L, 60L, 69L, 74L, 75L, 79L, 80L, 94L, 95L, 96L, 
    100L, 4L, 8L, 17L, 19L, 25L, 30L, 34L, 36L, 42L, 48L, 53L, 
    54L, 61L, 68L, 70L, 76L, 81L, 93L, 98L, 3L, 5L, 10L, 13L, 
    18L, 21L, 24L, 31L, 32L, 38L, 46L, 49L, 51L, 64L, 73L, 78L, 
    83L, 91L, 97L, 2L, 7L, 11L, 12L, 14L, 15L, 22L, 23L, 27L, 
    33L, 37L, 45L, 50L, 52L, 55L, 59L, 67L, 72L, 77L, 82L, 92L, 
    99L), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", 
"list"))), row.names = c(NA, 100L), class = c("tbl_df", "tbl", 
"data.frame"), .drop = TRUE), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"))

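【Comments】:

You didn't include year_group_break, so the code throws an error.

Thanks! I just updated the post.

【Answer 1】:

The problem with mean is that the rows for any ReporterName are duplicated in the data: Total_Year is repeated on every partner row of a year, so years with more partner rows carry more weight in the mean.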
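A tiny sketch of that effect, using made-up numbers rather than the real total_trade values: the yearly total appears once per partner row, so a year with three partners counts three times.

```r
# Toy data (made up for illustration): Total_Year is repeated once per partner row
toy <- data.frame(
  Year       = c(2015, 2015, 2015, 2016),
  Total_Year = c(100, 100, 100, 200)
)

# Mean over all rows: 2015 is counted three times, 2016 once
mean_all_rows <- mean(toy$Total_Year)        # (100 + 100 + 100 + 200) / 4 = 125

# Mean over one value per year
per_year      <- unique(toy[, c("Year", "Total_Year")])
mean_per_year <- mean(per_year$Total_Year)   # (100 + 200) / 2 = 150
```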

Problem 1

total_trade %>%
  # create year_group variable for average values with above predefined labels and cuts, 
  # chose right = FALSE to take cut before year_group_break
  mutate(year_group = cut(Year, breaks = year_group_break,
                          labels  = year_group_labels,
                          include.lowest = TRUE, right = FALSE)) %>%
  # add column with mean of total trade per four year period: "avg_year_group_total"
  group_by(ReporterName, year_group) %>%
  mutate(dup = !duplicated(paste0(ReporterName, year_group, Total_Year)),
         total_year_group = sum(Total_Year * dup)/sum(dup)) %>%
  arrange(ReporterName,PartnerName, desc(Year))

# A tibble: 100 x 10
# Groups:   ReporterName, year_group [3]
    Year ReporterName PartnerName PartnerISO3 `TradeValue in 1000 USD` Total_Year pct_by_partner_year year_group dup   total_year_group
   <int> <chr>        <chr>       <chr>                          <dbl>      <dbl>               <dbl> <fct>      <lgl>            <dbl>
 1  2018 Angola       Afghanistan AFG                           19.4    42096736.          0.0000460  2015-2018  TRUE         34746014.
 2  2017 Angola       Afghanistan AFG                            2.25   34904881.          0.00000644 2015-2018  TRUE         34746014.
 3  2016 Angola       Afghanistan AFG                            0.775  28057500.          0.00000276 2015-2018  TRUE         34746014.
 4  2015 Angola       Afghanistan AFG                           39.6    33924937.          0.000117   2015-2018  TRUE         34746014.
 5  2018 Angola       Albania     ALB                            2.38   42096736.          0.00000565 2015-2018  FALSE        34746014.
 6  2017 Angola       Albania     ALB                           39.7    34904881.          0.000114   2015-2018  FALSE        34746014.
 7  2015 Angola       Albania     ALB                            2.26   33924937.          0.00000666 2015-2018  FALSE        34746014.
 8  2018 Angola       Algeria     DZA                          245.     42096736.          0.000583   2015-2018  FALSE        34746014.
 9  2017 Angola       Algeria     DZA                         2242.     34904881.          0.00642    2015-2018  FALSE        34746014.
10  2016 Angola       Algeria     DZA                         2169.     28057500.          0.00773    2015-2018  FALSE        34746014.
# ... with 90 more rows
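An alternative way to get the same kind of result, shown only as a sketch and not as the method used above, is to reduce the data to one row per reporter and year first, take the mean of that, and join it back. The toy_trade table and its values are made up for illustration. Unlike the paste0() approach, this deduplicates by Year, so two different years that happen to have the same Total_Year would both be counted.

```r
library(dplyr)

# Toy stand-in for total_trade (values made up for illustration)
toy_trade <- tibble(
  Year         = c(2015, 2015, 2015, 2016, 2017, 2017),
  ReporterName = "Angola",
  PartnerName  = c("A", "B", "C", "A", "A", "B"),
  Total_Year   = c(100, 100, 100, 200, 300, 300),
  year_group   = "2015-2018"
)

# One row per reporter and year, then the mean per year group
group_means <- toy_trade %>%
  distinct(ReporterName, year_group, Year, Total_Year) %>%
  group_by(ReporterName, year_group) %>%
  summarise(total_year_group = mean(Total_Year), .groups = "drop")

# Attach the group mean back to every partner row
result <- toy_trade %>%
  left_join(group_means, by = c("ReporterName", "year_group"))
```

Here total_year_group is (100 + 200 + 300) / 3 = 200 on every row, whereas a plain mean(Total_Year) over the six rows would give about 183.3.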

Problem 2

Use complete from tidyr. If you can show the desired output, I may be able to tell you how to do it.
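Since the desired output is not shown, the following is only a guess at what that could look like. It assumes a table with one row per reporter and year; the yearly table and its values are made up. complete() adds the missing reporter/year combinations as explicit NA rows, and mean() with na.rm = TRUE then averages only the years that are present.

```r
library(dplyr)
library(tidyr)

# Toy yearly totals (made up for illustration): Angola has no row for 2008
yearly <- tibble(
  ReporterName = c("Angola", "Angola", "Angola", "Benin", "Benin", "Benin", "Benin"),
  Year         = c(2007L, 2009L, 2010L, 2007L, 2008L, 2009L, 2010L),
  Total_Year   = c(10, 20, 30, 1, 2, 3, 4)
)

# Make the gap explicit: Angola/2008 becomes a row with Total_Year = NA
completed <- yearly %>%
  complete(ReporterName, Year = 2007L:2010L)

# Average over the available years only
averages <- completed %>%
  group_by(ReporterName) %>%
  summarise(avg = mean(Total_Year, na.rm = TRUE), .groups = "drop")
```

If the explicit NA rows are not needed, mean() over the rows that exist already gives the average of the available years (here 20 for Angola), so complete() is mainly useful when every reporter should show every year.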

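【Comments】:

Dear @AnilGoyal, thank you very much! This looks really interesting. I couldn't find much information on !duplicated(). Do you have a reference where I can learn more? Is it predefined, or is it a negated version of duplicated that I need to define beforehand, like `!dublicated` = negate(´dublicate´)? As for the desired final output, I will edit my post. Right now, after applying your suggestion, my final output looks even stranger, which probably means I have more errors in my code :-( Or would you suggest opening a new post?

It's duplicated, not dublicated. Look it up with ?duplicated in R itself. I'd suggest opening a new question with minimal data, because too many rows and columns complicate things unnecessarily.

Thank you very much! I think the problem in this question is solved. Your answer was really helpful. I've just opened a larger post describing my overall desired output and my problem. Hopefully the post doesn't contain too much information ***.com/questions/67555981/…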
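For anyone else wondering about !duplicated(): nothing needs to be defined. duplicated() is a base R function, and ! is R's logical NOT operator applied to its result. A small sketch with a made-up vector:

```r
# Made-up vector for illustration
x <- c("a", "b", "a", "c", "b")

# TRUE marks an element that already appeared earlier in the vector
is_dup <- duplicated(x)     # FALSE FALSE  TRUE FALSE  TRUE

# Negating it marks the first occurrence of each value
is_first <- !duplicated(x)  #  TRUE  TRUE FALSE  TRUE FALSE

# Keep only the first occurrences
first_only <- x[is_first]   # "a" "b" "c"
```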