Select the top N values by group

Posted 2023-02-16

【Title】Select the top N values by group 【Posted】2013-01-25 20:20:29 【Question】:

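This is in response to a question asked on the r-help mailing list.

Here are lots of examples of how to find top values by group using sql, so I imagine it is easy to carry that knowledge over using the R sqldf package.

An example: when mtcars is grouped by cyl, here are the top three records for each distinct value of cyl. Note that ties are excluded in this case, but it would be nice to show some different ways to treat ties.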

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb ranks
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1   2.0
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2   1.0
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   2.0
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   3.0
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4   1.0
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4   1.5
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4   1.5
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4   3.0

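How do you find the top or bottom (maximum or minimum) N records per group?

【Comments】:

This question can help if you need to select a different k records per group: ***.com/q/33988831/1840471

【Answer 1】: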

# start with the mtcars data frame (included with your installation of R)
mtcars

# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )

# choose whether you want to find the minimum or maximum
find.maximum <- FALSE

# work on a copy of the full data frame
x <- mtcars

# order it based on the 'group by' variable
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]

# figure out the ranks of each miles-per-gallon, within cyl groups
if ( find.maximum ) {
    # note the negative sign (which changes the order of mpg)
    # *and* the `rev` function, which flips the order of the `tapply` result
    x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
    x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}

# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result

# done!

# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]

# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties.  using `max` would *exclude* all ties
if ( find.maximum ) {
    # note the negative sign (which changes the order of mpg)
    # *and* the `rev` function, which flips the order of the `tapply` result
    x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
    x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}

# and there are even more options..
# see ?rank for more methods

# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.
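
The same idea can be condensed considerably. The following is only a minimal sketch in base R, assuming you want the lowest `n` mpg values per cyl and want to keep ties; the names `n`, `within_rank` and `lowest_n` are made up for illustration, and `ave` is swapped in for the `tapply`/`unlist` combination used above.

```r
n <- 3

# rank mpg within each cyl group; ave() returns the ranks in the original
# row order, so no sorting is needed beforehand.
# ties.method = "min" gives tied rows the same (lowest) rank, so ties are kept
within_rank <- ave(mtcars$mpg, mtcars$cyl,
                   FUN = function(v) rank(v, ties.method = "min"))

# keep only the rows ranked n or better within their group
lowest_n <- mtcars[within_rank <= n, ]

# show the result ordered by group and value
lowest_n[order(lowest_n$cyl, lowest_n$mpg), c("mpg", "cyl")]
```

Because `ave` keeps the row order of the input, this version avoids the `rev` bookkeeping; for the highest values, rank `-mtcars$mpg` instead.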

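【Comments】:

@Arun ..as opposed to? :) ps thanx for your awesome answer

Too complicated for such a simple task!

@Arun I was the downvoter, because it just looks too complicated, as I complained in the comment above. Maybe I was a little grumpy after spending a few hours shoveling my driveway...

haha @Ista kinda unfair :P I write a lot of comments for newbies, but really, once you get rid of all the contingencies and comments, it is only three lines of code..

OK, point taken. Sorry about the downvote. I don't think there's an undo button...

【Answer 2】: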

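Using data.table seems simpler, since it performs the sort while setting the key.

So, if I were to get the top 3 records in sorted (ascending) order, then,

require(data.table)
d <- data.table(mtcars, key="cyl")
d[, head(.SD, 3), by=cyl]

does it.

If you want descending order:

d[, tail(.SD, 3), by=cyl] # Thanks @MatthewDowle

Edit: to sort out ties using the mpg column: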

d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]

#     cyl  mpg  disp  hp drat    wt  qsec vs am gear carb rank
#  1:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1   11
#  2:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2    1
#  3:   4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1    8
#  4:   4 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2    6
#  5:   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1    7
#  6:   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4    1
#  7:   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4    2
#  8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4    7
#  9:   8 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4   14
# 10:   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4    5
# 11:   8 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4    3

# and for last N elements, of course it is straightforward
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]

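【Comments】:

Hi. I'm not following what head(seq(.I)) inside .SD[...] is doing. Why not head(.SD,3)? Or d[,.SD[head(order(mpg))],by=cyl]. The key of d is one column (cyl); was the intention to include mpg in the key?

@MatthewDowle, :) the intention was your first suggestion, head(.SD, 3). It didn't occur to me to use head directly! I'll edit it.

Ok great, +1. It's rare that I find something to comment on these days!

@Arun I tried this, but it didn't work. I want to extract the top 3 rows from my data table, but it extracts more and they are not sorted. Please see my problem

@Arun, if you want to sort by mpg, this also works: d <- data.table(mtcars, key=c("cyl","mpg")) d[, head(.SD, 3), by=cyl]

【Answer 3】: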

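Just sort by whatever you like (mpg, for example; the question is not clear on this)

mt <- mtcars[order(mtcars$mpg), ]

then use the by function to get the first n rows in each group

d <- by(mt, mt["cyl"], head, n=4)

If you want the result to be a data.frame:

Reduce(rbind, d)

Edit: Handling ties is more difficult, but if all ties are desired:

by(mt, mt["cyl"], function(x) x[rank(x$mpg) %in% sort(unique(rank(x$mpg)))[1:4], ])

Another approach is to break ties based on some other information, e.g.,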

mt <- mtcars[order(mtcars$mpg, mtcars$hp), ]
by(mt, mt["cyl"], head, n=4)
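
The sort-then-take-the-head approach also extends to more than one grouping column. The following is a rough sketch rather than part of the original answer: it uses `split` instead of `by`, groups by cyl and am (chosen only because mtcars has no colour column), and the names `groups` and `lowest2` are made up for illustration.

```r
# sort once by the ranking variable
mt <- mtcars[order(mtcars$mpg), ]

# one data frame per observed (cyl, am) combination;
# drop = TRUE discards combinations that do not occur in the data
groups <- split(mt, list(mt$cyl, mt$am), drop = TRUE)

# first 2 rows (lowest mpg) of each group, stacked into one data frame
lowest2 <- do.call(rbind, lapply(groups, head, n = 2))
lowest2[, c("mpg", "cyl", "am")]
```

Groups with fewer than 2 rows would simply contribute the rows they have.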

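【Comments】:

@Arun Um, what? There is also a tie when cyl == 8... which the data.table solution seems to ignore. Using by we can keep both matches in both cases by(mtcars, mtcars["cyl"], function(x) x[rank(x$mpg)

Couldn't you save steps with x[ x$mpg < sort( x$mpg )[4]?

How does this solution work if we need to base it on multiple columns? For example, we want the top by cyl and color (assuming there were a color column).. tried a bunch of things but none seem to work.. Thanks!

@Jeff The question in your comment is not clear to me. Consider creating a new question where you can provide the details needed to understand and answer it.

【Answer 4】:

If there is a tie at the fourth position for mtcars$mpg, then this should return all the ties: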

top_mpg <- mtcars[ mtcars$mpg >= mtcars$mpg[order(mtcars$mpg, decreasing=TRUE)][4] , ]

> top_mpg
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

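Since there is a tie at positions 3-4, you can test this by changing the 4 to a 3, and it still returns 4 items. This is logical indexing, so you may need to add a clause that removes NAs or wrap which() around the logical expression. It is not much harder to do this "by" cyl: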

 Reduce(rbind,  by(mtcars, mtcars$cyl, 
        function(d) d[ d$mpg >= d$mpg[order(d$mpg, decreasing=TRUE)][4] , ]) )
#-------------
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128          32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic       30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa      30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

Incorporating my suggestion to @Ista:

Reduce(rbind,  by(mtcars, mtcars$cyl, function(d) d[ d$mpg <= sort( d$mpg )[3] , ]) )
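
The earlier remark about NAs and which() can be made concrete. The following is only a sketch on a modified copy of mtcars: the missing value is introduced artificially, and the names `cars`, `cutoff`, `plain` and `clean` are made up for illustration.

```r
cars <- mtcars
cars$mpg[1] <- NA   # hypothetical missing value, added only for this demo

# sort() drops NA by default, so the cutoff is still the 4th-highest mpg
cutoff <- sort(cars$mpg, decreasing = TRUE)[4]

# plain logical indexing: the NA comparison produces an extra all-NA row
plain <- cars[cars$mpg >= cutoff, ]

# which() keeps only the positions that are TRUE, so the NA row disappears
clean <- cars[which(cars$mpg >= cutoff), ]
```

Here `plain` has one more row than `clean`, and that extra row consists entirely of NA.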

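【Comments】:

I'm not sure what you mean by it not doing that unless you know beforehand. It returns all rows whose mpg value is equal to or higher than the fourth-highest value. Likewise, if you pick the third-highest as the target, you still get 4 items in the four-cylinder category. I thought that was one of Anthony's goals.

As I understand the requested task, this is one of the correct answers for handling ties.

Ah, then we do understand the task differently. You want mtcars$mpg %in% sort( unique(mtcars$mpg))[1:3].

【Answer 5】:

You could write a function that splits the data by one factor, orders by another desired variable, extracts the number of rows you want within each factor level (category), and combines them into a single data frame.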

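top <- function(x, num, c1, c2) {
  # sort by the grouping column, then by the ranking column, both descending
  sorted <- x[order(x[, c1], x[, c2], decreasing = TRUE), ]
  # split into one data frame per group
  splits <- split(sorted, sorted[, c1])
  # keep the first `num` rows of each group
  df <- lapply(splits, head, num)
  # stack the pieces back into a single data frame
  do.call(rbind.data.frame, df)
}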

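x is the data frame;

num is the number of rows you'd like to see;

c1 is the column number of the variable you'd like to split by;

c2 is the column number of the variable you'd like to rank by or use to handle ties.

Using the mtcars data, the function extracts the 3 heaviest cars (mtcars$wt is the 6th column) in each cylinder class (mtcars$cyl is the 2nd column)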

 top(mtcars,3,2,6)
                         mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 4.Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
 4.Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
 4.Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
 6.Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
 6.Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
 6.Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
 8.Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
 8.Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
 8.Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4

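You can also easily get the lightest cars in each class, either by changing head to tail in the lapply call or by removing the decreasing=T argument from order, which returns it to its default, decreasing=F.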
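
A variant that takes column names rather than column numbers, and exposes the sort direction as an argument, might look like the following. This is a sketch under assumptions, not the answer's own function: `top_by_name`, `group_col`, `rank_col` and `lightest` are hypothetical names.

```r
top_by_name <- function(x, num, group_col, rank_col, decreasing = TRUE) {
  # sort by the ranking column only; split() below does the grouping
  sorted <- x[order(x[[rank_col]], decreasing = decreasing), ]
  # first `num` rows of each group, in sorted order
  pieces <- lapply(split(sorted, sorted[[group_col]]), head, num)
  # stack the per-group pieces into one data frame
  do.call(rbind, pieces)
}

# the 3 lightest cars in each cylinder class
lightest <- top_by_name(mtcars, 3, "cyl", "wt", decreasing = FALSE)
lightest[, c("cyl", "wt")]
```

Referring to columns by name keeps the call readable and does not break if the column order of the data frame changes.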

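【Comments】:

【Answer 6】:

I prefer @Ista's solution, because it needs no extra package and is simple. A modification of the data.table solution also solves my problem, and is more general. My data.frame is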

> str(df)
'data.frame':   579 obs. of  11 variables:
 $ trees     : num  2000 5000 1000 2000 1000 1000 2000 5000 5000 1000 ...
 $ interDepth: num  2 3 5 2 3 4 4 2 3 5 ...
 $ minObs    : num  6 4 1 4 10 6 10 10 6 6 ...
 $ shrinkage : num  0.01 0.001 0.01 0.005 0.01 0.01 0.001 0.005 0.005 0.001     ...
 $ G1        : num  0 2 2 2 2 2 8 8 8 8 ...
 $ G2        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ qx        : num  0.44 0.43 0.419 0.439 0.43 ...
 $ efet      : num  43.1 40.6 39.9 39.2 38.6 ...
 $ prec      : num  0.606 0.593 0.587 0.582 0.574 0.578 0.576 0.579 0.588 0.585 ...
 $ sens      : num  0.575 0.57 0.573 0.575 0.587 0.574 0.576 0.566 0.542 0.545 ...
 $ acu       : num  0.631 0.645 0.647 0.648 0.655 0.647 0.619 0.611 0.591 0.594 ...

The data.table solution needs an order on i to do the job:

> require(data.table)
> dt1 <- data.table(df)
> dt2 = dt1[order(-efet, G1, G2), head(.SD, 3), by = .(G1, G2)]
> dt2
    G1    G2 trees interDepth minObs shrinkage        qx   efet  prec  sens   acu
 1:  0 FALSE  2000          2      6     0.010 0.4395953 43.066 0.606 0.575 0.631
 2:  0 FALSE  2000          5      1     0.005 0.4294718 37.554 0.583 0.548 0.607
 3:  0 FALSE  5000          2      6     0.005 0.4395753 36.981 0.575 0.559 0.616
 4:  2 FALSE  5000          3      4     0.001 0.4296346 40.624 0.593 0.570 0.645
 5:  2 FALSE  1000          5      1     0.010 0.4186802 39.915 0.587 0.573 0.647
 6:  2 FALSE  2000          2      4     0.005 0.4390503 39.164 0.582 0.575 0.648
 7:  8 FALSE  2000          4     10     0.001 0.4511349 38.240 0.576 0.576 0.619
 8:  8 FALSE  5000          2     10     0.005 0.4469665 38.064 0.579 0.566 0.611
 9:  8 FALSE  5000          3      6     0.005 0.4426952 37.888 0.588 0.542 0.591
10:  2  TRUE  5000          3      4     0.001 0.3812878 21.057 0.510 0.479 0.615
11:  2  TRUE  2000          3     10     0.005 0.3790536 20.127 0.507 0.470 0.608
12:  2  TRUE  1000          5      4     0.001 0.3690911 18.981 0.500 0.475 0.611
13:  8  TRUE  5000          6     10     0.010 0.2865042 16.870 0.497 0.435 0.635
14:  0  TRUE  2000          6      4     0.010 0.3192862  9.779 0.460 0.433 0.621

For some reason, the result is not ordered the way specified (probably because of the ordering by groups). So another ordering is done.

> dt2[order(G1, G2)]
    G1    G2 trees interDepth minObs shrinkage        qx   efet  prec  sens   acu
 1:  0 FALSE  2000          2      6     0.010 0.4395953 43.066 0.606 0.575 0.631
 2:  0 FALSE  2000          5      1     0.005 0.4294718 37.554 0.583 0.548 0.607
 3:  0 FALSE  5000          2      6     0.005 0.4395753 36.981 0.575 0.559 0.616
 4:  0  TRUE  2000          6      4     0.010 0.3192862  9.779 0.460 0.433 0.621
 5:  2 FALSE  5000          3      4     0.001 0.4296346 40.624 0.593 0.570 0.645
 6:  2 FALSE  1000          5      1     0.010 0.4186802 39.915 0.587 0.573 0.647
 7:  2 FALSE  2000          2      4     0.005 0.4390503 39.164 0.582 0.575 0.648
 8:  2  TRUE  5000          3      4     0.001 0.3812878 21.057 0.510 0.479 0.615
 9:  2  TRUE  2000          3     10     0.005 0.3790536 20.127 0.507 0.470 0.608
10:  2  TRUE  1000          5      4     0.001 0.3690911 18.981 0.500 0.475 0.611
11:  8 FALSE  2000          4     10     0.001 0.4511349 38.240 0.576 0.576 0.619
12:  8 FALSE  5000          2     10     0.005 0.4469665 38.064 0.579 0.566 0.611
13:  8 FALSE  5000          3      6     0.005 0.4426952 37.888 0.588 0.542 0.591
14:  8  TRUE  5000          6     10     0.010 0.2865042 16.870 0.497 0.435 0.635

【Comments】:

【Answer 7】:

dplyr does the trick

library(dplyr)

mtcars %>%
  arrange(desc(mpg)) %>%
  group_by(cyl) %>%
  slice(1:2)


 mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
2  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
3  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
5  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
6  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2

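【Comments】:

If the user is interested in a result similar to SQL, then this dplyr result is the way to go

Hey, Azam, are you still answering follow-up questions here? I used this answer for something

【Answer 8】:

There are at least 4 ways to do this; however, each has some differences. We use u_id to group and the lift value to order/sort.

1 The traditional dplyr way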

library(dplyr)
top10_final_subset1 = final_subset %>% arrange(desc(lift)) %>% group_by(u_id) %>% slice(1:10)

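If you switch the order of arrange(desc(lift)) and group_by(u_id), the result is essentially the same. If there are ties in the lift value, it slices so that no group has more than 10 values; if a group has only 5 lift values, it gives you only 5 results for that group.

2 The dplyr top_n way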

library(dplyr)
top10_final_subset2 = final_subset %>% group_by(u_id) %>% top_n(10,lift)

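If you have ties in the lift value, say 15 identical lift values for the same u_id, you will get all 15 observations.

3 The data.table tail way

library(data.table)
final_subset = data.table(final_subset, key = "lift")
top10_final_subset3 = final_subset[, tail(.SD, 10), by = "u_id"]

It has the same number of rows as the first way, but some rows are different; I guess they use a different algorithm to handle ties.

4 The data.table .SD way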

library(data.table)
top10_final_subset4 = final_subset[,.SD[order(lift,decreasing = TRUE),][1:10],by = "u_id"]

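This way is the most "uniform": if a group has only 5 observations, it still pads that group out to 10 rows (the u_id value is repeated and, as far as I can tell, the remaining columns are filled with NA), and if there are ties it still slices and keeps only 10 observations.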
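
The difference between taking the head of a group and indexing it with 1:10 can be seen on a tiny example. This is a base R analogue on made-up data, not the dplyr or data.table code itself; `small`, `ordered`, `capped` and `padded` are hypothetical names.

```r
# a single group with only 5 observations (made-up values)
small <- data.frame(u_id = "a", lift = c(5, 3, 4, 1, 2))
ordered <- small[order(small$lift, decreasing = TRUE), ]

# head() stops at the rows that exist
capped <- head(ordered, 10)

# indexing past the end pads the result with NA rows
padded <- ordered[1:10, ]
```

`capped` keeps the 5 available rows, while `padded` always has 10 rows, the last 5 of which are NA.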

【Comments】:

【Answer 9】:

As of dplyr 1.0.0, the slice_max()/slice_min() functions are available:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 2, with_ties = FALSE)

    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
3  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
4  21       6 160     110  3.9   2.62  16.5     0     1     4     4
5  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2
6  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2

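From the documentation on the with_ties argument:

Should ties be kept together? The default, TRUE, may return more rows than you request. Use FALSE to ignore ties, and return the first n rows.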
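
For readers without dplyr, the two tie behaviours can be approximated in base R through the ties.method argument of rank. This is only a sketch of the idea, not how slice_max() is implemented; `neg_mpg`, `rank_min`, `rank_first`, `keep_ties` and `drop_ties` are made-up names.

```r
# negate mpg so that rank 1 is the highest mpg within each group
neg_mpg <- -mtcars$mpg

# "min": tied values share the same rank, so ties are kept together
rank_min <- ave(neg_mpg, mtcars$cyl,
                FUN = function(v) rank(v, ties.method = "min"))

# "first": ties are broken by row order, so exactly n rows survive per group
rank_first <- ave(neg_mpg, mtcars$cyl,
                  FUN = function(v) rank(v, ties.method = "first"))

keep_ties <- mtcars[rank_min <= 2, ]    # roughly like with_ties = TRUE
drop_ties <- mtcars[rank_first <= 2, ]  # roughly like with_ties = FALSE
```

In mtcars the 6-cylinder group has two cars tied at 21.0 mpg for second place, so `keep_ties` contains one more row than `drop_ties`.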

【Comments】:

【Answer 10】:

A data.table way of selecting the 3 lowest mpg values per group:

library(data.table)

data("mtcars")
# setDT() converts mtcars to a data.table by reference (row names are dropped)
setDT(mtcars)[order(mpg), head(.SD, 3), by = "cyl"]

【Comments】:
