Grouping query into group and subgroup

Posted 2023-02-16

[Published]: 2014-11-01 11:09:16
[Question]:

I want to group my data using SQL or R so that I can get the top or bottom 10 Subarea_codes for each Company and Area_code. Essentially: the Subarea_codes within each Area_code where each Company has its largest or smallest Result.

data.csv

Area_code  Subarea_code  Company   Result
10         101           A         15
10         101           P         10
10         101           C         4
10         102           A         10
10         102           P         8
10         102           C         5
11         111           A         15
11         111           P         20
11         111           C         5
11         112           A         10
11         112           P         5
11         112           C         10


result.csv should be like this

Company   Area_code  Largest_subarea_code  Result  Smallest_subarea_code    Result
A         10         101                   15      102                      10
P         10         101                   10      102                      8            
C         10         102                   5       101                      4
A         11         111                   15      112                      10
P         11         111                   20      112                      5
C         11         112                   10      111                      5

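Each Area_code can contain hundreds of Subarea_codes, but I only want the top 10 and bottom 10 for each Company.

Also, this does not have to be solved in one query; it can be split into two, with the smallest shown in results_10_smallest and the largest in result_10_largest. But I hope I can accomplish this with one query for each result.

What I have tried: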

SELECT Company, Area_code, Subarea_code, MAX(Result) 
    AS Max_result
FROM data
GROUP BY Subarea_code
ORDER BY Company
;

This gives me the highest Result in each Subarea_code across all Companies. For the data above that means: A, A, P, A-C.

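[Question comments]:

[Answer 1]:

There seems to be a discrepancy between the output shown and the description. The description asks for the top 10 and bottom 10 results for each area code/company, but the sample output shows only the top 1 and bottom 1. For example, for area code 10 and company A, subarea 101 is the top with a result of 15 and subarea 102 is second largest with a result of 10, so according to the description there should be two rows for that company/area code combination. (With more data there could be up to 10 rows for that combination.)

We give two answers. The first assumes that the top 10 and bottom 10 are wanted for each company and area code, as in the question's description. The second assumes that only the top and bottom are wanted for each company and area code, as in the question's sample output.

1) Top/Bottom 10

Here we assume that the top 10 and bottom 10 results are wanted for each company/area code. If only the top and bottom are wanted, see (2) later (or replace 10 with 1 in the code here). Bottom10 is all rows for which there are 10 or fewer subareas with the same area code and company and an equal or smaller result. Top10 is similar. The data is assumed to be in a data frame named df.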

library(sqldf)

Bottom10 <- sqldf("select a.Company, 
                          a.Area_code, 
                          a.Subarea_code Bottom_Subarea, 
                          a.Result Bottom_Result,
                          count(*) Bottom_Rank
        from df a join df b  
        on a.Company = b.Company and 
           a.Area_code = b.Area_code and
           b.Result <= a.Result
        group by a.Company, a.Area_code, a.Subarea_code
        having count(*) <= 10")

Top10 <- sqldf("select a.Company, 
                       a.Area_code, 
                       a.Subarea_code Top_Subarea, 
                       a.Result Top_Result,
                       count(*) Top_Rank
        from df a join df b  
        on a.Company = b.Company and 
           a.Area_code = b.Area_code and
           b.Result >= a.Result
        group by a.Company, a.Area_code, a.Subarea_code
        having count(*) <= 10")

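The description suggests that you want either the top 10 or the bottom 10 for each company/area code, in which case just use the relevant one of the two results above. If you want to combine them, a merge is shown below. A Rank column is added to indicate the smallest/largest (Rank 1), second smallest/largest (Rank 2), and so on.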

sqldf("select t.Area_code, 
              t.Company, 
              t.Top_Rank Rank,
              t.Top_Subarea, 
              t.Top_Result,
              b.Bottom_Subarea,
              b.Bottom_Result
       from Bottom10 b join Top10 t
       on t.Area_code = b.Area_code and 
          t.Company = b.Company and
          t.Top_Rank = b.Bottom_Rank
       order by t.Area_code, t.Company, t.Top_Rank")

giving:

   Area_code Company Rank Top_Subarea Top_Result Bottom_Subarea Bottom_Result
1         10       A    1         101         15            102            10
2         10       A    2         102         10            101            15
3         10       C    1         102          5            101             4
4         10       C    2         101          4            102             5
5         10       P    1         101         10            102             8
6         10       P    2         102          8            101            10
7         11       A    1         111         15            112            10
8         11       A    2         112         10            111            15
9         11       C    1         112         10            111             5
10        11       C    2         111          5            112            10
11        11       P    1         111         20            112             5
12        11       P    2         112          5            111            20
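
The self-joins above rank each row by counting how many rows in the same company/area code group have a Result at least as large (Top_Rank) or at most as large (Bottom_Rank). As a rough illustration of that counting idea outside SQL, here is a minimal base R sketch on the question's sample data. The helper name count_rank is made up for this example and is not part of the answer's code.

```r
df <- read.table(text = "Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10", header = TRUE)

# Hypothetical helper: for each value, count the values in the same group
# that are >= it (decreasing = TRUE) or <= it (decreasing = FALSE)
count_rank <- function(x, decreasing = TRUE) {
  vapply(x, function(v) if (decreasing) sum(x >= v) else sum(x <= v),
         numeric(1))
}

# ave() applies the helper within each Company/Area_code group
df$Top_Rank <- ave(df$Result, df$Company, df$Area_code,
                   FUN = function(x) count_rank(x, TRUE))
df$Bottom_Rank <- ave(df$Result, df$Company, df$Area_code,
                      FUN = function(x) count_rank(x, FALSE))

# Same filter as "having count(*) <= 10" in the SQL
Top10 <- df[df$Top_Rank <= 10, ]
Bottom10 <- df[df$Bottom_Rank <= 10, ]
Top10[order(Top10$Area_code, Top10$Company, Top10$Top_Rank), ]
```

With only two subareas per group in the sample data, every row passes the cutoff of 10; lowering the cutoff to 1 keeps just the top (or bottom) row of each group.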

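Note that this format makes less sense if there are ties and, in fact, it could then generate more than 10 rows for a company/area code, so in that case you may want to use just the individual Top10 and Bottom10. If this is a problem you could also consider jittering df$Result: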

df$Result <- jitter(df$Result)
# now perform SQL statements
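
To see why ties are awkward for the counting approach, the sketch below uses three made-up results for a single company/area code group, with a tie at the top. It also shows one possible alternative to jittering, which is not part of the answer: rank() with ties.method = "first" breaks ties by position and leaves the values untouched.

```r
# Made-up results for one Company/Area_code group, with a tie at the top
res <- c(20, 20, 5)

# Counting rows with Result >= own Result, as the self-join does:
# both tied rows share a rank and no row gets rank 1
count_top <- vapply(res, function(v) sum(res >= v), numeric(1))
count_top

# rank() with ties.method = "first" breaks ties by order of appearance,
# so every row gets a distinct rank
first_top <- rank(-res, ties.method = "first")
first_top
```

Which row wins a tie under "first" depends on row order, so this is only appropriate when an arbitrary but reproducible choice is acceptable.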

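2) Top/Bottom only

Here we give only the top and bottom result, with the corresponding subarea, for each company/area code. Note that this relies on an extension to SQL supported by sqlite (when a query uses min() or max(), the other selected columns are taken from the row that holds that minimum or maximum), and the SQL code is considerably simpler: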

Bottom1 <- sqldf("select Company, 
                          Area_code, 
                          Subarea_code Bottom_Subarea, 
                          min(Result) Bottom_Result
        from df
        group by Company, Area_code")

Top1 <- sqldf("select Company, 
                      Area_code, 
                      Subarea_code Top_Subarea, 
                      max(Result) Top_Result
        from df
        group by Company, Area_code")

sqldf("select a.Company, 
              a.Area_code, 
              Top_Subarea, 
              Top_Result,
              Bottom_Subarea,
              Bottom_Result
        from Top1 a join Bottom1 b  
        on a.Company = b.Company and 
           a.Area_code = b.Area_code
        order by a.Area_code, a.Company")

giving:

  Company Area_code Top_Subarea Top_Result Bottom_Subarea Bottom_Result
1       A        10         101         15            102            10
2       C        10         102          5            101             4
3       P        10         101         10            102             8
4       A        11         111         15            112            10
5       C        11         112         10            111             5
6       P        11         111         20            112             5
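
The queries in (2) depend on SQLite-specific handling of columns selected alongside min() or max(). For comparison, here is a minimal base R sketch of the same per-group lookup that avoids that behavior by locating the extreme row explicitly with which.max() and which.min(). The names groups and top_bottom are made up for this example.

```r
df <- read.table(text = "Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10", header = TRUE)

# One data frame per Company/Area_code combination
groups <- split(df, list(df$Company, df$Area_code), drop = TRUE)

# For each group, take the subarea on the row holding the max / min Result
top_bottom <- do.call(rbind, lapply(groups, function(g) {
  data.frame(Company        = g$Company[1],
             Area_code      = g$Area_code[1],
             Top_Subarea    = g$Subarea_code[which.max(g$Result)],
             Top_Result     = max(g$Result),
             Bottom_Subarea = g$Subarea_code[which.min(g$Result)],
             Bottom_Result  = min(g$Result))
}))
rownames(top_bottom) <- NULL
top_bottom
```

which.max() and which.min() return the first matching position, so with tied results only the first tied subarea in each group is reported.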

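Update: corrected and added (2).

[Comments]:

[Answer 2]:

In this script the user specifies the company. The script then reports the 10 largest results (the same idea works for the smallest).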

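# Read the tab-separated data file
A <- read.table("/your-file.txt", header = TRUE, sep = "\t", na.strings = "NA")
Company <- A$Company == "A"  # can be A, C, P or other values

Subarea <- unique(A$Subarea_code)
Result <- numeric(length(Subarea))

# Largest result of the chosen company in each subarea
# (note: subareas are not split by Area_code here)
for (i in seq_along(Subarea)) {
  Result[i] <- max(A$Result[Company & A$Subarea_code == Subarea[i]])
}
Res1 <- cbind(Subarea, Result)
Res2 <- Res1[order(-Res1[, 2]), , drop = FALSE]
head(Res2, 10)  # top 10, or fewer if there are fewer subareas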

[Comments]:

[Answer 3]:

The answers above can get the max result.

This solves the top 10 problem:

data.top <- data[ave(-data$Result, data$Company, data$Area_code, FUN = rank) <= 10, ]
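
To see what the ave()/rank() filter does, here is a small sketch on the question's sample data. The cutoff n is a made-up variable, lowered from 10 to 1 because each group in the sample has only two subareas; the data frame is called df here rather than data.

```r
df <- read.table(text = "Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10", header = TRUE)

n <- 1  # cutoff; the answer uses 10

# Rank within each Company/Area_code group, 1 = largest Result
rank_desc <- ave(-df$Result, df$Company, df$Area_code, FUN = rank)
# Same without the minus sign, 1 = smallest Result
rank_asc <- ave(df$Result, df$Company, df$Area_code, FUN = rank)

data.top <- df[rank_desc <= n, ]
data.bottom <- df[rank_asc <= n, ]
data.top
data.bottom
```

rank() averages tied ranks by default, so tied rows near the cutoff are either all kept or all dropped; passing a different ties.method changes that.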

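[Comments]:

Perhaps data[with(data, ave(-Result, Company, Area_code, FUN = rank)) <= 10, ] would be better. Something similar can be done with data.table, for example: setDT(data)[, .SD[rank(-Result) <= 10], by = list(Company, Area_code)]

How do I adapt this for the bottom 10? Remove the - before data$Result?

Yes... When I tried it I thought I got different results, but it worked, cheers.

[Answer 4]: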

If you are already doing this in R, why not use the more efficient data.table instead of sqldf with SQL syntax? Assuming data is your data set, simply:

library(data.table)
setDT(data)[, list(Largest_subarea_code = Subarea_code[which.max(Result)],
            Resultmax = max(Result),
            Smallest_subarea_code = Subarea_code[which.min(Result)],
            Resultmin = min(Result)), by = list(Company, Area_code)]
#    Company Area_code Largest_subarea_code Resultmax Smallest_subarea_code Resultmin
# 1:       A        10                  101        15                   102        10
# 2:       P        10                  101        10                   102         8
# 3:       C        10                  102         5                   101         4
# 4:       A        11                  111        15                   112        10
# 5:       P        11                  111        20                   112         5
# 6:       C        11                  112        10                   111         5

[Comments]:

[Answer 5]:

Using the sqldf package:

df <- read.table(text="Area_code  Subarea_code  Company   Result
10         101           A         15
10         101           P         10
10         101           C         4
10         102           A         10
10         102           P         8
10         102           C         5
11         111           A         15
11         111           P         20
11         111           C         5
11         112           A         10
11         112           P         5
11         112           C         10", header=TRUE)

library(sqldf)
mymax <- sqldf("select Company,
                  Area_code,
                  max(Subarea_code) Largest_subarea_code
                  from df
                  group by Company,Area_code")
mymaxres <- sqldf("select d.Company,
                          d.Area_code,
                          m.Largest_subarea_code,
                          d.Result
                  from df d, mymax m
                  where d.Company=m.Company and
                        d.Subarea_code=m.Largest_subarea_code")

mymin <- sqldf("select Company,
                  Area_code,
                  min(Subarea_code) Smallest_subarea_code
                  from df
                  group by Company,Area_code")
myminres <- sqldf("select d.Company,
                          d.Area_code,
                          m.Smallest_subarea_code,
                          d.Result
                  from df d, mymin m
                  where d.Company=m.Company and
                        d.Subarea_code=m.Smallest_subarea_code")
result <- sqldf("select a.*, b.Smallest_subarea_code,b.Result
                from mymaxres a, myminres b
                where a.Company=b.Company and 
                      a.Area_code=b.Area_code")

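[Comments]:

+1. I guess the expected result is slightly different from result. That could be an error by the OP.

Which error do you mean?

Not properly tested; this should be enough to get started, feel free to edit.

Yes, the max/min queries above are wrong, as it should be max(Result), Subarea_code Largest_subarea_code. But other than that it works!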
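
As the last comment points out, max(Subarea_code) selects the highest subarea code, not the subarea with the highest Result. One way to get the intended rows without relying on SQLite's handling of max() is to compute the maximum Result per group first and then join back on it. The following is a base R sketch of that idea using aggregate() and merge(), a swapped-in technique rather than the answerer's sqldf code; the name mx is made up for this example.

```r
df <- read.table(text = "Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10", header = TRUE)

# Step 1: largest Result per Company/Area_code
mx <- aggregate(Result ~ Company + Area_code, data = df, FUN = max)

# Step 2: join back to df on all three columns to recover the subarea
mymaxres <- merge(df, mx, by = c("Company", "Area_code", "Result"))
names(mymaxres)[names(mymaxres) == "Subarea_code"] <- "Largest_subarea_code"
mymaxres
```

Replacing max with min gives the smallest side. If several subareas tie for the extreme Result, the join returns one row per tied subarea.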
