Window Functions

Posted 2023-05-04


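Using over with aggregate functions:
General form:
aggregate_function(column) over(options)
over must be used together with an aggregate function or a ranking function. The aggregate functions are:
sum(),max(),min(),count(),avg()
The ranking functions are:
rank(),row_number(),dense_rank(),ntile()
over marks the function as a window function rather than a plain aggregate. The SQL standard allows every aggregate function to be used as a window function, and the over keyword is what distinguishes the two usages.
A window function can group data without group by, so a single query can return the columns of the underlying rows together with the aggregated columns.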

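The window function count(*) over() returns, for every row of the query result, the number of rows that satisfy the query conditions. Options are often added inside the parentheses after over to change the window over which the aggregate is computed; if the parentheses are empty, the window function aggregates over all rows of the result set.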
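
A minimal sketch of count(*) over() with empty parentheses. To keep it self-contained it runs the SQL through Python's sqlite3 module (this assumes SQLite 3.25 or later, which added window functions); the table t_person, its columns and its rows are hypothetical and not taken from the original article.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_person (fname TEXT, fcity TEXT, fsalary INTEGER)")
conn.executemany(
    "INSERT INTO t_person VALUES (?, ?, ?)",
    [("Tom", "Beijing", 1000), ("Jim", "Beijing", 2000),
     ("Lily", "London", 2000), ("Kim", "London", 8000)],
)

# count(*) OVER () attaches the number of rows that pass the WHERE filter
# to every returned row, without collapsing rows the way GROUP BY would.
rows = conn.execute("""
    SELECT fname, fsalary, count(*) OVER () AS matching_rows
    FROM t_person
    WHERE fsalary < 5000
    ORDER BY fname
""").fetchall()

for row in rows:
    print(row)
```

Three people earn less than 5000, so each of the three returned rows carries 3 in the last column next to its own name and salary.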

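Common form:
sum(C) over(partition by A order by B)
partition by: groups the rows by A; used on its own, it yields the sum over all rows of the group.
order by: sorts the rows by B within the group and yields a running sum. If B is id and two rows share the same id, both rows get the same value in the sum column, namely the running sum up to the last row with that id (the examples below cover this in detail).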

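order by column_name rows|range between bound1 and bound2
rows: positions the frame by counting rows.
range: positions the frame by the values of the order by column.
The two modes differ mainly in how they handle ties in the sort order (see the examples below).
A bound can be one of:
current row -- the current row
n preceding -- n rows before the current row (with range, a value offset of n)
unbounded preceding -- back to the first row
n following -- n rows after the current row (with range, a value offset of n)
unbounded following -- forward to the last row
'range/rows between bound1 and bound2' determines which rows take part in the aggregate and is called the window frame.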

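Example:
1. Create the table

2. Insert the data

3. About partition by
(1) Number of people in each person's city: aggregate grouped by city

(2) Show each person's details, the number of people in the same city, and the number of people of the same age

A single SELECT statement can use several window functions at once, and they do not interfere with each other.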
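
The queries for this step are not reproduced in the text, so the following is only a sketch of what two independent partition by windows in one SELECT can look like. It runs through Python's sqlite3 module (assuming SQLite 3.25+); the table t_person, the columns fname, fcity and fage, and all rows are hypothetical.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_person (fname TEXT, fcity TEXT, fage INTEGER)")
conn.executemany(
    "INSERT INTO t_person VALUES (?, ?, ?)",
    [("Tom", "Beijing", 20), ("Jim", "Beijing", 22), ("Lily", "London", 20),
     ("Kim", "London", 22), ("Swing", "Beijing", 20)],
)

# Two window functions with different partitions in the same SELECT:
# each one is evaluated over its own window and does not affect the other.
rows = conn.execute("""
    SELECT fname, fcity, fage,
           count(*) OVER (PARTITION BY fcity) AS city_count,
           count(*) OVER (PARTITION BY fage)  AS age_count
    FROM t_person
    ORDER BY fname
""").fetchall()

for row in rows:
    print(row)
```

Every person keeps their own row, and the two counts are computed per city and per age respectively.
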
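4. order by in detail:
(1) Sum of salaries from the first row to the current row

(2) Replace rows in the query above with range

The result differs from (1) only for rows whose FSalary values tie. Both queries sort by FSalary. With rows, the frame is positioned by row count, so every row gets its own 'sum of salaries up to the current row', strictly the running total from the first row to that row. With range, the frame is positioned by value: the frame clause still reads 'from the first row to the current row', but three people earn 2000, so for each of them the running total extends to the last person earning 2000, and all three rows show 7000.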
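(3) Omit the frame clause from (2)

'range between unbounded preceding and current row' is the most commonly used frame for window functions. It is the default when order by is present, so the frame clause can be left out.

The result is the same as in (2).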
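
The three variants in (1) to (3) can be compared side by side. This is a sketch, not the original queries: it runs through Python's sqlite3 module (assuming SQLite 3.25+), and the table t_person, the names and the salary values are hypothetical, chosen so that three people earn 2000 and the tied rows reach 7000.

```python
import sqlite3

# Hypothetical data: one salary of 1000, three of 2000, one of 3000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_person (fname TEXT, fsalary INTEGER)")
conn.executemany(
    "INSERT INTO t_person VALUES (?, ?)",
    [("A", 1000), ("B", 2000), ("C", 2000), ("D", 2000), ("E", 3000)],
)

rows = conn.execute("""
    SELECT fname, fsalary,
           -- (1) frame positioned by row count
           sum(fsalary) OVER (ORDER BY fsalary
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  AS rows_sum,
           -- (2) frame positioned by value: ties are included together
           sum(fsalary) OVER (ORDER BY fsalary
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum,
           -- (3) no frame clause: behaves like (2)
           sum(fsalary) OVER (ORDER BY fsalary)                   AS default_sum
    FROM t_person
    ORDER BY fsalary, fname
""").fetchall()

for row in rows:
    print(row)
```

With rows, the three people earning 2000 receive 3000, 5000 and 7000 in some order (which tied row gets which total is not defined by the sort key alone). With range, and with the frame omitted, all three receive 7000.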
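(4) Replace sum() above with count() to compute a salary ranking

Sort by salary, then count the people from the first row (unbounded preceding) to the current row (current row); this amounts to computing each person's salary rank.
Question:
How can everyone earning 2000 be given rank 2? See rank() and dense_rank() in the ranking functions section below.
5. About over(partition by A order by B)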

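Using over with ranking functions:
General form:
ranking_function(arguments) over(options)
The ranking functions are:
rank(),dense_rank(),row_number(),ntile(),lead(),lag()
1. Differences between rank(), dense_rank() and row_number()

rank() versus dense_rank():
Both compute the rank of a value within a set of values,
but they differ on ties: dense_rank does not skip ranks after a tie, while rank does.
rank() ranks with gaps: after two rows tied for second place, the next row is fourth (within each partition).
dense_rank() ranks without gaps: after two rows tied for second place, the next row is third.
row_number():
row_number() over(partition by A order by B)
Partitions by A, sorts by B within each partition, and returns each row's sequence number in that order (consecutive and unique within the partition).
It numbers rows rather than ranking them. Some databases, such as SQL Server, require row_number() to be used with order by.
It is often used for paging, for example fetching students 10 through 100.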
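
A small sketch of how the three functions treat ties, run through Python's sqlite3 module (assuming SQLite 3.25+). The table t_person and its rows are hypothetical; two people share the salary 2000.

```python
import sqlite3

# Hypothetical data with one tie at 2000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_person (fname TEXT, fsalary INTEGER)")
conn.executemany(
    "INSERT INTO t_person VALUES (?, ?)",
    [("A", 1000), ("B", 2000), ("C", 2000), ("D", 3000)],
)

rows = conn.execute("""
    SELECT fname, fsalary,
           rank()       OVER (ORDER BY fsalary)        AS rnk,        -- gaps after ties
           dense_rank() OVER (ORDER BY fsalary)        AS dense_rnk,  -- no gaps
           -- fname is added as a tie-breaker so the numbering is deterministic
           row_number() OVER (ORDER BY fsalary, fname) AS row_num
    FROM t_person
    ORDER BY fsalary, fname
""").fetchall()

for row in rows:
    print(row)
```

Both people earning 2000 get rank 2 from rank() and dense_rank(); the next row is 4 under rank() and 3 under dense_rank(), while row_number() simply counts 1 to 4.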

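2. ntile(x) -- even bucketing function: splits the ordered rows into x buckets of as equal size as possible and returns each row's bucket number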
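
A minimal sketch of ntile(), run through Python's sqlite3 module (assuming SQLite 3.25+). The table scores and its values are hypothetical. Five rows are split into two buckets, so the first bucket receives the extra row.

```python
import sqlite3

# Hypothetical data: five distinct scores.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)",
                 [(60,), (70,), (80,), (90,), (100,)])

# ntile(2) assigns each row, in score order, to bucket 1 or bucket 2.
rows = conn.execute("""
    SELECT score, ntile(2) OVER (ORDER BY score) AS bucket
    FROM scores
    ORDER BY score
""").fetchall()

for row in rows:
    print(row)
```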

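3. lag() over(partition by A order by B)
lead() over(partition by A order by B)
lag and lead take three arguments: lag(column, offset, default value returned when the offset falls outside the window)
Given rows sorted by B, lag and lead return a column's value from the row that lies offset rows before or after the current row.
lag() looks backward; lead() looks forward.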
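
A short sketch of lag() and lead() with all three arguments, run through Python's sqlite3 module (assuming SQLite 3.25+). The table sales and its columns are hypothetical; the default value 0 is returned where no previous or next row exists.

```python
import sqlite3

# Hypothetical monthly amounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40)])

rows = conn.execute("""
    SELECT month, amount,
           lag(amount, 1, 0)  OVER (ORDER BY month) AS prev_amount,  -- one row back
           lead(amount, 1, 0) OVER (ORDER BY month) AS next_amount   -- one row ahead
    FROM sales
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
```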

Reference: https://www.cnblogs.com/lihaoyang/p/6756956.html

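Hive for Big Data: Hive Window Functions

1. What is a window function?

An ordinary aggregate function aggregates over a group, while a window function aggregates over a window. An ordinary aggregate therefore returns one value per group (group by), whereas a window function returns a value for every row in the window. Put simply, it adds a column to the query result, and that column can hold an aggregate value or a ranking value.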
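
The contrast can be seen by running the same count both ways. This sketch uses Python's sqlite3 module (assuming SQLite 3.25+) rather than Hive, and a hypothetical three-row table t_scores.

```python
import sqlite3

# Hypothetical data: two students in class1, one in class2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_scores (studentId INTEGER, classId TEXT)")
conn.executemany("INSERT INTO t_scores VALUES (?, ?)",
                 [(111, "class1"), (112, "class1"), (121, "class2")])

# Ordinary aggregate: one output row per group.
grouped = conn.execute("""
    SELECT classId, count(*) FROM t_scores
    GROUP BY classId ORDER BY classId
""").fetchall()

# Window function: one output row per input row, with the count as an extra column.
windowed = conn.execute("""
    SELECT studentId, classId, count(*) OVER (PARTITION BY classId)
    FROM t_scores ORDER BY studentId
""").fetchall()

print(grouped)
print(windowed)
```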

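Window functions generally fall into two categories: aggregate window functions and ranking window functions.

2. Aggregate window functions

Aggregate window functions include count, sum, avg, min, max, first_value, last_value, lag, lead, cume_dist and others.
There are many of them, but they all work the same way. Space is limited, so only a few are covered in depth, chosen either because they generalize to the rest or because they are less familiar and therefore deserve attention.
The focus is therefore on the count, first_value, lag and cume_dist window functions.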

2.1 count window function

Semantics: counts rows
Data preparation:

-- create the table
create table student_scores(
id int,
studentId int,
language int,
math int,
english int,
classId string,
departmentId string
);
-- insert the data
insert into table student_scores values 
  (1,111,68,69,90,'class1','department1'),
  (2,112,73,80,96,'class1','department1'),
  (3,113,90,74,75,'class1','department1'),
  (4,114,89,94,93,'class1','department1'),
  (5,115,99,93,89,'class1','department1'),
  (6,121,96,74,79,'class2','department1'),
  (7,122,89,86,85,'class2','department1'),
  (8,123,70,78,61,'class2','department1'),
  (9,124,76,70,76,'class2','department1'),
  (10,211,89,93,60,'class1','department2'),
  (11,212,76,83,75,'class1','department2'),
  (12,213,71,94,90,'class1','department2'),
  (13,214,94,94,66,'class1','department2'),
  (14,215,84,82,73,'class1','department2'),
  (15,216,85,74,93,'class1','department2'),
  (16,221,77,99,61,'class2','department2'),
  (17,222,80,78,96,'class2','department2'),
  (18,223,79,74,96,'class2','department2'),
  (19,224,75,80,78,'class2','department2'),
  (20,225,82,85,63,'class2','department2');

-- count window function

select studentId,math,departmentId,classId,
-- window: all rows that satisfy the where clause
count(math) over() as count1,
 -- window: all rows in the same classId partition
count(math) over(partition by classId) as count2,
 -- window: within the classId partition sorted by math, all rows up to and including the current row
count(math) over(partition by classId order by math) as count3,
 -- window: within the classId partition sorted by math, the current row plus 1 row before and 2 rows after
count(math) over(partition by classId order by math rows between 1 preceding and 2 following) as count4
from student_scores where departmentId='department1';

Result
studentid   math    departmentid    classid count1  count2  count3  count4
111         69      department1     class1  9       5       1       3
113         74      department1     class1  9       5       2       4
112         80      department1     class1  9       5       3       4
115         93      department1     class1  9       5       4       3
114         94      department1     class1  9       5       5       2
124         70      department1     class2  9       4       1       3
121         74      department1     class2  9       4       2       4
123         78      department1     class2  9       4       3       3
122         86      department1     class2  9       4       4       2

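Explanation:
For studentid=115: count1 is the total number of rows, 9; count2 is the number of rows in partition class1, 5; count3 is the number of rows in partition class1 with math <= 93, which is 4;
count4 is the number of rows in partition class1 from 1 row before to 2 rows after the current row (only 1 row actually follows), 3 in total.

Note: order by inside over should not be read as a plain sort. It also limits the window to the rows from the start of the partition up to the current row, and the aggregate is computed inside that window. For example, each value of sum(math) over(partition by classId order by math) as sum3 is obtained by first fixing how many rows the window spans and then summing math within it.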
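
The sum3 expression from the note can be tried on the class1 rows of department1. This is a sketch that uses Python's sqlite3 module (assuming SQLite 3.25+) instead of Hive and a cut-down student_scores table holding only the three columns needed; the math values are the ones from the data above.

```python
import sqlite3

# Cut-down copy of student_scores: department1 / class1 rows only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_scores (studentId INTEGER, math INTEGER, classId TEXT)")
conn.executemany(
    "INSERT INTO student_scores VALUES (?, ?, ?)",
    [(111, 69, "class1"), (112, 80, "class1"), (113, 74, "class1"),
     (114, 94, "class1"), (115, 93, "class1")],
)

# With order by and no frame clause, the window grows row by row,
# so sum3 is a running total of math in ascending math order.
rows = conn.execute("""
    SELECT studentId, math,
           sum(math) OVER (PARTITION BY classId ORDER BY math) AS sum3
    FROM student_scores
    ORDER BY math
""").fetchall()

for row in rows:
    print(row)
```

The last row's sum3 equals the total for the whole partition, 410, because by then the window spans all five rows.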

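2.2 sum window function

Semantics: sum of a column
Usage is the same as for count above.

2.3 avg window function

Semantics: average of a column
Usage is the same as for count above.
Note: because avg involves numeric computation, round() is often used to keep the output tidy.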
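
A sketch of wrapping a windowed avg in round(), again using Python's sqlite3 module (assuming SQLite 3.25+) rather than Hive, with a cut-down student_scores table that holds the department1 / class1 math values from the data above. Rounding to 2 decimal places is an arbitrary choice for the illustration.

```python
import sqlite3

# Cut-down copy of student_scores: department1 / class1 rows only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_scores (studentId INTEGER, math INTEGER, classId TEXT)")
conn.executemany(
    "INSERT INTO student_scores VALUES (?, ?, ?)",
    [(111, 69, "class1"), (112, 80, "class1"), (113, 74, "class1"),
     (114, 94, "class1"), (115, 93, "class1")],
)

# Running average of math, rounded to 2 decimal places for readability.
rows = conn.execute("""
    SELECT studentId, math,
           round(avg(math) OVER (PARTITION BY classId ORDER BY math), 2) AS avg3
    FROM student_scores
    ORDER BY math
""").fetchall()

for row in rows:
    print(row)
```

The third row's running average is 223 / 3, which round() shortens from a repeating decimal to 74.33.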

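2.4 min window function

Semantics: minimum of a column
Usage is the same as for count above.

2.5 max window function

Semantics: maximum of a column
Usage is the same as for count above.

2.6 first_value window function

Semantics: first value of a column within the window
Usage is the same as for count above.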

-- first_value window function

select studentId,math,departmentId,classId,
-- window: all rows that satisfy the where clause
first_value(math) over() as first_value1,
-- window: all rows in the same classId partition
first_value(math) over(partition by classId) as first_value2,
 -- window: within the classId partition sorted by math, all rows up to and including the current row
first_value(math) over(partition by classId order by math) as first_value3,
 -- window: within the classId partition sorted by math, the current row plus 1 row before and 2 rows after
first_value(math) over(partition by classId order by math rows between 1 preceding and 2 following) as first_value4
from student_scores where departmentId='department1';

Result
studentid   math    departmentid    classid first_value1    first_value2    first_value3    first_value4
111         69      department1     class1  69              69              69              69
113         74      department1     class1  69              69              69              69
112         80      department1     class1  69              69              69              74
115         93      department1     class1  69              69              69              80
114         94      department1     class1  69              69              69              93
124         70      department1     class2  69              74              70              70
121         74      department1     class2  69              74              70              70
123         78      department1     class2  69              74              70              74
122         86      department1     class2  69              74              70              78

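Explanation:
    For studentid=124: first_value1 is 69, the first math value over all rows; first_value2 is 74, the first math value in partition class2, the partition this row belongs to.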

Reference: https://blog.csdn.net/wangpei1949/article/details/81437574
