sql Hive window and analytic functions, the OVER clause
-- Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
-- Reference 1: http://www.cnblogs.com/yejibigdata/p/6376409.html
-- Reference 2: http://lxw1234.com/archives/tag/hive-window-functions
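-- Like aggregate functions, window functions compute an aggregate over a set of rows. Unlike ordinary aggregate functions, which return one row per group, a window function returns a result for every row of every group, with the aggregate value attached, because the set of rows it aggregates over is a window. ISO SQL defines such functions as window functions.
-- The effect of a Hive window function: every row keeps the summary result, instead of each group collapsing into a single summary row as with an aggregate function.
-- Note: a window function is evaluated after GROUP BY, so it cannot aggregate over the pre-grouping rows at the same query level. Either window first (in a subquery) and then aggregate, or aggregate first (in a subquery) and then window.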
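-- A minimal sketch of the difference between the two forms. The CTE name demo_pv
-- and its three rows are hypothetical, made up for illustration only.
WITH demo_pv AS (
    SELECT 'c1' AS cookieid, '2018-03-12' AS createtime, 1 AS pv
    UNION ALL SELECT 'c1', '2018-03-13', 5
    UNION ALL SELECT 'c2', '2018-03-12', 7
)
-- Window form: 3 rows in, 3 rows out; every row carries the total of its partition.
SELECT cookieid,
       createtime,
       pv,
       SUM(pv) OVER (PARTITION BY cookieid) AS total_pv -- 6 on both c1 rows, 7 on the c2 row
FROM demo_pv;
-- Aggregate form over the same rows collapses each group to one row (2 rows out):
--   SELECT cookieid, SUM(pv) AS total_pv FROM demo_pv GROUP BY cookieid;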
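-- Window clause
WINDOW clause: a window can be defined separately in a named WINDOW clause, or the window specification can be written directly after ORDER BY inside OVER
An example: SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4 -- the window spans the three preceding rows through the current row
If ROWS BETWEEN is omitted but ORDER BY is present, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, that is, from the start of the partition to the current row (including rows that tie with it on the ORDER BY key); if ORDER BY is omitted as well, the frame is the whole partition;
Usually ORDER BY should be specified explicitly, otherwise the order of rows within the partition is not controlled;
The key is to understand ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: the partition boundary; UNBOUNDED PRECEDING means from the first row of the partition, UNBOUNDED FOLLOWING means to the last row of the partition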
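-- The frame keywords above can be sketched as follows. The CTE demo_pv and its rows
-- are hypothetical; the column names follow the SUM(pv) example above.
-- Values in the comments are listed in createtime order (pv = 1, 5, 7, 3).
WITH demo_pv AS (
    SELECT 'c1' AS cookieid, '2018-03-12' AS createtime, 1 AS pv
    UNION ALL SELECT 'c1', '2018-03-13', 5
    UNION ALL SELECT 'c1', '2018-03-14', 7
    UNION ALL SELECT 'c1', '2018-03-15', 3
)
SELECT cookieid,
       createtime,
       pv,
       -- no frame given: partition start to current row -> 1, 6, 13, 16
       SUM(pv) OVER (PARTITION BY cookieid ORDER BY createtime) AS pv_running,
       -- previous row plus current row -> 1, 6, 12, 10
       SUM(pv) OVER (PARTITION BY cookieid ORDER BY createtime
                     ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS pv_prev1,
       -- one row on each side of the current row -> 6, 13, 15, 10
       SUM(pv) OVER (PARTITION BY cookieid ORDER BY createtime
                     ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS pv_around,
       -- current row to the end of the partition -> 16, 15, 10, 3
       SUM(pv) OVER (PARTITION BY cookieid ORDER BY createtime
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv_rest,
       -- named window defined once in the WINDOW clause -> 16 on every row
       SUM(pv) OVER w AS pv_all
FROM demo_pv
WINDOW w AS (PARTITION BY cookieid);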
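-- Window functions (built-in window operations, provided for convenience):
LEAD(col[, n[, DEFAULT]]) -- the opposite of LAG: the value of col in the n-th row after the current row, within the partition after ordering; n defaults to 1, and NULL is returned when no such row exists and no DEFAULT is given
LAG(col[, n[, DEFAULT]]) -- the value of col in the n-th row before the current row, within the partition after ordering
FIRST_VALUE(col) -- the value of col in the first row of the window after ordering; ordering DESC yields the value from the last row instead
LAST_VALUE(col), etc. (rarely used)
Normally used together with PARTITION BY and ORDER BY. LEAD and LAG cannot take a window clause; FIRST_VALUE and LAST_VALUE can, and with the default frame LAST_VALUE only sees rows up to the current row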
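-- A small sketch of these functions on hypothetical data (the CTE demo_pv and its rows
-- are assumptions for illustration). Values are listed in createtime order (pv = 1, 5, 7, 3).
-- Per the Hive documentation, LEAD and LAG take no frame, while FIRST_VALUE and
-- LAST_VALUE honour one, which is why LAST_VALUE is shown twice.
WITH demo_pv AS (
    SELECT 'c1' AS cookieid, '2018-03-12' AS createtime, 1 AS pv
    UNION ALL SELECT 'c1', '2018-03-13', 5
    UNION ALL SELECT 'c1', '2018-03-14', 7
    UNION ALL SELECT 'c1', '2018-03-15', 3
)
SELECT cookieid,
       createtime,
       pv,
       LAG(pv) OVER (PARTITION BY cookieid ORDER BY createtime) AS prev_pv,          -- NULL, 1, 5, 7
       LAG(pv, 2, 0) OVER (PARTITION BY cookieid ORDER BY createtime) AS prev2_pv,   -- 0, 0, 1, 5
       LEAD(pv, 1, -1) OVER (PARTITION BY cookieid ORDER BY createtime) AS next_pv,  -- 5, 7, 3, -1
       FIRST_VALUE(pv) OVER (PARTITION BY cookieid ORDER BY createtime) AS first_pv, -- 1, 1, 1, 1
       -- default frame ends at the current row, so this just echoes pv -> 1, 5, 7, 3
       LAST_VALUE(pv) OVER (PARTITION BY cookieid ORDER BY createtime) AS last_pv_default,
       -- widening the frame to the whole partition gives the true last value -> 3, 3, 3, 3
       LAST_VALUE(pv) OVER (PARTITION BY cookieid ORDER BY createtime
                            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_pv
FROM demo_pv;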
-- OVER clause (used with the standard aggregates, giving them partitioning and windowing capability)
COUNT
SUM
MIN
MAX
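AVG, etc.
Followed by PARTITION BY with one or more partition columns and by ORDER BY (note that ORDER BY does more than sort the output: combined with an aggregate function it fixes the order of rows in the window, so the aggregate becomes cumulative); a window clause can follow as well, to limit the range of rows aggregated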
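-- A sketch of how PARTITION BY, ORDER BY and a window clause change what a standard
-- aggregate returns. The CTE demo_shop_pv and its rows are hypothetical; two rows share
-- the same dt on purpose, to show how ties behave under the default RANGE frame.
-- Rows in dt order: (03-12, 2), (03-13, 4), (03-13, 6), (03-14, 8).
WITH demo_shop_pv AS (
    SELECT '36201' AS shop_id, '2018-03-12' AS dt, 2 AS pv
    UNION ALL SELECT '36201', '2018-03-13', 4
    UNION ALL SELECT '36201', '2018-03-13', 6
    UNION ALL SELECT '36201', '2018-03-14', 8
)
SELECT shop_id,
       dt,
       pv,
       COUNT(*) OVER (PARTITION BY shop_id) AS cnt_all,             -- 4 on every row
       AVG(pv) OVER (PARTITION BY shop_id) AS avg_all,              -- 5.0 on every row
       MAX(pv) OVER (PARTITION BY shop_id ORDER BY dt) AS max_running, -- 2, 6, 6, 8
       -- default RANGE frame: both 03-13 rows are peers and see each other -> 2, 12, 12, 20
       SUM(pv) OVER (PARTITION BY shop_id ORDER BY dt) AS sum_range,
       -- explicit ROWS frame counts physical rows; the first 03-13 row gets 6 or 8
       -- (whichever tie the engine orders first), the second gets 12, the last row 20
       SUM(pv) OVER (PARTITION BY shop_id ORDER BY dt
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_rows
FROM demo_shop_pv;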
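-- Analytic functions (used with the OVER clause to work within each partition)
RANK -- rank within the partition; rows with equal sort values share a rank, and the next rank skips ahead by the number of tied rows
ROW_NUMBER -- sequence number within the partition; rows with equal sort values are numbered in arbitrary order, and no number repeats
DENSE_RANK -- rank within the partition; rows with equal sort values share a rank, and the next rank follows immediately with no gap
CUME_DIST -- (number of rows in the partition with a value less than or equal to the current value) / (total rows in the partition)
PERCENT_RANK -- (rarely used) (RANK of the current row - 1) / (total rows in the partition - 1)
NTILE -- ordered binning: NTILE(n) splits the partition, ordered by the ORDER BY column, into n buckets as evenly as possible and returns the bucket number
Used together with ORDER BY (PARTITION BY is optional); these functions cannot take a window clause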
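-- A sketch of the ranking functions side by side. The CTE demo_rank and its rows are
-- hypothetical; rows b and c tie on pv to show how each function treats ties.
-- Values are listed in pv order: a (10), b and c (20), d (30).
WITH demo_rank AS (
    SELECT 'a' AS id, 10 AS pv
    UNION ALL SELECT 'b', 20
    UNION ALL SELECT 'c', 20
    UNION ALL SELECT 'd', 30
)
SELECT id,
       pv,
       RANK() OVER (ORDER BY pv) AS rk,            -- 1, 2, 2, 4
       DENSE_RANK() OVER (ORDER BY pv) AS dense_rk, -- 1, 2, 2, 3
       ROW_NUMBER() OVER (ORDER BY pv) AS rn,      -- 1, 2, 3, 4 (b and c may swap)
       CUME_DIST() OVER (ORDER BY pv) AS cd,       -- 0.25, 0.75, 0.75, 1.0
       PERCENT_RANK() OVER (ORDER BY pv) AS pr,    -- 0.0, 1/3, 1/3, 1.0
       NTILE(2) OVER (ORDER BY pv) AS bucket       -- 1, 1, 2, 2 (b and c may swap)
FROM demo_rank;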
-- Window first, then aggregate
SELECT
    cast(t1.orderid AS string) AS sale_ord_id, -- order ID
    t1.skuid AS skuid, -- SKU (product) ID
    t1.sale_ordamount,
    case
        when t1.yn = 1
            and t1.order_status in (2, 3)
        then '0'
        when t1.sale_ordamount >= 100000
        then '1'
        else '0'
    end AS bigamount_type, -- large-order flag
    t1.`date` AS order_dt, -- order date (day)
    count(distinct t1.orderid, t1.skuid) AS order_line_cnt, -- number of order lines
    sum(t1.order_price) AS order_price, -- order amount; divide by 1,000,000 to get yuan
    t1.clickid AS click_id, -- ad click ID; for 京腾 contracts this is md5urlid
    t1.click_time AS click_dt, -- ad click time
    t1.real_time AS real_time, -- same-day order flag: 0 = default (unknown), 1 = yes, 2 = no
    cast(t1.my_self AS string) AS my_self, -- attribution result: 1 = indirect, 2 = influenced, 3 = direct
    t1.user_type AS user_type, -- 1 = advertiser, 2 = purchasing and sales
    t1.urlid AS urlid, -- ad identifier
    '收订' AS order_status_type -- literal kept as in the source (roughly "order received")
FROM
    (
        SELECT
            *,
            -- window step: total order amount in yuan, repeated on every row of the same order
            sum(coalesce(order_price, 0.0) / 1000000) over(partition by cast(orderid AS string)) AS sale_ordamount
        FROM
            ad.ads_sz_order_detail_snapshot
        WHERE
            dt = sysdate( - 1) -- sysdate(-1) is presumably a site-specific UDF for yesterday's date
    ) AS t1
WHERE
    dt = sysdate( - 1)
    AND yn = 1
    AND
    (
        case
            when t1.yn = 1
                and t1.order_status in (2, 3)
            then '0'
            when t1.sale_ordamount >= 100000
            then '1'
            else '0'
        end
    )
    = '0'
-- aggregate step: group the windowed rows
GROUP BY
    cast(t1.orderid AS string),
    t1.skuid,
    case
        when t1.yn = 1
            and t1.order_status in (2, 3)
        then '0'
        when t1.sale_ordamount >= 100000
        then '1'
        else '0'
    end,
    t1.`date`,
    t1.clickid,
    t1.click_time,
    t1.real_time,
    cast(t1.my_self AS string),
    t1.user_type,
    t1.urlid,
    t1.sale_ordamount;
-- Aggregate first, then window
select t.shop_id,
       t.dt,
       pv,
       -- min(pv) keep (dense_rank first order by dt) over (partition by shop_id) first_pv,
       lag(pv, 3, pv) over(partition by shop_id order by dt) as prev_pv_3, -- period-over-period comparison; LAG looks backward, LEAD looks forward
       lag(pv, 3, 0) over(partition by shop_id order by dt) as prev_pv_3v,
       FIRST_VALUE(pv) OVER(PARTITION BY shop_id ORDER BY dt) AS first_pv,
       round(avg(pv) over(partition by t.shop_id order by t.dt), 2) avg_pv, -- running average; the window grows by one period per row
       max(pv) over(partition by t.shop_id order by t.dt) max_pv,
       min(pv) over(partition by t.shop_id order by t.dt) min_pv,
       sum(pv) over(partition by t.shop_id order by t.dt) sum_pv,
       count(*) over(partition by t.shop_id order by t.dt) count_num,
       avg(t.pv) over(partition by shop_id order by dt rows between 2 preceding and current row) move_avg3_pv, -- moving average, window size = 3
       ntile(10) over(order by pv asc) pr_pv, -- ordered binning into 10 buckets across all shops
       stddev(pv) over(partition by shop_id order by dt) as sdev_pv, -- standard deviation
       dense_rank() over(partition by shop_id order by dt) as `rank`
  from (select shop_id, dt, sum(pv) as pv -- aggregate step: one row per shop and day
          from ad.ad_user_all_pv a
         where dt >= '2018-03-12'
           and shop_id in ('732842',
                           '36201',
                           '58463',
                           '607805',
                           '77634',
                           '22838',
                           '13823')
         group by shop_id, dt) t
 order by shop_id, dt;