Implementing a cumulative sum in HiveQL

Posted 2023-04-20


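Reference technique A

1. Requirement
Given the following table of visitor access counts, t_access:

Output, for each user and each month, the total number of visits in that month together with the cumulative number of visits up to that month.
Output table: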

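2. Approach

1) First, compute each user's total number of visits per month.

2) Second, join the monthly-totals table with itself (inner join), matching rows that belong to the same user.

3) Third, group the joined result by a.username and a.month and compute the running total: sum the monthly totals of all rows of b with b.month <= a.month. (The source names this column b.salary, which looks like a leftover from another example; what is summed is b's monthly visit total.)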
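The three steps above can be sketched as follows. This is only an illustration, not the original HQL: it runs the self-join against an in-memory SQLite database from Python, and the column names (username, month, counts), the alias month_total, and the sample rows are all assumptions, since the source table is not shown.

```python
import sqlite3

# Hypothetical sample rows: (username, month, counts). Not taken from the source.
SAMPLE_ROWS = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 5),
    ("A", "2015-02", 4), ("A", "2015-02", 10), ("B", "2015-02", 10),
    ("B", "2015-02", 5),
]

QUERY = """
WITH monthly AS (                      -- step 1: total visits per user and month
    SELECT username, month, SUM(counts) AS month_total
    FROM t_access
    GROUP BY username, month
)
SELECT a.username,
       a.month,
       MAX(a.month_total) AS month_total,   -- same value on every joined row
       SUM(b.month_total) AS running_total  -- step 3: sum of months <= a.month
FROM monthly a
JOIN monthly b ON a.username = b.username   -- step 2: self inner join
WHERE b.month <= a.month
GROUP BY a.username, a.month
ORDER BY a.username, a.month
"""

def cumulative_access(rows):
    """Return (username, month, month_total, running_total) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t_access (username TEXT, month TEXT, counts INTEGER)")
        conn.executemany("INSERT INTO t_access VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()

for row in cumulative_access(SAMPLE_ROWS):
    print(row)
```

The comparison b.month <= a.month relies on months being stored in a sortable form such as YYYY-MM; with another month format the comparison would need adjusting.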

3. HQL

Cumulative summation of a numpy array by index

【Title】: Cumulative summation of a numpy array by index 【Posted】: 2011-04-06 02:00:17 【Question】:

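Suppose you have an array of values that need to be summed

d = [1,1,1,1,1]

and a second array that specifies which elements should be summed together

i = [0,0,1,2,2]

The result is stored in a new array of size max(i)+1. So, for example, i=[0,0,0,0,0] is equivalent to summing all elements of d and storing the result at position 0 of a new array of size 1.

I tried to implement this using

c = zeros(max(i)+1)
c[i] += d

However, the += operation adds each element only once, which gives the unexpected result

[1,1,1]

instead of

[2,1,2]

How can this kind of summation be implemented correctly?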

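【Comments】:

It would be clearer if the values of d were unique. For example, with d = [0,1,2,3,4], I'm guessing that for i = [0,0,0,0,0] you want c = [10], and for i = [0,0,1,2,2] you want c = [1,2,7]?

That's right. Thanks for the clarification.

In that case, juxstapose's solution, together with the change I suggested in the comments, should do the trick.

【Answer 1】:

If I understand the question correctly, there is a fast function for this (as long as the data array is 1-D):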

>>> i = np.array([0,0,1,2,2])
>>> d = np.array([0,1,2,3,4])
>>> np.bincount(i, weights=d)
array([ 1.,  2.,  7.])

np.bincount returns an array with one entry for every integer in range(max(i)+1), even if some of the counts are zero.
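As a small sketch of the point above, the helper below wraps np.bincount and shows that labels which never occur still get an entry (with sum 0). For comparison it also includes np.add.at, a different NumPy routine that was not mentioned in this answer: it performs the unbuffered version of the c[i] += d attempt from the question. The function names sum_by_index and sum_by_index_add_at are made up for this illustration.

```python
import numpy as np

def sum_by_index(d, i):
    """Sum the values of d into bins given by the non-negative integer labels i."""
    d = np.asarray(d, dtype=float)
    i = np.asarray(i)
    # One output entry per integer in range(i.max() + 1);
    # labels that never occur get a sum of 0.
    return np.bincount(i, weights=d)

def sum_by_index_add_at(d, i):
    """Same result via np.add.at, which accumulates repeated indices."""
    i = np.asarray(i)
    c = np.zeros(i.max() + 1)
    np.add.at(c, i, d)  # unbuffered: every occurrence of an index is added
    return c

print(sum_by_index([0, 1, 2, 3, 4], [0, 0, 1, 2, 2]))         # [1. 2. 7.]
print(sum_by_index_add_at([1, 1, 1, 1, 1], [0, 0, 1, 2, 2]))  # [2. 1. 2.]
print(sum_by_index([1, 1], [0, 3]))                            # [1. 0. 0. 1.]
```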

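【Comments】:

This is the best solution for the case described here. For a general sum over labeled arrays, you can use scipy.ndimage.sum. That module also has other useful functions, such as maximum, minimum, mean, variance, ...

【Answer 2】:

Juh_'s comment is the most efficient solution. Here is working code: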

import numpy as np
import scipy.ndimage as ni

i = np.array([0,0,1,2,2])
d = np.array([0,1,2,3,4])

n_indices = i.max() + 1
print(ni.sum(d, i, np.arange(n_indices)))

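【Comments】:

【Answer 3】:

This solution should be more efficient for large arrays (it iterates over the possible index values rather than over the individual entries of i):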

import numpy as np

i = np.array([0,0,1,2,2])
d = np.array([0,1,2,3,4])

i_max = i.max()
c = np.empty(i_max+1)
for j in range(i_max+1):
    c[j] = d[i==j].sum()

print(c)
[1. 2. 7.]

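【Comments】:

【Answer 4】:

def zeros(ilen):
    # build a plain Python list of ilen zeros
    r = []
    for i in range(0, ilen):
        r.append(0)
    return r

i_list = [0,0,1,2,2]
d = [1,1,1,1,1]
result = zeros(max(i_list)+1)

# as posted, d is indexed by the label value rather than by position
for index in i_list:
    result[index] += d[index]

print(result)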

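【Comments】:

Close, but I think the OP wants for didx,ridx in enumerate(i_list): result[ridx] += d[didx]. Also, since the question is tagged [numpy], you could use numpy.zeros.

【Answer 5】:

In the general case, when you want to sum sub-matrices by label, you can use the following code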

import numpy as np
from scipy.sparse import coo_matrix

def labeled_sum1(x, labels):
     P = coo_matrix((np.ones(x.shape[0]), (labels, np.arange(len(labels)))))
     res = P.dot(x.reshape((x.shape[0], np.prod(x.shape[1:]))))
     return res.reshape((res.shape[0],) + x.shape[1:])

def labeled_sum2(x, labels):
     res = np.empty((np.max(labels) + 1,) + x.shape[1:], x.dtype)
     for i in np.ndindex(x.shape[1:]):
         res[(...,)+i] = np.bincount(labels, x[(...,)+i])
     return res

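The first method uses sparse matrix multiplication. The second is a generalization of user333700's answer. The two methods are comparable in speed: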

x = np.random.randn(100000, 10, 10)
labels = np.random.randint(0, 1000, 100000)
%time res1 = labeled_sum1(x, labels)
%time res2 = labeled_sum2(x, labels)
np.all(res1 == res2)

Output:

Wall time: 73.2 ms
Wall time: 68.9 ms
True
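To make the idea behind the matrix-multiplication method easier to see, here is a minimal sketch that builds the same label-indicator matrix as a dense NumPy array instead of a scipy.sparse coo_matrix. The dense form is a stand-in chosen only for readability (it would be wasteful for large inputs), and the name labeled_sum_dense is made up for this illustration.

```python
import numpy as np

def labeled_sum_dense(x, labels):
    """Sum the slices x[k] that share a label, using a dense indicator matrix."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n_labels = labels.max() + 1
    # P[l, k] == 1 exactly when slice k carries label l.
    P = np.zeros((n_labels, x.shape[0]))
    P[labels, np.arange(x.shape[0])] = 1.0
    # Flatten the trailing axes, multiply, then restore the original shape.
    flat = x.reshape(x.shape[0], -1)
    return (P @ flat).reshape((n_labels,) + x.shape[1:])

x = np.arange(12.0).reshape(4, 3)
labels = np.array([0, 2, 0, 2])
print(labeled_sum_dense(x, labels))
```

Row 0 of the result is the sum of the slices labeled 0, row 1 is all zeros because label 1 never occurs, and row 2 is the sum of the slices labeled 2.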

【Comments】:
