Is there any way to calculate running total with condition in Redshift?

Posted 2023-03-31

Asked: 2020-07-18 05:01:56

Question:

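I'm running a volume availability model for a package center in Redshift. In this table, Column B shows the arrival volume for each hour. The shift starts at 17:00 and ends at midnight. During that time they can process 50K packages per hour (Column C). I have a table with the first three columns, and I would like to know whether there is any way to calculate Column D in Redshift.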
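To pin down what Column D is meant to hold, here is a minimal Python sketch of the row-by-row rule. It assumes (the question does not spell the formula out in full) that each hour's backlog is the previous backlog plus arrivals minus capacity, floored at zero. The function name `backlog` is made up for illustration.

```python
def backlog(arrivals, capacity):
    """Assumed rule for Column D: D[n] = max(D[n-1] + B[n] - C[n], 0)."""
    result = []
    carried = 0  # packages still waiting from earlier hours
    for b, c in zip(arrivals, capacity):
        # Unused capacity is lost, so the backlog can never go negative.
        carried = max(carried + b - c, 0)
        result.append(carried)
    return result


print(backlog([3500, 3200, 5200, 51000], [0, 0, 50000, 50000]))
# [3500, 6700, 0, 1000]
```

Each row depends on the previous row's result, which is what makes this awkward to express with plain window functions.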

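Comments:

Welcome to *** =) Please add more details to your question so that it is easier for the rest of the community to help you: ***.com/help/how-to-ask

Hi Victor, to make it clearer: I'm trying to calculate Column D here. For example D8 =IF(B8+D7-C8

Answer 1: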

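You are correct, my earlier answer was missing a term. I spent some time on a cluster today and wrote a test case. Below are the revised SQL and the setup statements. It needed a new term that is itself a window function, and since window functions cannot be nested, it also needed another layer of select. I hope this example helps. I know that solving a recursive problem on a non-recursive database can be difficult.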

drop table if exists package_volume;

create table package_volume (
        A timestamp encode zstd,
        B int encode zstd,
        C int encode zstd);

insert into package_volume values
('2020-06-26 13:00', 0, 0),
('2020-06-26 14:00', 3500, 0),
('2020-06-26 15:00', 3200, 0),
('2020-06-26 16:00', 6500, 0),
('2020-06-26 17:00', 5200, 50000),
('2020-06-26 18:00', 51000, 50000),
('2020-06-26 19:00', 120000, 50000),
('2020-06-26 20:00', 30000, 50000),
('2020-06-26 21:00', 40000, 50000),
('2020-06-26 22:00', 15000, 50000),
('2020-06-26 23:00', 5500, 50000),
('2020-06-27 00:00', 0, 0);

commit;

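-- Outer query: backlog = arrivals - capacity, plus all capacity lost to idle time so far.
select A, B, C,
    run_tot_pack - run_tot_capacity
        + sum(unrealized_capacity) over (order by A rows unbounded preceding) as available_volume
from (
    -- Middle query: keep only the newly lost capacity, i.e. the amount by which
    -- unrealized_capacity exceeds its maximum over all earlier rows (otherwise 0).
    select A, B, C, run_tot_pack, run_tot_capacity,
        decode(unrealized_capacity - max(unrealized_capacity) over (order by A rows between unbounded preceding and 1 preceding) < 0, true, 0,
            unrealized_capacity - max(unrealized_capacity) over (order by A rows between unbounded preceding and 1 preceding)) as unrealized_capacity
    from (
        -- Inner query: running totals of arrivals and capacity, and how far
        -- capacity is ahead of arrivals so far (0 when arrivals are ahead).
        select A, B, C,
            sum(B) over (order by A rows unbounded preceding) as run_tot_pack,
            sum(C) over (order by A rows unbounded preceding) as run_tot_capacity,
            decode(run_tot_pack - run_tot_capacity < 0, true, run_tot_capacity - run_tot_pack, 0) as unrealized_capacity
        from package_volume
    )
)
order by A;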

Comments:

Wow! This is clever. It worked for me. Thanks, Bill. I really appreciate your help.

Answer 2:

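I think I know what you want, but if I haven't answered your question, please provide more detail. To get a running total you need the SUM() window function, which can sum the values of all preceding rows.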

SUM("arrived packages") over ( order by timeinterval rows unbounded preceding )

will give you the running total of "arrived packages". This is not yet what you want, but let's introduce this important function first.
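For readers who think more easily in procedural terms, a running total over ordered rows behaves like `itertools.accumulate` in Python. The values below are illustrative hourly arrivals, not output from Redshift.

```python
from itertools import accumulate

# Hourly "arrived packages", already ordered by time interval.
arrived = [3500, 3200, 6500, 5200]

# Equivalent of SUM(...) OVER (ORDER BY ... ROWS UNBOUNDED PRECEDING):
# each position holds the sum of itself and all earlier rows.
running_total = list(accumulate(arrived))
print(running_total)  # [3500, 6700, 13200, 18400]
```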

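The last requirement is where this gets tricky. You can't "store" unused capacity for later: unused capacity is lost. So not every hour that could process 50,000 packages actually does. This needs to be done in two steps (a query and a subquery). First find the running totals of arrived packages and of available throughput. Then take the difference between them, but add capacity back in whenever some of it went unused. Basically, take the simple approach and add the error back in as a final adjustment. Otherwise this becomes a recursive problem, and Redshift does not handle those well. (Sorry, the SQL below is untested, so treat it as a concept.)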

select timeinterval, "arrived packages", "throughput per hour",
    run_tot_pack - run_tot_capacity + 
        sum(decode(run_tot_pack - run_tot_capacity < 0, true, run_tot_capacity - run_tot_pack, 0)) over (order by timeinterval rows unbounded preceding) as "available volume"    
from (
    select timeinterval, "arrived packages", "throughput per hour",
        sum("arrived packages") over (order by timeinterval rows unbounded preceding) as run_tot_pack,
        sum("throughput per hour") over (order by timeinterval rows unbounded preceding) as run_tot_capacity
    from <table>
)
order by timeinterval;

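Comments:

Hi Bill, thanks for your reply. I tried this logic outside of Redshift, and I think it is close to what I want to achieve. The code generates the first 5 rows correctly, but it starts producing wrong numbers after the first zero in Column D.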
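The behaviour described in this comment can be reproduced outside Redshift. The sketch below mirrors the untested query in Python, using the sample arrivals and capacities from the test case in Answer 1; the function names and the row-by-row rule `max(previous + B - C, 0)` are assumptions made for illustration.

```python
from itertools import accumulate

B = [0, 3500, 3200, 6500, 5200, 51000, 120000, 30000, 40000, 15000, 5500, 0]
C = [0, 0, 0, 0, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 0]


def draft_query(arrivals, capacity):
    run_tot_pack = list(accumulate(arrivals))
    run_tot_capacity = list(accumulate(capacity))
    # The draft sums the full unused capacity of every row, so capacity lost
    # once is counted again on each later row where it is still outstanding.
    unused = [max(c - p, 0) for p, c in zip(run_tot_pack, run_tot_capacity)]
    adjustment = accumulate(unused)
    return [p - c + a for p, c, a in zip(run_tot_pack, run_tot_capacity, adjustment)]


def row_by_row(arrivals, capacity):
    # Assumed target rule: carry the backlog forward, never below zero.
    result, carried = [], 0
    for b, c in zip(arrivals, capacity):
        carried = max(carried + b - c, 0)
        result.append(carried)
    return result


draft = draft_query(B, C)
target = row_by_row(B, C)
first_bad = next(i for i, (d, t) in enumerate(zip(draft, target)) if d != t)
print(first_bad, draft[first_bad], target[first_bad])  # 5 31600 1000
```

The first five rows agree, and the sixth row (index 5, right after the first zero) is where the double-counted capacity first shows up.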
