Optimize a necessary while loop with insert statement in SQL Server

Posted 2023-03-15


【Title】: Optimize a necessary while loop with insert statement in SQL Server 【Posted】: 2021-04-19 16:34:21 【Question】:

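I have a series of procedures that run some math in a while loop. They calculate running averages, where knowing the previously calculated value is absolutely necessary to get the next value. Each item is paired with an array of values, ordered by date, and 10 different methods are used to calculate the running average.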

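To answer some questions in advance: the array length is the same for every id within a run, but there may be nulls where data is missing. I don't know in advance how many ids will have rolling averages calculated, and it varies every time. I don't know in advance how many iterations are necessary, and that varies every time too. Despite working on the calculus for more than a year, I have not found a mathematical way to flatten the system into a single select/join. 5 of the methods are nonlinear, and I only had 4 semesters of calculus at university, so I can't work out a flattened solution (I did for 5 of the 10 methods I'm working with, but the remaining 5 are either too complex or require prior knowledge of the order in which the values arrive, which is different on every run).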

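My code runs, and the 5 methods that use while loops do complete, but each method takes about 15 to 30 minutes to run for 5000 items (each with its own iterations and calculations). I need to be able to scale to 300000 items, so 15 minutes is untenable, especially since this is one of dozens of procedures running in the database.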

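Below is a code sample for one of my methods. I'm not looking for help with the math portion (the math is different for each of the 5 procedures that need a loop), but with the insert statement that serves as the intent behind the loop: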

create proc analysis.calculateavg_logarithmicshifted
    @maxi int --max number of iterations determined by separate proc
as
    
declare @icount int = 1;
declare @n int;
select @n = max(n) from #temp_pink; --comes from the procedure that calls this one, and is a way to identify the max array size
    
drop table if exists analysis.avglogarithmicshifted;
    
create table analysis.avglogarithmicshifted(    
    id nvarchar(64),    
    i int,  
    mu decimal(19, 6),  
    insertdate datetime,    
    avgname nvarchar(64)    
);
    
drop table if exists #temp_k;

--Pull only the historic data from the source
select  
    id, 
    k,  
    price
into #temp_k
from #temp_bbfull --original list of data, n elements per id
where history = 1;
    
while @icount <= @maxi  
    begin;      

    drop table if exists #temp_premu;
    
    --Calculate one sub-section of the rolling average
    select          
        id,         
        sum(log(price)) / (@n - sum(case when price is null then 1 else 0 end)) as premu        
    into #temp_premu        
    from #temp_k        
    where k between @icount and @icount + @n
    group by id;        
    
    drop table if exists #temp_f;           
    
    --Calculate the main component of the rolling average
    select          
        k.id,           
        @icount + @n as k,          
        exp(p.premu + (sum(power(log(k.price) - p.premu, 2)) / (2 * (@n - sum(case when k.price is null then 1 else 0 end))))) as price     
    into #temp_f        
    from #temp_k k          
    join #temp_premu p              
        on p.id= k.id           
    join #temp_bbfull bb                
        on bb.id= k.id              
        and bb.k = k.k      
    where k.k between @icount 
        and @icount + @n
    group by k.id, p.premu;         
    
    --Insert this iteration's rolling average into the table with incremented identifier k
    insert into #temp_k (id, k, price)      
    select          
        *       
    from #temp_f        
    where price is not null;        
    
    select @icount = @icount + 1;   
end;

--Insert final aggregated data into destination table
insert into analysis.avglogarithmicshifted (id, i, mu, insertdate, avgname)
select  
    k.id,   
    k.k - @n as I,  
    k.price as mu,  
    getdate() as insertdate,    
    'Logarithmic Shifted' as avgname
from #temp_k k
where k.k > @n  
    and k.price is not null;

Names and identifiers have been changed from my original code, but nothing else. Any help would be appreciated.

I'm using SQL Server 17.9.1

【Comments on the question】:

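Is the WHILE honestly really necessary? A WHILE is rarely needed in SQL; very rarely. I doubt you really need one.

"I'm using SQL Server 17.9.1" There is no such thing; the latest version of SQL Server is 15.0.4083.2 (that is, SQL Server 2019 RTM-CU8-GDR).

We have already optimized this calculation by using a SQL-CLR aggregate function instead of SUM(LOG...

So was the TRUNCATE the only change you made? What indexes did you add? I'm working in the dark here.

(Continued) The trick to solving these problems is to find the most expensive statement. Start Profiler. Use the TSQL_SPs template and add a filter for the SPID in which you are running your proc. This will tell you which statements take the longest, but it also adds overhead. Find the statement that takes the longest and tune it. Because you are in a loop, the improvement is multiplied by the number of iterations.

What kind of volume (number of records) are you working with? What values are in each column? What is the range of values passed in as the parameter? 1-100, 1-1,000,000?

【Answer 1】: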

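Overall, without a major redesign of the code, I don't think you will get anywhere near a 100x (2 orders of magnitude) improvement. I don't understand your code logic, so I can't help much there, other than to say that SQL Server 2016 supports window functions that allow set-based calculation of moving averages. This link might be of help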

Here are some tips for optimizing it:

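- When joining multiple temp tables, make sure the tables that hold a lot of data are indexed.
- Minimize the number of temp tables used to store intermediate results.
- Keep the number of rows in the intermediate temp tables as small as possible.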

From the example you gave, I can see a few possible optimizations. Since I don't have the data, I can't check whether they help:

0 #temp_bbfull - it looks like this table is used as the data source and holds a lot of data. It needs the right index.

CREATE INDEX IX_temp_bbfull ON #temp_bbfull( id, k )

1 Restrict your working set as much as possible

select  id, k, price
into #temp_k
from #temp_bbfull --original list of data, n elements per id
-- As you are restricting iterations you might as well restrict the number of rows upfront.
-- The last iteration reads k up to @maxi + @n, so rows up to that bound are kept.
where history = 1 and k between @icount and @maxi + @n;

2 This may or may not help.

CREATE INDEX IX_temp_k ON #temp_k( id )
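
Because every statement in the loop filters #temp_k on a range of k, an index keyed on k first may be worth testing alongside the one on id. The following is only a sketch: the index name is hypothetical, the table definition is a stand-in so the statement runs on its own, and whether it helps depends on the real data.

```sql
-- Hypothetical stand-in for the question's #temp_k so the statement runs on its own.
drop table if exists #temp_k;
create table #temp_k (id nvarchar(64), k int, price decimal(19, 6));

-- Assumed index name. Keyed on k first to support the loop's range filter
-- "where k between @icount and @icount + @n"; id is the second key column
-- because the joins in the loop match on id.
create clustered index CIX_temp_k_k_id on #temp_k (k, id);
```

In the loop from the question, each iteration inserts rows at k = @icount + @n, which is the highest k so far, so new rows would land at the end of this index rather than in the middle of it.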

3 Combine the queries

select          
    k.id, @icount + @n as k,          
    exp(p.premu + (sum(power(log(k.price) - p.premu, 2)) / (2 * (@n - sum(case when k.price is null then 1 else 0 end))))) as price     
into #temp_f        
from #temp_k as k          
    join (
         select id, sum(log(price)) / (@n - sum(case when price is null then 1 else 0 end)) as premu        
         from #temp_k        
         where k between @icount and @icount + @n
         group by id ) as p
     on p.id= k.id
    join #temp_bbfull as bb on bb.id= k.id and bb.k = k.k      
where k.k between @icount and @icount + @n
group by k.id, p.premu;

And so on.
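
Taking the combined query one step further, the result could be inserted straight into #temp_k, so that neither #temp_premu nor #temp_f is dropped and recreated on every iteration. This is a minimal sketch rather than a tested replacement: the sample rows and the values of @n and @maxi are made up so the script runs on its own, and the join to #temp_bbfull is kept only to preserve the behavior of the original query.

```sql
-- Hypothetical sample data so the sketch runs on its own;
-- in the question these tables come from the calling procedure.
drop table if exists #temp_bbfull;
create table #temp_bbfull (id nvarchar(64), k int, price decimal(19, 6), history bit);
insert into #temp_bbfull (id, k, price, history)
values (N'A', 1, 10.0, 1), (N'A', 2, 11.0, 1), (N'A', 3, 12.0, 1),
       (N'A', 4, null, 0), (N'A', 5, null, 0);

drop table if exists #temp_k;
select id, k, price
into #temp_k
from #temp_bbfull
where history = 1;

declare @n int = 3;       -- assumed array size (max(n) from #temp_pink in the question)
declare @maxi int = 2;    -- assumed number of iterations
declare @icount int = 1;

while @icount <= @maxi
begin
    -- One statement per iteration: compute premu and the new value,
    -- then append the result to #temp_k with no intermediate temp tables.
    insert into #temp_k (id, k, price)
    select f.id, f.k, f.price
    from (
        select
            k.id,
            @icount + @n as k,
            exp(p.premu + (sum(power(log(k.price) - p.premu, 2))
                / (2 * (@n - sum(case when k.price is null then 1 else 0 end))))) as price
        from #temp_k as k
        join (
            select id,
                   sum(log(price)) / (@n - sum(case when price is null then 1 else 0 end)) as premu
            from #temp_k
            where k between @icount and @icount + @n
            group by id
        ) as p on p.id = k.id
        join #temp_bbfull as bb on bb.id = k.id and bb.k = k.k
        where k.k between @icount and @icount + @n
        group by k.id, p.premu
    ) as f
    where f.price is not null;

    set @icount += 1;
end;

-- Inspect the original rows plus the values appended by the loop.
select id, k, price from #temp_k order by id, k;
```

The loop still runs once per iteration, so this only trims per-iteration overhead (table drops, creations and an extra pass over the data); it does not remove the dependency on previously computed values.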

【Comments】:

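I can reduce the size of the sets I'm working with, but not the total number of IDs I need to run each week (that will only grow). Combining the queries may help, but probably only in the method I used as the example, since the other 4 have different mathematical structures. The query plan and profiling show that the insert is the killer in terms of processing time and resource usage. I'll try the indexing approach and see if it helps, though from all of this I think my best hope lies in a mathematician solving my nonlinear flattening problem.

Inserts are usually the fastest operation in SQL Server. Are you sure the insert is the problem, and not the select that produces the result? If the insert really is the problem, check that your tempDB is set up correctly (there are many articles on how to set it up for best performance).

Look at window functions. I think your logic (at least some of it) can be migrated to use them: red-gate.com/simple-talk/sql/t-sql-programming/…

I'm very familiar with window functions, but because I'm not calculating a rolling average over an existing data set, and am instead using a rolling average over data that is created as it rolls, I'm pretty sure rows preceding/between (and lead/lag, etc.) will not work. I have considered doing a windowed recursion, but since I don't know the final number of iterations before starting, and maxrecursion does not allow variables, I think it may cause some major internal conflicts. It's worth testing, though.

So it turns out that aggregate functions are not allowed in recursion, so I can't see a way to make windows work, since any iteration after the first depends on previously calculated values.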
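
To settle whether the select or the insert dominates without the overhead of Profiler, one option is to accumulate elapsed time per statement inside the loop. The sketch below uses made-up demo tables (#demo_src, #demo_f, #demo_dst) and a trivial calculation in place of the real ones; only the timing pattern is the point, and the timer resolution depends on the system clock.

```sql
-- Hypothetical demo tables standing in for the real source and destination.
drop table if exists #demo_src;
create table #demo_src (id int, price float);
insert into #demo_src (id, price) values (1, 10.0), (1, 12.0), (2, 20.0);

drop table if exists #demo_dst;
create table #demo_dst (id int, price float);

declare @t0 datetime2(7);
declare @select_us bigint = 0;   -- total microseconds spent in the select ... into
declare @insert_us bigint = 0;   -- total microseconds spent in the insert
declare @i int = 1;

while @i <= 100
begin
    -- Time the statement that produces the intermediate result.
    set @t0 = sysdatetime();
    drop table if exists #demo_f;
    select id, exp(avg(log(price))) as price
    into #demo_f
    from #demo_src
    group by id;
    set @select_us += datediff(microsecond, @t0, sysdatetime());

    -- Time the insert separately.
    set @t0 = sysdatetime();
    insert into #demo_dst (id, price)
    select id, price from #demo_f;
    set @insert_us += datediff(microsecond, @t0, sysdatetime());

    set @i += 1;
end;

-- Compare the two totals to see which statement is worth tuning.
select @select_us as select_us, @insert_us as insert_us;
```

In the real procedure the same two timers would wrap the select into #temp_f and the insert into #temp_k.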
