Non-clustered columnstore index vs traditional non-clustered rowstore index on bigint field

Posted 2023-03-25


【Title】: Non-clustered columnstore index vs traditional non-clustered rowstore index on bigint field
【Posted】: 2020-05-22 17:34:03
【Question】:

I am reading data from a table with the following structure and indexes:


SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[Bets](
    [bwd_BetTicketNr] [bigint] NOT NULL,
    [bwd_LineID] [int] NOT NULL,
    [bwd_ResultID] [bigint] NOT NULL,
    [bwd_LineStake] [bigint] NULL,
    [bwd_CreatedAt] [datetime] NULL,
    [bwd_DateModified] [datetime] NULL,
 CONSTRAINT [PK_BetwayDetails] PRIMARY KEY CLUSTERED 
(
    [bwd_BetTicketNr] ASC,
    [bwd_LineID] ASC,
    [bwd_ResultID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [idx___Bets__bwd_CreatedAt] ON [dbo].[Bets]
(
    [bwd_CreatedAt] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO


CREATE NONCLUSTERED INDEX [idx___Bets__bwd_DateModified] ON [dbo].[Bets]
(
    [bwd_DateModified] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

CREATE NONCLUSTERED COLUMNSTORE INDEX [nccs___Bets] ON [dbo].[Bets]
(
    [bwd_BetTicketNr]
)WITH (DROP_EXISTING = OFF, COMPRESSION_DELAY = 0) ON [PRIMARY]
GO

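I want to understand why the table's developer chose a non-clustered columnstore index on the bwd_BetTicketNr column, rather than a classic rowstore index like the ones on the date columns.

The production table has about 6 billion rows, with about 50 million distinct bwd_BetTicketNr values.

Running queries against test tables of up to 50 million rows gives similar performance with rowstore and columnstore, so I cannot simulate how the two scale. Is columnstore the better fit because of the data type and cardinality?

I have looked for similar questions, posts, or blogs that make this kind of comparison, but I have not found anything.

I am using SQL Server 2017.

【Comments】: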

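A nice side effect of this is that you can get batch mode even for queries that do not use the columnstore index (batch mode on rowstore is not available before SQL Server 2019). An index added solely for that purpose would usually be a filtered one, though.

【Answer 1】: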

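Columnstore indexes are optimized for compression and scan speed. For queries that require an index scan, logical IO and CPU should drop significantly.

The likely targets here are: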

select count(*) from bets

which would otherwise scan one of the non-clustered indexes,

or

select count(*) from bets where bwd_BetTicketNr = @tn

which would otherwise do a partial range scan of the clustered index.

For example:

SET QUOTED_IDENTIFIER ON
GO
--drop table if exists bets
go

CREATE TABLE [dbo].[Bets](
    [bwd_BetTicketNr] [bigint] NOT NULL,
    [bwd_LineID] [int] NOT NULL,
    [bwd_ResultID] [bigint] NOT NULL,
    [bwd_LineStake] [bigint] NULL,
    [bwd_CreatedAt] [datetime] NULL,
    [bwd_DateModified] [datetime] NULL,
 CONSTRAINT [PK_BetwayDetails] PRIMARY KEY CLUSTERED 
(
    [bwd_BetTicketNr] ASC,
    [bwd_LineID] ASC,
    [bwd_ResultID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [idx___Bets__bwd_CreatedAt] ON [dbo].[Bets]
(
    [bwd_CreatedAt] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO


CREATE NONCLUSTERED INDEX [idx___Bets__bwd_DateModified] ON [dbo].[Bets]
(
    [bwd_DateModified] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

CREATE NONCLUSTERED COLUMNSTORE INDEX [nccs___Bets] ON [dbo].[Bets]
(
    [bwd_BetTicketNr]
)WITH (DROP_EXISTING = OFF, COMPRESSION_DELAY = 0) ON [PRIMARY]
GO

with 
n as 
(
    select top (10*1000*1000) row_number() over (order by (select null)) i
    from sys.objects o, sys.messages m

), q as
(
  select 
    i [bwd_BetTicketNr] 
   ,i [bwd_LineID] 
   ,i [bwd_ResultID] 
   ,i [bwd_LineStake]
   ,getdate() [bwd_CreatedAt]
   ,getdate() [bwd_DateModified]
   from n
)
insert into Bets
select * from q;

-- Rebuild every index so the comparison below runs against freshly built structures
alter index all on Bets rebuild with (online=off);

set statistics io on
set statistics time on

select count(*) from bets with (index=[idx___Bets__bwd_DateModified])

set statistics io off
set statistics time off
/*
Table 'Bets'. Scan count 1, logical reads 42229, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 563 ms,  elapsed time = 563 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
*/

set statistics io on
set statistics time on

select count(*) from bets with (index=[nccs___Bets])

set statistics io off
set statistics time off
/*
Table 'Bets'. Scan count 2, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 9816, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Bets'. Segment reads 10, segment skipped 0.

 SQL Server Execution Times:
   CPU time = 15 ms,  elapsed time = 19 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
*/
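The second candidate query can be compared in the same way. The following is only a sketch against the test table built above: the ticket number is an arbitrary value chosen for illustration, and both index hints are there purely to force each access path so the statistics output can be compared.

```sql
-- Arbitrary ticket number for illustration; it exists in the generated test data (1 .. 10,000,000)
declare @tn bigint = 12345;

set statistics io on;
set statistics time on;

-- Partial range scan (seek) on the clustered primary key
select count(*) from Bets with (index=[PK_BetwayDetails]) where bwd_BetTicketNr = @tn;

-- Same predicate forced through the non-clustered columnstore index
select count(*) from Bets with (index=[nccs___Bets]) where bwd_BetTicketNr = @tn;

set statistics io off;
set statistics time off;
```

On the synthetic data every ticket number is unique, so this will not reproduce the production distribution (roughly 6 billion rows over 50 million tickets); treat the numbers as a rough indication only.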

【Comments】:

To help further: this column is used heavily in JOIN and GROUP BY clauses.

【Answer 2】:

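In general, a scan of a columnstore index is faster than an index scan of a rowstore index, even when the rowstore index was created with ROW or PAGE compression. (Note: the degree of parallelism (DOP) for batch mode operations is limited to 2 on SQL Server Standard Edition and to 1 on SQL Server Web and Express Editions. This applies to columnstore indexes created over disk-based tables and memory-optimized tables.) I have seen examples where a columnstore index scan using two threads took more elapsed time than a rowstore index scan using many threads (for example, 8). Even so, the worker time for the columnstore index may actually be lower.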
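One way to look at worker time and elapsed time side by side is the sys.dm_exec_query_stats DMV. The query below is a sketch rather than part of the original answer: the LIKE filter on the table name and the TOP (20) cutoff are assumptions chosen for illustration, and the DMV only covers plans that are still in the cache.

```sql
-- Compare CPU (worker) time with wall-clock (elapsed) time for cached statements touching Bets.
-- total_worker_time and total_elapsed_time are reported in microseconds.
SELECT TOP (20)
    st.text AS sql_text,
    qs.execution_count,
    qs.total_worker_time  / 1000 AS total_worker_ms,
    qs.total_elapsed_time / 1000 AS total_elapsed_ms
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%Bets%'   -- crude text filter, assumed for illustration
ORDER BY qs.total_worker_time DESC;
```

For a parallel plan, worker time well above elapsed time is expected, since CPU time is summed across threads.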

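Before we ask "why?", let's see what the database engine has to say...

Is the index being used? Index usage statistics can tell us whether an index is being used for seeks, scans, or lookups. This query shows what is happening with the indexes on dbo.Bets. Columnstore indexes are not used for seeks, so pay close attention to user_scans in the results.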

USE [your database]
GO

SELECT i.name, ius.*
FROM sys.indexes i
LEFT JOIN sys.dm_db_index_usage_stats ius
    ON ius.object_id = i.object_id
    AND ius.index_id = i.index_id
    AND ius.database_id = DB_ID()
WHERE i.object_id = object_id('dbo.Bets')
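If ius.* is too wide to read comfortably, a narrower projection along the following lines may help. This is a sketch; the choice of columns is only a suggestion. Keep in mind that these counters are reset when the SQL Server service restarts, so a zero may simply mean the index has not been touched since the last restart.

```sql
-- Narrow view of index usage for dbo.Bets: how each index is read and how often it is maintained
SELECT
    i.name       AS index_name,
    i.type_desc  AS index_type,
    ius.user_seeks,
    ius.user_scans,
    ius.user_lookups,
    ius.user_updates,
    ius.last_user_scan
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS ius
    ON ius.object_id = i.object_id
    AND ius.index_id = i.index_id
    AND ius.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('dbo.Bets')
ORDER BY i.index_id;
```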

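How is the index being used? Assuming the number of scans on the columnstore index is greater than zero, you can search the plan cache for execution plans that use it. Try this query, which comes from Jonathan Kehayias's article Finding what queries in the plan cache use a specific index: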

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
DECLARE @IndexName AS NVARCHAR(128) = 'nccs___Bets';

-- Make sure the name passed is appropriately quoted
IF (LEFT(@IndexName, 1) <> '[' AND RIGHT(@IndexName, 1) <> ']') SET @IndexName = QUOTENAME(@IndexName);
-- Handle the case where the left or right was quoted manually but not the opposite side
IF LEFT(@IndexName, 1) <> '[' SET @IndexName = '['+@IndexName;
IF RIGHT(@IndexName, 1) <> ']' SET @IndexName = @IndexName + ']';

-- Dig into the plan cache and find all plans using this index
;WITH XMLNAMESPACES
    (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')   
SELECT
stmt.value('(@StatementText)[1]', 'varchar(max)') AS SQL_Text,
obj.value('(@Database)[1]', 'varchar(128)') AS DatabaseName,
obj.value('(@Schema)[1]', 'varchar(128)') AS SchemaName,
obj.value('(@Table)[1]', 'varchar(128)') AS TableName,
obj.value('(@Index)[1]', 'varchar(128)') AS IndexName,
obj.value('(@IndexKind)[1]', 'varchar(128)') AS IndexKind,
cp.plan_handle,
query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS batch(stmt)
CROSS APPLY stmt.nodes('.//IndexScan/Object[@Index=sql:variable("@IndexName")]') AS idx(obj)
OPTION(MAXDOP 1, RECOMPILE);

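So, back to the original question: why did the developer decide to use these indexes? If the columnstore index is in use, examining the relevant query plans and their underlying queries should be quite illuminating.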
