Create a group-able ID, perhaps with RANK or ROW_NUMBER, to concatenate row values with elusive sequential alternations in SQL Server
Posted: 2017-05-06 02:27:24
Question:
Before you pass judgment: this question differs from most others on the topic. Yes, I do want to concatenate the text from certain rows. In most other cases, however, the rows to be concatenated already share an ID value. In my case I apparently need to create that ID value myself, and the problem is elusive because I can't get ROW_NUMBER() or RANK() to partition the values the way I'm looking for.
In the data, as ID increases sequentially, I want to set a column value much like ROW_NUMBER(), except that it should move to a new value each time SpeakerID changes.
My data looks like this:
<table><tbody><tr><th>ID</th><th>ConversationLine</th><th>SpeakerName</th><th>SpeakerID</th><th>TeacherLineIfSpeaking</th><th>StudentLineIfSpeaking</th><th>TeacherIDifSpeaking</th><th>StudentIDifSpeaking</th><th>CleanLineID</th><th>ConvID</th></tr><tr><td>1</td><td> Hi! Let's look over your problem again. Would you like me to type or talk?</td><td>Mr. Roberts </td><td>299875</td><td> Hi! Let's look over your problem again. Would you like me to type or talk?</td><td>NULL</td><td>299875</td><td>NULL</td><td>1</td><td>1</td></tr><tr><td>2</td><td> Hi Gabriela... which phone has the larger area for the screen?</td><td>Mr. Roberts </td><td>299875</td><td> Hi Gabriela... which phone has the larger area for the screen?</td><td>NULL</td><td>299875</td><td>NULL</td><td>2</td><td>1</td></tr><tr><td>3</td><td> The new phone right?</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> The new phone right?</td><td>NULL</td><td>9695521</td><td>3</td><td>1</td></tr><tr><td>4</td><td> correct....</td><td>Mr. Roberts </td><td>299875</td><td> correct....</td><td>NULL</td><td>299875</td><td>NULL</td><td>4</td><td>1</td></tr><tr><td>5</td><td> what will you need to do to calculate the area of either screen since we can assume the shape is a rectangle?</td><td>Mr. Roberts </td><td>299875</td><td> what will you need to do to calculate the area of either screen since we can assume the shape is a rectangle?</td><td>NULL</td><td>299875</td><td>NULL</td><td>5</td><td>1</td></tr><tr><td>6</td><td> I don't know ?</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> I don't know ?</td><td>NULL</td><td>9695521</td><td>6</td><td>1</td></tr><tr><td>7</td><td> Area of a rectangle = length x width</td><td>Mr. Roberts </td><td>299875</td><td> Area of a rectangle = length x width</td><td>NULL</td><td>299875</td><td>NULL</td><td>7</td><td>1</td></tr><tr><td>8</td><td> start with 'difference in areas = '</td><td>Mr. 
Roberts </td><td>299875</td><td> start with 'difference in areas = '</td><td>NULL</td><td>299875</td><td>NULL</td><td>8</td><td>1</td></tr><tr><td>9</td><td> after you clear your student answer box</td><td>Mr. Roberts </td><td>299875</td><td> after you clear your student answer box</td><td>NULL</td><td>299875</td><td>NULL</td><td>9</td><td>1</td></tr><tr><td>10</td><td> I already did</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> I already did</td><td>NULL</td><td>9695521</td><td>10</td><td>1</td></tr></tbody></table>
What I want is something like this (note the new column, second from the left):
<table><tbody><tr><th>ID</th><th>ChatID</th><th>ConversationLine</th><th>SpeakerName</th><th>SpeakerID</th><th>TeacherLineIfSpeaking</th><th>StudentLineIfSpeaking</th><th>TeacherIDifSpeaking</th><th>StudentIDifSpeaking</th><th>CleanLineID</th><th>ConvID</th></tr><tr><td>1</td><td> 1</td><td> Hi! Let's look over your problem again. Would you like me to type or talk?</td><td>Mr. Roberts </td><td>299875</td><td> Hi! Let's look over your problem again. Would you like me to type or talk?</td><td>NULL</td><td>299875</td><td>NULL</td><td>1</td><td>1</td></tr><tr><td>2</td><td> 1</td><td> Hi Gabriela... which phone has the larger area for the screen?</td><td>Mr. Roberts </td><td>299875</td><td> Hi Gabriela... which phone has the larger area for the screen?</td><td>NULL</td><td>299875</td><td>NULL</td><td>2</td><td>1</td></tr><tr><td>3</td><td> 2</td><td> The new phone right?</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> The new phone right?</td><td>NULL</td><td>9695521</td><td>3</td><td>1</td></tr><tr><td>4</td><td> 3</td><td> correct....</td><td>Mr. Roberts </td><td>299875</td><td> correct....</td><td>NULL</td><td>299875</td><td>NULL</td><td>4</td><td>1</td></tr><tr><td>5</td><td> 3</td><td> what will you need to do to calculate the area of either screen since we can assume the shape is a rectangle?</td><td>Mr. Roberts </td><td>299875</td><td> what will you need to do to calculate the area of either screen since we can assume the shape is a rectangle?</td><td>NULL</td><td>299875</td><td>NULL</td><td>5</td><td>1</td></tr><tr><td>6</td><td> 4</td><td> I don't know ?</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> I don't know ?</td><td>NULL</td><td>9695521</td><td>6</td><td>1</td></tr><tr><td>7</td><td> 5</td><td> Area of a rectangle = length x width</td><td>Mr. 
Roberts </td><td>299875</td><td> Area of a rectangle = length x width</td><td>NULL</td><td>299875</td><td>NULL</td><td>7</td><td>1</td></tr><tr><td>8</td><td> 5</td><td> start with 'difference in areas = '</td><td>Mr. Roberts </td><td>299875</td><td> start with 'difference in areas = '</td><td>NULL</td><td>299875</td><td>NULL</td><td>8</td><td>1</td></tr><tr><td>9</td><td> 5</td><td> after you clear your student answer box</td><td>Mr. Roberts </td><td>299875</td><td> after you clear your student answer box</td><td>NULL</td><td>299875</td><td>NULL</td><td>9</td><td>1</td></tr><tr><td>10</td><td> 6</td><td> I already did</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> I already did</td><td>NULL</td><td>9695521</td><td>10</td><td>1</td></tr></tbody></table>
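I realize that once I have a ChatID to group on, I can do the actual concatenation with a recursive CTE, STUFF(..) with FOR XML, COALESCE with a variable, a CLR function, and so on.
I'm running SQL Server 2016.
One more thing I should mention: the length of a run is unpredictable. A speaker may post 40 consecutive messages (i.e. rows) before the speaker changes, so a technique that uses a fixed number of inner joins won't be sufficient. The solution also needs to perform reasonably, because this database has more than 16 million rows.
Here is a trimmed-down version of the data in tabular form (minus some extra columns that made the formatting less tidy). Input: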
╔════╦═══════════════════════╦══════════════╦═══════════╦═════════════╦════════╗
║ ID ║ ConversationLine ║ SpeakerName ║ SpeakerID ║ CleanLineID ║ ConvID ║
╠════╬═══════════════════════╬══════════════╬═══════════╬═════════════╬════════╣
║ 1 ║ Hi! Let's look... ║ Mr. Roberts ║ 299875 ║ 1 ║ 1 ║
║ 2 ║ Hi Gabriela... ║ Mr. Roberts ║ 299875 ║ 2 ║ 1 ║
║ 3 ║ The new phone right? ║ Gabriela ║ 9695521 ║ 3 ║ 1 ║
║ 4 ║ correct.... ║ Mr. Roberts ║ 299875 ║ 4 ║ 1 ║
║ 5 ║ what will you ...? ║ Mr. Roberts ║ 299875 ║ 5 ║ 1 ║
║ 6 ║ I don't know ? ║ Gabriela ║ 9695521 ║ 6 ║ 1 ║
║ 7 ║ Area of = ... ║ Mr. Roberts ║ 299875 ║ 7 ║ 1 ║
║ 8 ║ start with ... ║ Mr. Roberts ║ 299875 ║ 8 ║ 1 ║
║ 9 ║ after you ... ║ Mr. Roberts ║ 299875 ║ 9 ║ 1 ║
║ 10 ║ I already did ║ Gabriela ║ 9695521 ║ 10 ║ 1 ║
╚════╩═══════════════════════╩══════════════╩═══════════╩═════════════╩════════╝
And the desired output:
╔════╦════════╦══════════════════════╦═════════════╦═══════════╦═════════════╦════════╗
║ ID ║ ChatID ║ ConversationLine ║ SpeakerName ║ SpeakerID ║ CleanLineID ║ ConvID ║
╠════╬════════╬══════════════════════╬═════════════╬═══════════╬═════════════╬════════╣
║ 1 ║ 1 ║ Hi! Let's look... ║ Mr. Roberts ║ 299875 ║ 1 ║ 1 ║
║ 2 ║ 1 ║ Hi Gabriela... ║ Mr. Roberts ║ 299875 ║ 2 ║ 1 ║
║ 3 ║ 2 ║ The new phone right? ║ Gabriela ║ 9695521 ║ 3 ║ 1 ║
║ 4 ║ 3 ║ correct.... ║ Mr. Roberts ║ 299875 ║ 4 ║ 1 ║
║ 5 ║ 3 ║ what will you ...? ║ Mr. Roberts ║ 299875 ║ 5 ║ 1 ║
║ 6 ║ 4 ║ I don't know ? ║ Gabriela ║ 9695521 ║ 6 ║ 1 ║
║ 7 ║ 5 ║ Area of = ... ║ Mr. Roberts ║ 299875 ║ 7 ║ 1 ║
║ 8 ║ 5 ║ start with ... ║ Mr. Roberts ║ 299875 ║ 8 ║ 1 ║
║ 9 ║ 5 ║ after you ... ║ Mr. Roberts ║ 299875 ║ 9 ║ 1 ║
║ 10 ║ 6 ║ I already did ║ Gabriela ║ 9695521 ║ 10 ║ 1 ║
╚════╩════════╩══════════════════════╩═════════════╩═══════════╩═════════════╩════════╝
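Edit: For those interested in the query execution plans of the proposed solutions, here is the estimated query execution plan for @kannan-kandasamy's solution (you may need to open it in a new window to zoom in, because the picture is very wide):
And here is the estimated query execution plan for @vkp's solution:
Edit 2: Here they are in the same batch:
Interestingly, when I run them in the same batch, the plan shows @vkp's solution taking 99% of the batch cost. However, when I run both queries as SELECT INTO a new table, @vkp's solution finishes in less than 1/5 of the time.
Here are the clustered index scan properties for @kannan-kandasamy's solution: The clustered index scan properties for @vkp's solution appear identical for every statistic and logical value (except that the estimated operator cost is reported as 1% for @vkp's solution but 91% for @kannan-kandasamy's, even though the actual operator cost values are the same).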
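The cost percentages shown in a plan are derived from the optimizer's estimates, even when an actual plan is displayed, so they can disagree with observed run time as described above. A minimal sketch for comparing measured time and I/O instead, assuming a small hypothetical temp table `#demo` in place of the real 16-million-row table:

```sql
-- Hypothetical stand-in for the real table.
CREATE TABLE #demo (ID int IDENTITY(1,1) PRIMARY KEY, SpeakerID int);
INSERT INTO #demo (SpeakerID) VALUES (299875), (299875), (9695521), (299875);

-- Report CPU/elapsed time and logical reads in the Messages tab.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Run each candidate query here, one at a time, and compare the reported figures.
SELECT ID,
       SpeakerID,
       ROW_NUMBER() OVER (ORDER BY ID)
         - ROW_NUMBER() OVER (PARTITION BY SpeakerID ORDER BY ID) AS grp
FROM #demo;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

DROP TABLE #demo;
```

On the real table, the elapsed-time figures from this kind of run are a more direct basis for comparison than the batch-relative percentages.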
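Comments:
Could you post the input and output in tabular format?
OK, hold on.
@vkp Is that better?
Yes, much better. I ran the earlier snippet and understand what you are looking for.
Answer 1:
This can be done with a difference-of-row-numbers approach. (Run the innermost query to see how consecutive rows with the same speakerid are assigned to the same group.) Then get the starting id of each group and use dense_rank to produce chat_id in the required order.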
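-- Outermost query: number the groups in order of their first id.
select t.*, dense_rank() over (order by id_strt) as chat_id
from (
    -- Middle query: first id of each run of consecutive rows by one speaker.
    select t.*, min(id) over (partition by grp, speakerid) as id_strt
    from (
        -- Innermost query: the difference of the two row numbers is constant within a run.
        select t.*,
               row_number() over (order by id)
             - row_number() over (partition by speakerid order by id) as grp
        from t
    ) t
) t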
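To see why the difference of the two row numbers identifies a run, it may help to look at the intermediate values. The sketch below assumes a hypothetical table variable `@t` holding only the first six rows of the question's data; the column aliases `rn_all` and `rn_speaker` are illustrative names, not part of the answer.

```sql
-- Hypothetical table variable with the first six rows from the question.
DECLARE @t TABLE (id int PRIMARY KEY, speakerid int);
INSERT INTO @t (id, speakerid) VALUES
    (1, 299875), (2, 299875), (3, 9695521),
    (4, 299875), (5, 299875), (6, 9695521);

SELECT id,
       speakerid,
       ROW_NUMBER() OVER (ORDER BY id) AS rn_all,
       ROW_NUMBER() OVER (PARTITION BY speakerid ORDER BY id) AS rn_speaker,
       ROW_NUMBER() OVER (ORDER BY id)
         - ROW_NUMBER() OVER (PARTITION BY speakerid ORDER BY id) AS grp
FROM @t
ORDER BY id;

-- id  rn_all  rn_speaker  grp
--  1     1        1        0
--  2     2        2        0
--  3     3        1        2
--  4     4        3        1
--  5     5        4        1
--  6     6        2        4
```

Within a run both row numbers advance together, so their difference stays constant; when another speaker interrupts, only `rn_all` advances and the difference changes. The grp value is only meaningful together with speakerid, because two different speakers can end up with the same difference.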
If you only need chat_id to identify a group and concatenate the values, the innermost query is sufficient. When grouping, just group by grp, speakerid.
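As an illustration of that last point, here is a minimal sketch that groups by grp, speakerid and concatenates each run with STUFF and FOR XML PATH, one of the techniques the question mentions and one that is available on SQL Server 2016. The table variable `@t`, the CTE name `g`, the output aliases, and the single-space separator are assumptions made for the example.

```sql
-- Hypothetical sample: four rows from the question, with shortened text.
DECLARE @t TABLE (id int PRIMARY KEY, speakerid int, conversationline nvarchar(400));
INSERT INTO @t (id, speakerid, conversationline) VALUES
    (1, 299875,  N'Hi! Let''s look...'),
    (2, 299875,  N'Hi Gabriela...'),
    (3, 9695521, N'The new phone right?'),
    (4, 299875,  N'correct....');

WITH g AS (
    -- Same grp expression as the innermost query of the answer.
    SELECT id, speakerid, conversationline,
           ROW_NUMBER() OVER (ORDER BY id)
             - ROW_NUMBER() OVER (PARTITION BY speakerid ORDER BY id) AS grp
    FROM @t
)
SELECT MIN(g.id) AS first_id,
       g.speakerid,
       -- Concatenate the run's lines in id order; STUFF removes the leading separator.
       STUFF((SELECT N' ' + g2.conversationline
              FROM g AS g2
              WHERE g2.grp = g.grp AND g2.speakerid = g.speakerid
              ORDER BY g2.id
              FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 1, N'') AS combined_line
FROM g
GROUP BY g.grp, g.speakerid
ORDER BY first_id;
```

The `TYPE ... .value()` step keeps characters such as `<` or `&` from being returned as XML entities. On a table with millions of rows it would probably be worth materializing the grp values into an indexed temp table first, since the correlated subquery otherwise re-reads the CTE for every group.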
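Comments:
This is a very fast solution!
FYI, I needed to edit it to include convID at the end of both PARTITION BY clauses, like this: ( PARTITION BY grp, speakerid, convID ) and ( PARTITION BY SpeakerID, convID ORDER BY ID ). Otherwise it does not produce a new chat_id when the conversation changes while the same speaker is still talking (e.g. to a different person).
@devinbost: Since the question showed only one convID, I wasn't sure whether to include it. Based on what you describe it should be included, and you have already done that.
Answer 2:
You can achieve this with lag and a windowed sum: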
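-- A run starts wherever speakerid differs from the previous row's value; the running count of those starts, plus 1, is the ChatId.
select *, ChatId = sum(case when speakerid <> NextSpeakerid then 1 else 0 end) over(order by id)+1 from (
    -- Despite its name, NextSpeakerid holds the previous row's speakerid (lag); it is null on the first row.
    select *, NextSpeakerid = lag(speakerid, 1, null) over(order by id) from #yourgroup
) a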
Output of this query:
+----+--------------+-----------+--------+
| ID | SpeakerName | speakerid | ChatID |
+----+--------------+-----------+--------+
| 1 | Mr. Roberts | 25239875 | 1 |
| 2 | Mr. Roberts | 25239875 | 1 |
| 3 | Gabriela | 19645521 | 2 |
| 4 | Mr. Roberts | 25239875 | 3 |
| 5 | Mr. Roberts | 25239875 | 3 |
| 6 | Gabriela | 19645521 | 4 |
| 7 | Mr. Roberts | 25239875 | 5 |
| 8 | Mr. Roberts | 25239875 | 5 |
| 9 | Mr. Roberts | 25239875 | 5 |
| 10 | Gabriela | 19645521 | 6 |
+----+--------------+-----------+--------+
Sample table:
create table #yourgroup (ID int identity(1,1), speakername varchar(20), speakerid int)
insert into #yourgroup ( speakername, speakerid) values
('Mr. Roberts ', 25239875 )
,('Mr. Roberts ', 25239875 )
,('Gabriela ', 19645521 )
,('Mr. Roberts ', 25239875 )
,('Mr. Roberts ', 25239875 )
,('Gabriela ', 19645521 )
,('Mr. Roberts ', 25239875 )
,('Mr. Roberts ', 25239875 )
,('Mr. Roberts ', 25239875 )
,('Gabriela ', 19645521 )
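The sample table above has a single conversation. As a sketch only, the same lag-and-sum idea could be extended to several conversations by partitioning the lag by conversation. The temp table `#chat`, its ConvID values, the third speaker id, and the `IsNewRun` alias are hypothetical, and the query assumes that the rows of each conversation are contiguous in ID order.

```sql
-- Hypothetical data: conversation 1 ends and conversation 2 begins with the same speaker.
CREATE TABLE #chat (ID int IDENTITY(1,1) PRIMARY KEY, ConvID int, SpeakerID int);
INSERT INTO #chat (ConvID, SpeakerID) VALUES
    (1, 299875), (1, 9695521), (1, 299875),
    (2, 299875), (2, 299875), (2, 1234567);

SELECT ID, ConvID, SpeakerID,
       -- Running count of run starts; an explicit ROWS frame replaces the default RANGE frame.
       ChatID = SUM(IsNewRun) OVER (ORDER BY ID ROWS UNBOUNDED PRECEDING)
FROM (
    SELECT *,
           -- LAG is NULL on the first row of each conversation, so that row always starts a run.
           IsNewRun = CASE WHEN LAG(SpeakerID) OVER (PARTITION BY ConvID ORDER BY ID) = SpeakerID
                           THEN 0 ELSE 1 END
    FROM #chat
) AS a
ORDER BY ID;

-- ChatID for IDs 1..6: 1, 2, 3, 4, 4, 5

DROP TABLE #chat;
```

Rows 3 and 4 have the same speaker but receive different ChatID values because the conversation changes between them. The explicit ROWS frame is a design choice: a running SUM with only ORDER BY defaults to a RANGE frame, which SQL Server generally evaluates less efficiently; whether that affects the timings discussed in this thread has not been measured here.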
Comments:
I appreciate the simplicity of this approach, but it is more than 5 times slower than @vkp's, possibly because of the SUM operation.
Technically it shouldn't be that slow. Can you post the execution plan so that we can check further?
I edited my post to include pictures of the query execution plans.
Can you try executing both queries in the same batch? In my query, 91% is the clustered index scan itself...
Any idea why I'm getting these results?