Primary key in a nested structure in a ClickHouse database

Posted 2023-03-25

Original title: Primary key in nested structure on clickhouse database
Asked: 2018-12-18 17:24:06

Question:

In ClickHouse, I created a table with a nested structure:

CREATE TABLE IF NOT EXISTS table_name (
    timestamp Date,
    str_1 String,
    Nested_structure Nested (
        index_array UInt32,
        metric_2 UInt64,
        metric_3 UInt8
    ),
    sign Int8 DEFAULT 1
) ENGINE = CollapsingMergeTree(sign) PARTITION BY (toYYYYMM(timestamp)) ORDER BY (timestamp, str_1)

The query I am going to run is the following:

 SELECT count(*) AS count FROM table_name
 WHERE (timestamp = '2017-09-01')
 AND
 arrayFirst((i, x) -> x = 7151, Nested_structure.metric_2, Nested_structure.index_array) > 50000

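I want to count the str_1 rows for which the value of the (array) column metric_2, taken at the position where index_array equals 7151, is greater than a given threshold (50000).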
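
To make the lookup concrete, here is a minimal sketch on literal arrays instead of table columns. The array values are made up for illustration; the lambda parameters are renamed to m (an element of the first array, standing in for metric_2) and idx (the element at the same position in the second array, standing in for index_array).

```sql
-- arrayFirst returns the first element of the FIRST array for which
-- the lambda is non-zero; the lambda sees the arrays element by element.
SELECT arrayFirst(
    (m, idx) -> idx = 7151,   -- match on the index_array stand-in
    [10, 60000, 30],          -- stand-in for Nested_structure.metric_2
    [1, 7151, 9]              -- stand-in for Nested_structure.index_array
) AS metric_at_7151;          -- 7151 sits at position 2, so this yields 60000
```

When no element matches, arrayFirst returns the default value of the element type (0 here), so such rows never pass the "> 50000" filter.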

I am wondering whether it is possible to set a primary key on the column index_array to make the query faster.

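If I add the column Nested_structure.index_array to the ORDER BY clause, it is treated as an array column of the outer table, not as the individual values of the index_array column inside Nested_structure.

For example: ORDER BY (timestamp, str_1, Nested_structure.index_array)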

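The algorithm is:

(1) find the position of the given value in index_array
(2) use the position from (1) to read the values from the other arrays

If index_array is sorted and the table knows it, then step (1) could be faster (for example, by using a binary search).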

Does anybody have an idea?

=============

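EDIT

Cardinality of the columns: str_1: 15,000,000 distinct values; index_array: 15,000 - 20,000 distinct values.

Assuming the distinct values of index_array are column_1, ..., column_15000, a denormalized table would have the following structure: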

timestamp,
str_1,
column_1a, <--  store values for metric_2
...
column_15000a, <--  store values for metric_2
column_1b, <--  store values for metric_3
...
column_15000b, <--  store values for metric_3

@Amos, could you show me what the table structure would look like if I used LowCardinality columns?

Answer 1:

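"I am wondering whether it is possible to set a primary key on the column index_array to make the query faster."

No, ClickHouse has no array indexes. If you put Nested_structure.index_array as the third element of the ORDER BY clause, it only sorts whole rows, with the array column compared as a single value. Note that [1,2] < [1,2,3].

You can instead denormalize the table so it has no nested columns, and give the first two columns LowCardinality types; that type is almost ready for production use.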
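
As a small illustration of how arrays are ordered as whole values, the sketch below sorts a few literal arrays. The values are made up; arrayJoin is only used to turn the literals into rows.

```sql
-- Each array is one sort value, compared element by element from the left;
-- a shorter array that is a prefix of a longer one sorts first.
SELECT arr
FROM
(
    SELECT arrayJoin([[1, 2, 3], [2], [1, 2], [1]]) AS arr
)
ORDER BY arr;
```

Because the array sorts as a unit, a sorting key containing Nested_structure.index_array cannot be used to locate a single element such as 7151 inside it.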

Update

It seems you won't benefit much from the LowCardinality type. What I meant was to do something like this:

CREATE TABLE IF NOT EXISTS table_name (
    timestamp Date,
    str_1 String,
    index_array UInt32,
    metric_2 UInt64,
    metric_3 UInt8,
    sign Int8 DEFAULT 1
) ENGINE = CollapsingMergeTree(sign) PARTITION BY (toYYYYMM(timestamp)) ORDER BY (timestamp, str_1, index_array)
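
With this flattened layout, the lookup from the question becomes a plain filter. This is a sketch rather than part of the original answer; it assumes every index_array value occurs at most once per original row, so that one matching row corresponds to one str_1 entry.

```sql
-- One row per (timestamp, str_1, index_array) element, so no array
-- function is needed to find the metric for index 7151.
SELECT count() AS count
FROM table_name
WHERE timestamp = '2017-09-01'
  AND index_array = 7151
  AND metric_2 > 50000;
```

Since the engine is CollapsingMergeTree, rows that have not been collapsed yet may still be counted; using sum(sign) instead of count() is a common way to account for that. How much the sorting key helps the index_array condition depends on the key order: with str_1 ahead of it at roughly 15,000,000 distinct values, the benefit may be small.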

You can still keep your old insert logic by doing this:

-- Target table: one row per nested element
CREATE TABLE IF NOT EXISTS table_name (
    timestamp Date,
    str_1 String,
    index_array UInt32,
    metric_2 UInt64,
    metric_3 UInt8,
    sign Int8 DEFAULT 1
) ENGINE = CollapsingMergeTree(sign) PARTITION BY (toYYYYMM(timestamp)) ORDER BY (timestamp, str_1, index_array);

-- Source table with the old nested layout; the Null engine stores nothing
CREATE TABLE IF NOT EXISTS source_table (
    timestamp Date,
    str_1 String,
    Nested_structure Nested (
        index_array UInt32,
        metric_2 UInt64,
        metric_3 UInt8
    ),
    sign Int8 DEFAULT 1
) ENGINE = Null;

-- Materialized view that unrolls each inserted row into table_name
CREATE MATERIALIZED VIEW data_pipe TO table_name AS
SELECT
    timestamp,
    str_1,
    Nested_structure.index_array AS index_array,
    Nested_structure.metric_2 AS metric_2,
    Nested_structure.metric_3 AS metric_3,
    sign
FROM source_table
ARRAY JOIN Nested_structure;

-- Inserts keep using the nested (array) form
INSERT INTO source_table VALUES (today(), 'fff', [1,2,3], [2,3,4], [3,4,5], 1);
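
To check that the pipeline fans out the nested insert, the flattened table can be read back. This is a sketch that assumes only the single sample row above has been inserted.

```sql
-- The one nested row inserted into source_table should appear here as
-- three rows, one per array position: (1, 2, 3), (2, 3, 4), (3, 4, 5)
-- for (index_array, metric_2, metric_3).
SELECT timestamp, str_1, index_array, metric_2, metric_3, sign
FROM table_name
WHERE str_1 = 'fff'
ORDER BY index_array;
```

source_table itself stays empty because the Null engine discards the data after the materialized view has processed it.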

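Comments:

Unfortunately, the columns of the nested table (index_array, metric_2, metric_3) have a fairly large cardinality (~15,000 distinct values), so I would need more than 10,000 columns in a denormalized table. By the way, where can I find the documentation for LowCardinality?

I meant applying LowCardinality only to the first two columns. The documentation is not available yet; the ETA is the end of this year.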
