Get top n rows without order by operator in Clickhouse

【Posted】:2019-12-27 12:04:39 【Question】:

I have a table:

CREATE TABLE StatsFull (
  Timestamp Int32,
  Uid String,
  ErrorCode Int32,
  Name String,
  Version String,
  Date Date MATERIALIZED toDate(Timestamp),
  Time DateTime MATERIALIZED toDateTime(Timestamp)
) ENGINE = MergeTree() PARTITION BY toMonday(Date)
ORDER BY Time SETTINGS index_granularity = 8192

I need to get the top 100 Names or the top 100 ErrorCodes, ranked by their number of unique Uids. The obvious query is:

SELECT Name, uniq(Uid) AS cnt FROM StatsFull
WHERE Time > subtractDays(toDate(now()), 1)
GROUP BY Name ORDER BY cnt DESC LIMIT 100

But there is too much data, so I created an AggregatingMergeTree view, because I only need to filter by date, not by hour.

CREATE MATERIALIZED VIEW StatsAggregated (
  Date Date,
  ProductName String,
  ErrorCode Int32,
  Name String,
  Version String,
  UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree() PARTITION BY toMonday(Date)
ORDER BY
  (
    Date,
    ProductName,
    ErrorCode,
    Name,
    Version
  ) SETTINGS index_granularity = 8192 AS
SELECT
  Date,
  ProductName,
  ErrorCode,
  Name,
  Version,
  uniqState(Uid) AS UniqUsers
FROM
  StatsFull
GROUP BY
  Date,
  ProductName,
  ErrorCode,
  Name,
  Version

My current query is:

SELECT Name FROM StatsAggregated 
WHERE Date > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100

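The query worked well at first, but the number of rows per day has grown, and now it uses too much memory. So I am looking for a way to optimize it.

I found the function topK(N)(column), which returns an array of the most frequent values in the specified column, but that is not what I need.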

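【Comments】:

Your example is very abstract. Can you provide a real example and the schema definitions? You need to consider choosing the right primary key, partitioning, and so on in MergeTree, or rely on the capabilities of AggregatingMergeTree.

【Answer 1】:

I would suggest the following:

Where possible, prefer SimpleAggregateFunction to AggregateFunction.

Use uniqCombined/uniqCombined64, which "consumes several times less memory" than uniq.

Reduce the number of dimensions in the aggregated view (it looks like ProductName and Version can be omitted):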

CREATE MATERIALIZED VIEW StatsAggregated (
  Date Date,
  Name String,
  ErrorCode Int32,
  UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
ORDER BY (Date, Name, ErrorCode) AS
SELECT Date, Name, ErrorCode, uniqState(Uid) AS UniqUsers
FROM StatsFull
GROUP BY Date, Name, ErrorCode;
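
As an illustration of the uniqCombined suggestion above, the same reduced view could store uniqCombined states instead of uniq states. This is only a sketch: the view name StatsAggregatedCombined is hypothetical, and how much memory it actually saves depends on your data.

```sql
/* Hypothetical variant of the view above that stores uniqCombined states */
CREATE MATERIALIZED VIEW StatsAggregatedCombined (
  Date Date,
  Name String,
  ErrorCode Int32,
  UniqUsers AggregateFunction(uniqCombined, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
ORDER BY (Date, Name, ErrorCode) AS
SELECT Date, Name, ErrorCode, uniqCombinedState(Uid) AS UniqUsers
FROM StatsFull
GROUP BY Date, Name, ErrorCode;

/* States written with uniqCombinedState must be read with uniqCombinedMerge */
SELECT Name, uniqCombinedMerge(UniqUsers) AS uniqUsers
FROM StatsAggregatedCombined
WHERE Date > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY uniqUsers DESC LIMIT 100;
```

The state function and the merge function have to match: states produced by uniqState cannot be merged with uniqCombinedMerge, so switching functions means rebuilding the view.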

Add extra "heuristic" constraints to the WHERE and HAVING clauses of the resulting query:

SELECT Name, uniqMerge(UniqUsers) AS uniqUsers
FROM StatsAggregated
WHERE Date > subtractDays(toDate(now()), 1)
  AND ErrorCode = 0 /* apply any other conditions that narrow the result set as much as possible */
GROUP BY Name
HAVING uniqUsers > 12345 /* <-- 12345 is a "heuristic" threshold that you estimate from your data */
ORDER BY uniqUsers DESC LIMIT 100

Use sampling:

/* Raw-table */

CREATE TABLE StatsFull (
 /* .. */
) ENGINE = MergeTree() 
PARTITION BY toMonday(Date)
SAMPLE BY xxHash32(Uid) /* < -- */
ORDER BY (Time, xxHash32(Uid))

/* Sampling the raw table can speed up short-term queries (periods of several hours, etc.) */

SELECT Name, uniq(Uid) AS cnt
FROM StatsFull
SAMPLE 0.05 /* <-- */
WHERE Time > subtractHours(now(), 6) /* <-- hours-period */
GROUP BY Name 
ORDER BY cnt DESC LIMIT 100


/* Aggregated-table */

CREATE MATERIALIZED VIEW StatsAggregated (
  Date Date,
  ProductName String,
  ErrorCode Int32,
  Name String,
  Version String,
  UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree() 
PARTITION BY toMonday(Date)
SAMPLE BY intHash32(toInt32(Date)) /* <-- not sure this is a good choice of sampling key */
ORDER BY (intHash32(toInt32(Date)), ProductName, ErrorCode, Name, Version) AS
SELECT /* .. */ FROM StatsFull GROUP BY /* .. */

/* Sampling the aggregated table can speed up long-term queries (periods of several weeks, months, etc.) */

SELECT Name 
FROM StatsAggregated 
SAMPLE 0.1 /* < -- */
WHERE Date > subtractMonths(toDate(now()), 3) /* <-- months-period */
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100
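
Use distributed query processing. Splitting the data into several parts (shards) allows it to be processed in a distributed way; the distributed_group_by_no_merge query setting can improve processing performance further.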
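
A rough sketch of how the distributed variant might look, assuming a hypothetical Distributed table StatsAggregatedDistributed on top of the shards and a sharding key based on Name, so that all rows for a given Name live on one shard. Without that guarantee the per-shard counts for the same Name would not be combined and the result would be wrong.

```sql
/* StatsAggregatedDistributed is a hypothetical Distributed table over the shards */
SELECT Name, uniqUsers
FROM
(
  /* Each shard aggregates its own data and returns its own top 100 */
  SELECT Name, uniqMerge(UniqUsers) AS uniqUsers
  FROM StatsAggregatedDistributed
  WHERE Date > subtractDays(toDate(now()), 1)
  GROUP BY Name
  ORDER BY uniqUsers DESC
  LIMIT 100
  SETTINGS distributed_group_by_no_merge = 1
)
/* With the setting at 1 the shard results are only concatenated,
   so the final ordering and limit are applied here */
ORDER BY uniqUsers DESC
LIMIT 100;
```

The outer query is there because, as far as I understand the setting, a value of 1 skips the final ORDER BY and LIMIT on the initiating server; check the documentation for your ClickHouse version before relying on this.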

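【Comments】:

I will try your suggestions. I cannot reduce the number of columns in the aggregated view, because I need to filter by ProductName and Version. It looks like I need sampling.

【Answer 2】:

If you need to convert an array into rows, you can use arrayJoin: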

SELECT Name, arrayJoin(topK(100)(Count)) AS top100_Count FROM Stats
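
To see what arrayJoin does in isolation, here is a small self-contained sketch that uses the built-in numbers() table function instead of real data; the alias value is only illustrative.

```sql
/* topK returns a single array; arrayJoin expands it into one row per element */
SELECT arrayJoin(topK(3)(number % 10)) AS value
FROM numbers(1000);
```

Note that topK ranks values by how often they occur in the column, not by a distinct count of another column, which is why the question says it does not fit the "top Names by unique Uid" case directly.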
