■ Multidimensional information systems

■ Data warehousing

■ Databases

■ Decision support systems (DSS)

■ Executive information systems (EIS)

■ Business intelligence (BI)

■ Business analytics

■ Data mining

■ Data visualization

■ Knowledge management (KM)

What is ClickHouse?

ClickHouse® is a fast, open-source, column-oriented database management system (DBMS) for online analytical processing (OLAP). It lets users generate analytical reports in real time using SQL queries.

ClickHouse is reported to run 100-1000x faster than traditional database management systems on analytical workloads, processing hundreds of millions to over a billion rows and tens of gigabytes of data per server per second. It has a widespread user base around the globe and has received praise for its reliability, ease of use, and fault tolerance.



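Parser and Interpreter are two very important groups of interfaces. The Parser turns a SQL statement into an AST (abstract syntax tree) by recursive descent, and each type of SQL statement is handled by its own parser implementation class. The Interpreter interprets the AST and goes on to build the query execution pipeline. It plays a role much like a service layer: it ties the whole query process together and gathers the resources it needs according to the interpreter type. It first parses the AST object, then runs the "business logic" (branching, setting parameters, calling interfaces, and so on), and finally returns an IBlock object, setting up a query execution pipeline in the form of threads.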
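As a rough illustration of the Parser/Interpreter split described above, the following Kotlin sketch parses a tiny arithmetic expression into an AST by recursive descent and then interprets it. It is not ClickHouse code: the names Node, Num, BinOp, Parser, and interpret are hypothetical, the grammar is arithmetic rather than SQL, and the real interpreter builds an execution pipeline instead of returning a number.

```kotlin
// Toy AST: a number literal or a binary operation (hypothetical types, not ClickHouse's).
sealed class Node
data class Num(val value: Long) : Node()
data class BinOp(val op: Char, val left: Node, val right: Node) : Node()

// Recursive-descent parser for expressions such as "1 + 2 * (3 + 4)".
class Parser(private val text: String) {
    private var pos = 0

    fun parse(): Node {
        val node = parseSum()
        skipSpaces()
        require(pos == text.length) { "unexpected input at position $pos" }
        return node
    }

    // sum := product ('+' product)*
    private fun parseSum(): Node {
        var left = parseProduct()
        while (peek() == '+') {
            pos++
            left = BinOp('+', left, parseProduct())
        }
        return left
    }

    // product := atom ('*' atom)*
    private fun parseProduct(): Node {
        var left = parseAtom()
        while (peek() == '*') {
            pos++
            left = BinOp('*', left, parseAtom())
        }
        return left
    }

    // atom := number | '(' sum ')'
    private fun parseAtom(): Node {
        if (peek() == '(') {
            pos++
            val inner = parseSum()
            require(peek() == ')') { "expected ')' at position $pos" }
            pos++
            return inner
        }
        val start = pos
        while (pos < text.length && text[pos].isDigit()) pos++
        require(pos > start) { "expected a number at position $start" }
        return Num(text.substring(start, pos).toLong())
    }

    private fun skipSpaces() {
        while (pos < text.length && text[pos] == ' ') pos++
    }

    // Returns the next non-space character, or null at end of input.
    private fun peek(): Char? {
        skipSpaces()
        return if (pos < text.length) text[pos] else null
    }
}

// The interpreter walks the AST and carries out the work it describes.
fun interpret(node: Node): Long = when (node) {
    is Num -> node.value
    is BinOp -> when (node.op) {
        '+' -> interpret(node.left) + interpret(node.right)
        '*' -> interpret(node.left) * interpret(node.right)
        else -> error("unknown operator ${node.op}")
    }
}

fun main() {
    val ast = Parser("1 + 2 * (3 + 4)").parse()
    println(interpret(ast)) // prints 15
}
```

Each grammar rule maps to one parse function, which is the same idea as having a separate parser implementation per statement type.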


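Table engines are a distinguishing feature of ClickHouse, and as mentioned earlier, ClickHouse has many of them. Each table engine is implemented by its own subclass. Table engines are exposed through the IStorage interface, which defines DDL methods (such as ALTER, RENAME, OPTIMIZE, and DROP) as well as read and write methods; these handle data definition, querying, and writing, respectively.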
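The idea of one interface with many engine implementations can be sketched in Kotlin as follows. This is only an analogy: the real IStorage is a C++ class with different signatures, and the Storage and MemoryStorage names, the row representation, and the method shapes here are assumptions made for illustration.

```kotlin
// Hypothetical Kotlin analogue of a table-engine interface (not the real IStorage API).
interface Storage {
    val name: String
    fun write(rows: List<Map<String, Any?>>)          // data ingestion
    fun read(columns: List<String>): List<List<Any?>> // query: one list per requested column
    fun rename(newName: String)                       // DDL
    fun drop()                                        // DDL
}

// One possible "engine": keeps each column in memory as a growable list.
class MemoryStorage(override var name: String, private val schema: List<String>) : Storage {
    private val data: Map<String, MutableList<Any?>> =
        schema.associateWith { mutableListOf<Any?>() }

    override fun write(rows: List<Map<String, Any?>>) {
        for (row in rows) {
            for (column in schema) data.getValue(column).add(row[column])
        }
    }

    override fun read(columns: List<String>): List<List<Any?>> =
        columns.map { data.getValue(it).toList() }

    override fun rename(newName: String) {
        name = newName
    }

    override fun drop() {
        data.values.forEach { it.clear() }
    }
}

fun main() {
    val hits: Storage = MemoryStorage("hits", listOf("WatchID", "Title"))
    hits.write(
        listOf(
            mapOf("WatchID" to 1L, "Title" to "a"),
            mapOf("WatchID" to 2L, "Title" to "b")
        )
    )
    println(hits.read(listOf("Title"))) // prints [[a, b]]
}
```

Callers depend only on the interface, so a different engine can be swapped in without changing the query code.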




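Column and Field are the most basic mapping units for ClickHouse data. As a fully column-oriented database, ClickHouse stores data by column, and a column of data in memory is represented by a Column object. Column is split into an interface and its implementations. The IColumn interface defines methods for the various relational operations on data, such as insertRangeFrom and insertFrom for inserting data, cut for pagination, and filter for filtering. The concrete implementations depend on the data type and are provided by the corresponding classes, such as ColumnString, ColumnArray, and ColumnTuple. In most cases ClickHouse operates on whole columns, but there are exceptions. To work with a single concrete value (that is, one row of one column), a Field object is used; a Field represents a single value. Unlike the generalized design of Column, Field uses an aggregation design: a Field object internally aggregates 13 data types, including Null, UInt64, String, and Array, together with their handling logic.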
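The contrast between whole-column operations and single-value access can be sketched as below. This is a simplified Kotlin analogy, not the C++ IColumn or Field definitions: the Column, StringColumn, and Field types are hypothetical, only three of the 13 value types are modeled, and the method signatures are assumptions.

```kotlin
// A Field holds one single value; a sealed class aggregates the supported value types.
sealed class Field {
    object Null : Field()
    data class UInt64(val value: ULong) : Field()
    data class Str(val value: String) : Field()
}

// Column interface: operations work on a whole column at once.
interface Column {
    val size: Int
    operator fun get(row: Int): Field         // access a single value as a Field
    fun insertFrom(src: Column, row: Int)     // append one value taken from another column
    fun cut(offset: Int, length: Int): Column // slice, for example for pagination
    fun filter(mask: BooleanArray): Column    // keep the rows where mask is true
}

// One concrete implementation per data type, here for strings.
class StringColumn(values: List<String> = emptyList()) : Column {
    private val data = values.toMutableList()

    override val size: Int get() = data.size

    override fun get(row: Int): Field = Field.Str(data[row])

    override fun insertFrom(src: Column, row: Int) {
        val field = src[row]
        require(field is Field.Str) { "type mismatch" }
        data.add(field.value)
    }

    override fun cut(offset: Int, length: Int): Column =
        StringColumn(data.subList(offset, offset + length))

    override fun filter(mask: BooleanArray): Column {
        require(mask.size == data.size) { "mask size must match column size" }
        return StringColumn(data.filterIndexed { i, _ -> mask[i] })
    }
}

fun main() {
    val titles = StringColumn(listOf("a", "b", "c", "d"))
    val kept = titles.filter(booleanArrayOf(true, false, true, true))
    println((0 until kept.size).map { kept[it] }) // prints [Str(value=a), Str(value=c), Str(value=d)]
    println(titles.cut(1, 2).size)                // prints 2
}
```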


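Data operations inside ClickHouse are performed on Block objects, in a streaming fashion. Although Column and Field form the basic mapping units of data, they lack some information needed for real operations, such as the data type and the column name. ClickHouse therefore introduces the Block object, which can be viewed as a subset of a table. In essence, a Block is made up of triples of data object, data type, and column name, that is, Column, DataType, and a column-name string. Column provides the ability to read data, and DataType knows how to serialize and deserialize it, so Block builds a further layer of abstraction and encapsulation on top of these objects. This simplifies usage: a whole series of data operations can be carried out through the Block object alone. In the actual implementation, Block does not aggregate Column and DataType objects directly but references them indirectly through ColumnWithTypeAndName objects.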
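A minimal sketch of the (column, type, name) triple might look like this in Kotlin. The type names follow the ones in the text, but everything else is an assumption: columns are plain lists, the DataType methods are simplified to string conversion, and serializeRow is a hypothetical helper that does not exist in ClickHouse.

```kotlin
// Simplified stand-in for a data type that knows how to (de)serialize values.
interface DataType {
    val name: String
    fun serialize(value: Any?): String
    fun deserialize(text: String): Any?
}

object Int64Type : DataType {
    override val name = "Int64"
    override fun serialize(value: Any?): String = (value as Long).toString()
    override fun deserialize(text: String): Any? = text.toLong()
}

object StringType : DataType {
    override val name = "String"
    override fun serialize(value: Any?): String = value as String
    override fun deserialize(text: String): Any? = text
}

// The triple: column data, its type, and its name.
data class ColumnWithTypeAndName(val column: List<Any?>, val type: DataType, val name: String)

// A Block is a set of equally long named columns, that is, a slice of a table.
class Block(private val columns: List<ColumnWithTypeAndName>) {
    init {
        require(columns.map { it.column.size }.distinct().size <= 1) {
            "all columns in a block must have the same length"
        }
    }

    val rows: Int get() = columns.firstOrNull()?.column?.size ?: 0

    fun getByName(name: String): ColumnWithTypeAndName = columns.first { it.name == name }

    // Hypothetical helper: serializes one row using each column's DataType.
    fun serializeRow(row: Int): String =
        columns.joinToString(",") { it.type.serialize(it.column[row]) }
}

fun main() {
    val block = Block(
        listOf(
            ColumnWithTypeAndName(listOf(1L, 2L), Int64Type, "WatchID"),
            ColumnWithTypeAndName(listOf("home", "cart"), StringType, "Title")
        )
    )
    println(block.rows)            // prints 2
    println(block.serializeRow(1)) // prints 2,cart
}
```

Keeping the type and name next to the data is what lets code that receives a Block work without any outside schema.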





























Key Properties of OLAP Scenario

The vast majority of requests are for read access.

Data is updated in fairly large batches (> 1000 rows), not by single rows; or it is not updated at all.

Data is added to the DB but is not modified.

For reads, quite a large number of rows are extracted from the DB, but only a small subset of columns.

Tables are “wide,” meaning they contain a large number of columns.

Queries are relatively rare (usually hundreds of queries per second per server, or fewer).

For simple queries, latencies around 50 ms are allowed.

Column values are fairly small: numbers and short strings (for example, 60 bytes per URL).

Requires high throughput when processing a single query (up to billions of rows per second per server).

Transactions are not necessary.

Low requirements for data consistency.

There is one large table per query. All tables are small, except for one.

A query result is significantly smaller than the source data. In other words, data is filtered or aggregated, so the result fits in a single server’s RAM.

It is easy to see that the OLAP scenario is very different from other popular scenarios (such as OLTP or Key-Value access). So it does not make sense to try to use OLTP or a Key-Value DB for processing analytical queries if you want to get decent performance. For example, if you try to use MongoDB or Redis for analytics, you will get very poor performance compared to OLAP databases.
















The different meanings of OLAP:

OLAP: On-Line Analytical Processing

OLAP enables users to access information from multidimensional data warehouses almost instantly, to view information in any way they like, and to cleanly specify and carry out sophisticated calculations.

The Functional Requirements of OLAP Systems

In short, the functional requirements for OLAP are as follows:

■ Rich dimensional structuring with hierarchical referencing

■ Efficient specification of dimensions and dimensional calculations

■ Separation of structure and representation

■ Flexibility

■ Sufficient speed to support ad hoc analysis

■ Multi-user support

These requirements derive from the timeless goals of information processing, the distinct optimization requirements for decision support versus transaction processing, the distinct functional requirements for descriptive modeling versus other recursive stages of decision-oriented information processing, the application range distinction between different layers of computing architectures, and the challenges frequently found within the boundaries of Global 2000 corporations.


In a “normal” row-oriented DBMS, data is stored in this order:

In other words, all the values related to a row are physically stored next to each other.

Examples of row-oriented DBMSs are MySQL, Postgres, and MS SQL Server.

In a column-oriented DBMS, data is stored like this:



package ck

// Column-oriented layout: one list per column, so all values of a column are stored together.
class ColumnStorageTable {
    var WatchID: List<String> = listOf()
    var JavaEnable: List<String> = listOf()
    var Title: List<String> = listOf()
    var GoodEvent: List<String> = listOf()
    var EventTime: List<String> = listOf()
}

class ColumnStorageTableData {
    var columnStorageTable = ColumnStorageTable()
}

// Row-oriented layout: one object per row, so all values of a row are stored together.
class RowStorageTable {
    var WatchID: String = ""
    var JavaEnable: String = ""
    var Title: String = ""
    var GoodEvent: String = ""
    var EventTime: String = ""
}

class RowStorageTableData {
    val lines = listOf<RowStorageTable>()
}




1. Input/output





2. CPU







See the difference?


For an analytical query, only a small number of table columns need to be read. In a column-oriented database, you can read just the data you need. For example, if you need 5 columns out of 100, you can expect a 20-fold reduction in I/O.

Since data is read in packets, it is easier to compress. Data in columns is also easier to compress. This further reduces the I/O volume.

Due to the reduced I/O, more data fits in the system cache.

For example, the query “count the number of records for each advertising platform” requires reading one “advertising platform ID” column, which takes up 1 byte uncompressed. If most of the traffic was not from advertising platforms, you can expect at least 10-fold compression of this column. When using a quick compression algorithm, data decompression is possible at a speed of at least several gigabytes of uncompressed data per second. In other words, this query can be processed at a speed of approximately several billion rows per second on a single server. This speed is actually achieved in practice.
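The figures in this part can be checked with quick arithmetic. The sketch below assumes that all columns have the same width and takes 2 GB/s as a stand-in for "several gigabytes of uncompressed data per second"; both are assumptions made for illustration, and the function names are hypothetical.

```kotlin
// I/O reduction when a query touches only some of the columns,
// assuming every column has the same width (a simplification).
fun ioReductionFactor(totalColumns: Int, columnsRead: Int): Double =
    totalColumns.toDouble() / columnsRead

// Rows per second when the limit is how fast uncompressed column bytes are produced.
fun rowsPerSecond(uncompressedBytesPerSecond: Double, bytesPerValue: Int): Double =
    uncompressedBytesPerSecond / bytesPerValue

fun main() {
    // Reading 5 columns out of 100.
    println(ioReductionFactor(100, 5)) // prints 20.0
    // Assumed figure: 2 GB/s of decompressed output, 1-byte platform ID per row.
    println(rowsPerSecond(2e9, 1))     // prints 2.0E9
}
```

With a 1-byte column, each gigabyte per second of decompressed output corresponds to a billion rows per second, which is where the "several billion rows per second" estimate comes from.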


Since executing a query requires processing a large number of rows, it helps to dispatch all operations for entire vectors instead of for separate rows, or to implement the query engine so that there is almost no dispatching cost. If you do not do this, with any half-decent disk subsystem, the query interpreter inevitably stalls the CPU. It makes sense to both store data in columns and process it, when possible, by columns.

There are two ways to do this:

(1) A vector engine. All operations are written for vectors, instead of for separate values. This means you do not need to call operations very often, and dispatching costs are negligible. Operation code contains an optimized inner loop.

(2) Code generation. The code generated for the query has all the indirect calls in it.

This is not done in “normal” databases, because it does not make sense when running simple queries. However, there are exceptions. For example, MemSQL uses code generation to reduce latency when processing SQL queries. (For comparison, analytical DBMSs require optimization of throughput, not latency.)

Note that for CPU efficiency, the query language must be declarative (SQL or MDX), or at least a vector language (J, K). The query should only contain implicit loops, allowing for optimization.
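The difference in dispatching cost between row-at-a-time and vectorized execution can be illustrated with a small Kotlin sketch. The ScalarOp and VectorOp interfaces are hypothetical and only model the number of indirect calls; this is not how ClickHouse's C++ engine is written.

```kotlin
// Row-at-a-time: one indirect call per value.
fun interface ScalarOp {
    fun apply(a: Long, b: Long): Long
}

fun addPerRow(a: LongArray, b: LongArray, op: ScalarOp): LongArray {
    require(a.size == b.size) { "columns must have the same length" }
    return LongArray(a.size) { i -> op.apply(a[i], b[i]) } // dispatches a.size times
}

// Vectorized: one indirect call per column chunk; the tight loop lives inside the operation.
fun interface VectorOp {
    fun apply(a: LongArray, b: LongArray, out: LongArray)
}

val vectorAdd = VectorOp { a, b, out ->
    for (i in a.indices) out[i] = a[i] + b[i]
}

fun addVectorized(a: LongArray, b: LongArray, op: VectorOp): LongArray {
    require(a.size == b.size) { "columns must have the same length" }
    val out = LongArray(a.size)
    op.apply(a, b, out) // a single dispatch for the whole chunk
    return out
}

fun main() {
    val a = longArrayOf(1, 2, 3)
    val b = longArrayOf(10, 20, 30)
    println(addPerRow(a, b, ScalarOp { x, y -> x + y }).toList()) // prints [11, 22, 33]
    println(addVectorized(a, b, vectorAdd).toList())              // prints [11, 22, 33]
}
```

Both versions compute the same result; the vectorized one pays the dispatch cost once per chunk instead of once per row.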


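Using stock ClickHouse at large data volumes runs into many problems:

1. Stability: out of the box, ClickHouse is not very stable. For example, under high-frequency writes, errors such as "too many parts" occur often; a single slow query can bring down the whole cluster; and node OOMs and hung DDL requests are fairly common. In addition, because of a flaw in ClickHouse's original design, the ZooKeeper bottleneck that grows with data volume has always been present and cannot be solved well. WeChat later made several rounds of kernel changes before ClickHouse gradually became stable at massive data volumes, and some of the issues were contributed back to the community.

2. High barrier to entry: a system built by someone who knows ClickHouse well may perform 3 times or even 10 times better than one built by someone who does not, and some scenarios additionally require targeted kernel optimization.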


Addressing ClickHouse's usability and stability properly requires ecosystem support. The overall ecosystem solution has the following important parts:


1. Data gateway: responsible for intelligent caching, interception of large queries, and rate limiting;

2. Sinker: offline/online high-performance ingestion layer, responsible for peak shaving, hash routing, traffic prioritization, and write rate control;

3. Cluster management, data balancing, disaster recovery failover, and data migration;

4. Monitor: responsible for monitoring and alerting, sub-health detection, and query health analysis; it can work in tandem with Manager;




(1) Daily data throughput is in the billions. For real-time ingestion, token and back-pressure mechanisms handle traffic peaks well.

(2) Data is ingested through hash routing, so joins can run directly once the data has landed. No shuffle is needed, which makes join queries faster, and ingestion also achieves exactly-once semantics.

(3) For offline synchronization, Parts are built in advance by pre-merging and then shipped to the online serving nodes. This is essentially read/write separation, and it makes it easier to meet the requirements of high-consistency, high-throughput scenarios.
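Why hash routing removes the need for a shuffle can be shown with a small Kotlin sketch: if both sides of a join are routed by the same key and the same hash function, matching rows always land on the same shard and each shard can join locally. The function names and the use of String.hashCode are assumptions for illustration, not the scheme used in the system described above.

```kotlin
// Picks a shard from the join key so that equal keys always land on the same shard.
fun shardFor(key: String, shardCount: Int): Int {
    require(shardCount > 0) { "shardCount must be positive" }
    return key.hashCode().mod(shardCount) // mod() is never negative for a positive divisor
}

// Distributes (key, payload) rows over shards.
fun <T> route(rows: List<Pair<String, T>>, shardCount: Int): Map<Int, List<Pair<String, T>>> =
    rows.groupBy { shardFor(it.first, shardCount) }

// Joins events with users shard by shard; no row ever moves between shards.
fun coLocatedJoin(
    events: List<Pair<String, String>>,
    users: List<Pair<String, String>>,
    shardCount: Int
): List<String> {
    val eventShards = route(events, shardCount)
    val userShards = route(users, shardCount)
    val result = mutableListOf<String>()
    for (shard in 0 until shardCount) {
        val localUsers = userShards[shard].orEmpty().toMap()
        for ((userId, action) in eventShards[shard].orEmpty()) {
            val name = localUsers[userId] ?: continue // inner join: skip unmatched keys
            result.add("$name:$action")
        }
    }
    return result
}

fun main() {
    val events = listOf("u1" to "click", "u2" to "view", "u1" to "buy")
    val users = listOf("u1" to "Alice", "u2" to "Bob")
    println(coLocatedJoin(events, users, 4).sorted()) // prints [Alice:buy, Alice:click, Bob:view]
}
```

The join result is the same for any shard count, because the routing only decides where the work happens, not which rows match.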


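ClickHouse's overall design philosophy requires using specific features for specific scenarios (choosing a suitable table engine, sensible partitioning and sharding, hot/cold separation, splitting the sort key, sensible compression such as LZ4, and so on) in order to get the best possible performance.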

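The jointly built ClickHouse ecosystem is applied to the following typical scenarios:

1. BI analysis/dashboards: because scientific exploration is ad hoc, it is hard to serve with pre-built aggregates, and the Hadoop ecosystem can only deliver results in hours to minutes. With ClickHouse, on a single table at the trillion scale, P95 query latency is within 5 seconds. A data scientist who wants to run a validation can now do so very quickly.

2. A/B experiment platform: in the early days, all experiment statistics had to be pre-aggregated the night before, and the results could only be queried the next day. With single tables at the scale of hundreds of billions per day and real-time joins on large tables, the leap from offline to real-time analysis brings P95 response time under 3 s, making A/B conclusions more accurate, experiment cycles shorter, and model validation faster.

3. Real-time feature computation: although ClickHouse is widely considered not to be good at real-time workloads, after optimization it can handle scan volumes in the billions with end-to-end latency under 3 seconds and P95 response time close to 1 second.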


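The cluster has 1,000 machines and PB-scale data, serves over a million queries per day, and reaches a single-cluster TPS in the hundreds of millions, while average query time is on the order of seconds.

Compared with the earlier Hadoop ecosystem, the ClickHouse OLAP ecosystem improves performance by more than 10x and, through unified stream and batch processing, provides a more stable and reliable service, making business decisions faster and experiment conclusions more accurate.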


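ClickHouse's original design and shared-nothing architecture cannot handle second-level scaling and join scenarios well; a cloud-native data warehouse with storage-compute separation can solve this problem.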


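2. Stability: no ZooKeeper bottleneck, easy read/write separation, cross-region disaster recovery;


4. Full-featured: focus on query optimization and cache strategy, with support for efficient multi-table joins;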


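ClickHouse has distributed capabilities and naturally supports data sharding. Sharding splits data horizontally and is an effective way to overcome storage and query bottlenecks when dealing with massive data. Unlike some other distributed systems, ClickHouse does not have highly automated sharding. Instead it provides the concepts of the local table and the distributed table. A local table corresponds to one shard of the data. A distributed table stores no data itself; it is an access proxy for local tables, playing a role similar to sharding middleware. Through a distributed table, multiple shards can be accessed by proxy, which makes distributed queries possible.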
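The proxy role of a distributed table can be sketched as follows. This is a toy in-memory model in Kotlin, not the Distributed engine: the LocalTable and DistributedTable classes are hypothetical, rows are plain numbers, and the sharding rule (key modulo shard count) is an assumption.

```kotlin
// A local table holds one shard of the data.
class LocalTable {
    val rows = mutableListOf<Long>()
    fun count(predicate: (Long) -> Boolean): Long = rows.count(predicate).toLong()
}

// The distributed table stores nothing itself; it forwards work to the local
// tables and merges their partial results.
class DistributedTable(private val shards: List<LocalTable>) {
    init {
        require(shards.isNotEmpty()) { "at least one shard is required" }
    }

    // Sharding is explicit: the caller supplies the sharding key.
    fun insert(value: Long, shardingKey: Long) {
        shards[shardingKey.mod(shards.size)].rows.add(value)
    }

    // Fan out the query to every shard, then combine the partial counts.
    fun count(predicate: (Long) -> Boolean): Long = shards.sumOf { it.count(predicate) }
}

fun main() {
    val locals = List(3) { LocalTable() }
    val table = DistributedTable(locals)
    for (v in 1L..10L) table.insert(v, shardingKey = v)
    println(locals.map { it.rows.size })  // prints [3, 4, 3]
    println(table.count { it % 2 == 0L }) // prints 5
}
```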

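ClickHouse sparse index

The structure diagram on the left shows /var/lib/clickhouse/data/default(schema)/(tablename)/, which contains .bin files (column data), .mrk files (block offsets), and primary.idx (the primary key index).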




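Every 8192 rows of data form one block. The primary index takes one row of the primary key columns every 8192 rows and records which block it belongs to. At query time, if an index is available, it is used to locate the block, and then the corresponding entry in the .mrk file is looked up. The .mrk file records the physical offset of each block's data within the column's .bin file. The data is loaded into memory and then filtered in parallel.

The shorter the index key, the less memory the index occupies and the faster sorting is, but the lower its selectivity, which hurts lookups. A longer index key is more selective and better for lookups, but the index takes more space in memory.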
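The lookup described above can be sketched in Kotlin as follows. This is an in-memory model under stated assumptions: keys are unique and sorted, the "mark" of a block is simply its starting row offset rather than a byte offset in a .bin file, and the class and method names are hypothetical.

```kotlin
const val GRANULE = 8192

// Sparse primary index over a sorted key column: one index entry per 8192 rows.
class SparseIndexedColumn(private val sortedKeys: LongArray) {
    // First key of every block; block g starts at row offset g * GRANULE.
    private val index: LongArray =
        LongArray((sortedKeys.size + GRANULE - 1) / GRANULE) { sortedKeys[it * GRANULE] }

    // Binary search for the last block whose first key is <= key.
    // Returns null when the key is smaller than every indexed key.
    fun blockFor(key: Long): Int? {
        var lo = 0
        var hi = index.size - 1
        var found = -1
        while (lo <= hi) {
            val mid = (lo + hi) ushr 1
            if (index[mid] <= key) {
                found = mid
                lo = mid + 1
            } else {
                hi = mid - 1
            }
        }
        return if (found >= 0) found else null
    }

    // Only the rows of the selected block are scanned.
    fun contains(key: Long): Boolean {
        val block = blockFor(key) ?: return false
        val from = block * GRANULE
        val to = minOf(from + GRANULE, sortedKeys.size)
        for (i in from until to) {
            if (sortedKeys[i] == key) return true
        }
        return false
    }
}

fun main() {
    // 20000 even keys: 0, 2, 4, ...; three blocks starting at keys 0, 16384, 32768.
    val column = SparseIndexedColumn(LongArray(20000) { it * 2L })
    println(column.blockFor(20000L))  // prints 1
    println(column.contains(20000L))  // prints true
    println(column.contains(20001L))  // prints false
}
```

The index here holds 3 entries for 20000 rows, which shows the trade-off: it is small enough to keep in memory, but it can only narrow a lookup down to one block, which must then be scanned.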





Bitsets, also called bitmaps, are commonly used as fast data structures. Unfortunately, they can use too much memory. To compensate, we often use compressed bitmaps.

Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. 



