ClickHouse in Practice: A Summary of How ClickHouse Handles NULL Values

Posted 2021-08-11 扫地增

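1 Background

At work, when we use a Spark Dataset to batch-insert data into a ClickHouse table, the load job often fails because some field is NULL. When a table is created in ClickHouse the usual way, we cannot guarantee that every field of every inserted row is non-NULL. Against this background, this post records how to deal with NULL in ClickHouse from three angles: table creation, queries, and joins.

2 Handling NULL at table creation

2.1 Problem description

We start by creating a table the usual way, following the CREATE TABLE syntax from the official documentation: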

CREATE TABLE dw.dw_user_phone_detail(
  `user_id` String COMMENT 'user id',
  `user_name` String COMMENT 'user name',
  `phone` String COMMENT 'user phone number',
  `create_time` DateTime COMMENT 'time the row was loaded',
  `update_time` DateTime COMMENT 'time the row was last modified'
) 
ENGINE = MergeTree 
PARTITION BY user_id 
ORDER BY user_id
SETTINGS index_granularity = 8192;

Inserting data that contains NULL values into this table now fails with:

DB::Exception: Expression returns value NULL, that is out of range of type String, at: null)

ClickHouse is telling us that the expression returned NULL, which is outside the range of type String. In other words, ClickHouse data types do not accept NULL by default.

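2.2 Solutions

After some research we found that ClickHouse provides a dedicated type for NULL: Nullable. That leaves two ways to approach the problem:

On the ClickHouse side: make the column types accept NULL.
On the data side: remove NULLs from the data by substituting default values.

2.2.1 Solving it on the ClickHouse side

Solving it on the ClickHouse side comes down to using the Nullable type, so let us look at that type first.

What is Nullable
1. Usage: Nullable(TypeName) declares that a column of the given type can also store NULL.
2. It allows a special marker (NULL) for a missing value to be stored alongside the normal values of TypeName. For example, a Nullable(Int8) column stores Int8 values, and rows without a value store NULL. TypeName cannot be the composite types Array or Tuple.
3. Composite types can contain Nullable values, for example Array(Nullable(Int8)).
4. A Nullable column cannot be part of a table index. This means a primary-key column must never be declared Nullable, because the primary key in ClickHouse is itself indexed.
5. Unless the ClickHouse server configuration says otherwise, NULL is the default value of any Nullable type.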
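
The type rules above can be checked directly with toTypeName. This is a minimal sketch that assumes only a ClickHouse client session; the column aliases are made up for illustration.

```sql
SELECT
    toTypeName(CAST(1 AS Nullable(Int8))) AS scalar_type,  -- Nullable(Int8)
    toTypeName([1, NULL, 3])              AS array_type,   -- Array(Nullable(UInt8))
    toTypeName(NULL)                      AS null_type;    -- Nullable(Nothing)
```

The array literal shows rule 3 in action: the NULL element makes the element type Nullable, while the array itself is not.
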
Applying Nullable:

CREATE TABLE dw.dw_user_phone_detail(
  `user_id` String COMMENT 'user id',
  `user_name` Nullable(String) COMMENT 'user name',
  `phone` Nullable(String) COMMENT 'user phone number',
  `create_time` Nullable(DateTime) COMMENT 'time the row was loaded',
  `update_time` Nullable(DateTime) COMMENT 'time the row was last modified'
) 
ENGINE = MergeTree 
PARTITION BY user_id 
ORDER BY user_id
SETTINGS index_granularity = 8192;

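Note: the primary key must not be wrapped in Nullable, otherwise it violates point 4 above. The primary key therefore cannot contain NULL, and declaring it Nullable makes the CREATE TABLE statement fail.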
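
With the Nullable columns in place, a row with missing fields can be inserted and the gaps handled at query time. The following is a sketch that assumes the Nullable version of dw.dw_user_phone_detail has been created; the sample values are invented for illustration.

```sql
-- Insert one row whose user_name and update_time are missing.
INSERT INTO dw.dw_user_phone_detail (user_id, user_name, phone, create_time, update_time)
VALUES ('u001', NULL, '13800000000', now(), NULL);

SELECT
    user_id,
    isNull(user_name)      AS name_missing,    -- 1 when user_name is NULL
    ifNull(user_name, '@') AS name_or_default, -- '@' substituted for NULL
    toTypeName(user_name)  AS name_type        -- Nullable(String)
FROM dw.dw_user_phone_detail;

-- List the declared type of every column to see which ones are Nullable.
SELECT name, type
FROM system.columns
WHERE database = 'dw' AND table = 'dw_user_phone_detail';
```
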

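2.2.2 Solving it on the data side

Compared with the ClickHouse-side fix, removing NULLs on the data side is not a once-and-for-all solution: every NULL in every row has to be given a default value at load time.
Most relational or relational-like data sources offer a function that substitutes a default when a field is NULL, such as IFNULL(expr1,expr2) in MySQL or nvl(expr1,expr2) in Hive. Because it is hard to require the same function from every source, we recommend two approaches:

Functions (a matter of taste; these are the more flexible ones):
1. coalesce(expr1,expr2,expr3,....) returns the first non-NULL value among its arguments.
2. if(expr1,value1,value2) takes an arbitrary condition as its first argument, so it can be customized freely.
Keywords: CASE WHEN THEN ELSE END, of course.
Example: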

-- Source-side query, assumed to run in Spark SQL or Hive (Java-style date patterns).
-- 'HH' is the 24-hour field; 'hh' would be the 12-hour clock.
SELECT
  if(user_id is not null,user_id,'@') AS user_id,
  if(user_name is not null,user_name,'@') AS user_name,
  if(phone is not null,phone,'@') AS phone,
  if(create_time is not null,date_format(create_time,'yyyy-MM-dd HH:mm:ss'),date_format(CURRENT_DATE,'yyyy-MM-dd HH:mm:ss')) AS create_time,
  if(update_time is not null,date_format(update_time,'yyyy-MM-dd HH:mm:ss'),date_format(CURRENT_DATE,'yyyy-MM-dd HH:mm:ss')) AS update_time
FROM dm.dm_user_phone_detail

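Note: this approach is only appropriate when the substituted defaults do not affect how the data is used or the logic built on top of it.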
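
The same defaults can be written more compactly with coalesce. This is a sketch, assuming the query runs in Spark SQL and that create_time and update_time are timestamp columns in the source table. Using current_timestamp() as the fallback is this example's own choice, so that the default carries a real time of day.

```sql
-- Assumption: Spark SQL; create_time and update_time are timestamps in the source.
SELECT
  coalesce(user_id, '@')   AS user_id,
  coalesce(user_name, '@') AS user_name,
  coalesce(phone, '@')     AS phone,
  -- Fall back to the load time when the source value is NULL, then format once.
  date_format(coalesce(create_time, current_timestamp()), 'yyyy-MM-dd HH:mm:ss') AS create_time,
  date_format(coalesce(update_time, current_timestamp()), 'yyyy-MM-dd HH:mm:ss') AS update_time
FROM dm.dm_user_phone_detail
```

Applying coalesce before date_format keeps each column to a single formatting call, at the cost of requiring both coalesce arguments to share a type.
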

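3 Handling NULL in queries

3.1 Problem analysis

So much for table creation. Now for an example with a table that already exists and holds data, where one column contains both NULL and the empty string ''. A careless query fails here. In particular, coalesce cannot be given a fallback whose type differs from the column's type, as in: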

select coalesce(user_name, 0) as user_name_a,
       count(distinct phone) as ph_cnt
from dw.dw_user_phone_detail
group by user_name_a

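It fails with the error below. The cause is that user_name is a String while the fallback 0 is a UInt8, and ClickHouse cannot find a common type for the two: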

Code: 386, e.displayText() = DB::Exception: There is no supertype for types String, UInt8 because some of them are String/FixedString and some of them are not (version 19.17.6.36 (official build))

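3.2 Solutions

A small point worth knowing first:

GROUP BY can either repeat the expression from the SELECT list or reference the alias defined with AS. The query above is rewritten with CASE WHEN below.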

-- Option 1: group by the alias
select case when user_name is null or user_name = '' then 'null' else user_name end as user_name,
       count(distinct phone) as ph_cnt
from dw.dw_user_phone_detail
group by user_name

-- Option 2: repeat the full expression in GROUP BY
select case when user_name is null or user_name = '' then 'null' else user_name end as user_name,
       count(distinct phone) as ph_cnt
from dw.dw_user_phone_detail
group by case when user_name is null or user_name = '' then 'null' else user_name end

-- Option 3: coalesce with a String fallback (replaces only NULL; '' stays a separate group)
select coalesce(user_name,'null') as user_name,
       count(distinct phone) as ph_cnt
from dw.dw_user_phone_detail
group by coalesce(user_name,'null')

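All three queries run. The CASE WHEN variants fold both NULL and '' into the 'null' group, while the coalesce variant replaces only NULL.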
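
If the coalesce form is preferred but empty strings should also be folded in, nullIf can first turn '' into NULL. The following is a self-contained sketch: the inline sample data and the names name and name_group are invented for illustration, and it does not read from the real table.

```sql
SELECT
    -- nullIf turns '' into NULL, then coalesce replaces every NULL with 'null'.
    coalesce(nullIf(name, ''), 'null') AS name_group,
    count() AS cnt
FROM
(
    -- Sample column containing NULL, an empty string, and a real value.
    SELECT arrayJoin([NULL, '', 'alice', 'alice']) AS name
)
GROUP BY name_group
ORDER BY name_group;
```

Here the NULL row and the '' row are expected to land in the same 'null' group, which matches the behavior of the CASE WHEN variants.
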

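4 Joins

Consider the following scenario: table a is joined to table b, and the ids present in both a and b must be dropped. In Hive we would usually write: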

select a.* 
from a 
left join b 
on a.id = b.id 
where b.id is null

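In ClickHouse this does not work as written. By default (join_use_nulls = 0), the columns of unmatched rows from b are filled with the type's default value (0 for numbers, an empty string for String) rather than NULL, so b.id is null matches nothing. Comparing against the default value through coalesce does the job, provided id is numeric and 0 is never a real id: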

select a.* 
from a 
left join b 
on a.id = b.id 
where coalesce(b.id,0) = 0
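
There are other ways to get the same anti-join that do not depend on 0 being an unused id. The following sketches assume a reasonably recent ClickHouse release (the query-level SETTINGS clause and ANTI JOIN are not available in older versions such as 19.x) and reuse the tables a and b from the example above.

```sql
-- Option A: ask ClickHouse to fill unmatched right-side columns with NULL,
-- so the Hive-style filter works unchanged.
SELECT a.*
FROM a
LEFT JOIN b ON a.id = b.id
WHERE b.id IS NULL
SETTINGS join_use_nulls = 1;

-- Option B: express the intent directly with an anti-join,
-- which keeps only the rows of a that have no match in b.
SELECT a.*
FROM a
LEFT ANTI JOIN b ON a.id = b.id;

-- Option C: filter with a subquery instead of a join.
SELECT *
FROM a
WHERE id NOT IN (SELECT id FROM b);
```

Option A changes the result types of the joined columns to Nullable for that query, so downstream expressions on b's columns may need to handle NULL.
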

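5 Summary

That is a brief summary of handling NULL in ClickHouse. Table creation and queries were analyzed in some detail, while joins were covered with a single example, so adapt these techniques to your own scenarios. We hope it helps.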
