Flink SQL streaming deduplication: a source code walkthrough

Posted 2021-03-28 Flink菜鸟


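This article is based on Flink 1.12.0.

I had previously tried out the deduplication and Top-N features of Flink SQL. I only skimmed the official documentation and then implemented both in SQL, so some questions were left open. For example: without mini-batch mode the deduplication output looks very uniform. With ascending order only the first row is ever emitted, while with descending order the latest row keeps being emitted.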

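To answer these questions I went through the deduplication part of the source code. The class structure is as follows:

[Figure: class diagram of the deduplication functions]

Deduplication base class

DeduplicateFunctionBase defines the deduplication state. Because this is deduplication, a single ValueState holding the data of one row is enough (for both processing time and event time, the time attribute is carried on the row itself).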

// state stores previous message under the key.
// Keyed deduplication state.
protected ValueState<T> state;

public DeduplicateFunctionBase(
        TypeInformation<T> typeInfo,
        TypeSerializer<OUT> serializer,
        long stateRetentionTime) {
    this.typeInfo = typeInfo;
    // State retention time; it determines the scope over which rows are deduplicated.
    this.stateRetentionTime = stateRetentionTime;
    this.serializer = serializer;
}

@Override
public void open(Configuration configure) throws Exception {
    super.open(configure);
    ValueStateDescriptor<T> stateDesc = new ValueStateDescriptor<>("deduplicate-state", typeInfo);
    // Build the TTL config for the deduplication state (this is important).
    StateTtlConfig ttlConfig = createTtlConfig(stateRetentionTime);
    // Apply it only if TTL is enabled.
    if (ttlConfig.isEnabled()) {
        stateDesc.enableTimeToLive(ttlConfig);
    }
    // Create the deduplication state.
    state = getRuntimeContext().getState(stateDesc);
}

First Row on processing time

ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY process_time asc) AS rownum

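That is, keep the first row by processing time.

The handling class is ProcTimeDeduplicateKeepFirstRowFunction.

The logic relies on a property of processing time: each row's timestamp is always larger than the previous row's. So the function simply checks whether state.value() is null. If it is null, this is the first row and it is emitted. If it is not null, an earlier row exists, so the current row is not the first and is not emitted.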

public class ProcTimeDeduplicateKeepFirstRowFunction
        extends DeduplicateFunctionBase<Boolean, RowData, RowData, RowData> {

    private static final long serialVersionUID = 5865777137707602549L;

    // state stores a boolean flag to indicate whether key appears before.
    public ProcTimeDeduplicateKeepFirstRowFunction(long stateRetentionTime) {
        super(Types.BOOLEAN, null, stateRetentionTime);
    }

    @Override
    public void processElement(RowData input, Context ctx, Collector<RowData> out) throws Exception {
        // Delegate to the processing-time check: DeduplicateFunctionHelper.processFirstRowOnProcTime
        processFirstRowOnProcTime(input, state, out);
    }
}

DeduplicateFunctionHelper.processFirstRowOnProcTime

static void processFirstRowOnProcTime(
        RowData currentRow,
        ValueState<Boolean> state,
        Collector<RowData> out) throws Exception {

    // Verify that the current row is insert-only; otherwise an error is thrown.
    checkInsertOnly(currentRow);
    // ignore record if it is not first row
    // A non-null state means this is not the first row by processing time: emit nothing and return.
    if (state.value() != null) {
        return;
    }
    // First row: set the state.
    state.update(true);
    // emit the first row which is INSERT message
    out.collect(currentRow);
}
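The behaviour described above (one row per key, everything after it dropped) can be reproduced without Flink. The following is a minimal sketch, not Flink code: the class name KeepFirstRowSketch, the method dedupFirstRow, and the use of a HashMap as a stand-in for the keyed ValueState<Boolean> are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of keep-first-row deduplication; a HashMap stands in for Flink keyed state. */
class KeepFirstRowSketch {

    /** Each event is {key, value}. Returns the values that would be emitted, in arrival order. */
    static List<String> dedupFirstRow(List<String[]> events) {
        // Hypothetical stand-in for ValueState<Boolean>: one flag per key.
        Map<String, Boolean> seen = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String[] event : events) {
            String key = event[0];
            if (seen.get(key) != null) {
                continue; // not the first row for this key: drop it
            }
            seen.put(key, true); // mark the key as seen
            out.add(event[1]);   // emit the first row
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> events = List.of(
                new String[] {"c1", "row-1"},
                new String[] {"c1", "row-2"},
                new String[] {"c2", "row-3"},
                new String[] {"c1", "row-4"});
        System.out.println(dedupFirstRow(events)); // prints [row-1, row-3]
    }
}
```

As long as the flag for a key stays in the map, no later row for that key is emitted, which matches the "only the first row is ever output" observation.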

Last Row on processing time

ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY process_time desc) AS rownum

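That is, keep the last row by processing time.

The logic again relies on the property that each row's processing time is larger than the previous row's. The function checks whether state.value() is null. If it is null, this is the first row and it is emitted directly. If it is not null, an earlier row exists: the function decides whether to emit an update for the previous row and then emits the current row.

The handling class is ProcTimeDeduplicateKeepLastRowFunction. One difference for Last Row is that, when the input is a CDC source, deleting the previous row is also supported (not discussed here).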

public class ProcTimeDeduplicateKeepLastRowFunction
        extends DeduplicateFunctionBase<RowData, RowData, RowData, RowData> {

    private static final long serialVersionUID = -291348892087180350L;
    private final boolean generateUpdateBefore;
    private final boolean generateInsert;
    private final boolean inputIsInsertOnly;

    public ProcTimeDeduplicateKeepLastRowFunction(
            InternalTypeInfo<RowData> typeInfo,
            long stateRetentionTime,
            boolean generateUpdateBefore,
            boolean generateInsert,
            boolean inputInsertOnly) {
        super(typeInfo, null, stateRetentionTime);
        this.generateUpdateBefore = generateUpdateBefore;
        this.generateInsert = generateInsert;
        // StreamExecChangelogNormalize sets this to false; StreamExecDeduplicate sets it to true.
        this.inputIsInsertOnly = inputInsertOnly;
    }

    @Override
    public void processElement(RowData input, Context ctx, Collector<RowData> out) throws Exception {
        // Check whether the input is insert-only.
        if (inputIsInsertOnly) {
            // Insert-only input: DeduplicateFunctionHelper.processLastRowOnProcTime
            processLastRowOnProcTime(input, generateUpdateBefore, generateInsert, state, out);
        } else {
            // A changelog input can carry delete messages that remove the previous row:
            // DeduplicateFunctionHelper.processLastRowOnChangelog
            processLastRowOnChangelog(input, generateUpdateBefore, state, out);
        }
    }
}

DeduplicateFunctionHelper.processLastRowOnProcTime

/**
 * Processes element to deduplicate on keys with process time semantic, sends current element as last row,
 * retracts previous element if needed.
 *
 * @param currentRow latest row received by deduplicate function
 * @param generateUpdateBefore whether need to send UPDATE_BEFORE message for updates
 * @param state state of function, null if generateUpdateBefore is false
 * @param out underlying collector
 */
static void processLastRowOnProcTime(
        RowData currentRow,
        boolean generateUpdateBefore,
        boolean generateInsert,
        ValueState<RowData> state,
        Collector<RowData> out) throws Exception {

    // Verify that the current row is insert-only.
    checkInsertOnly(currentRow);
    // Do we need to retract the previous row, or to emit INSERT?
    if (generateUpdateBefore || generateInsert) {
        // use state to keep the previous row content if we need to generate UPDATE_BEFORE
        // or use to distinguish the first row, if we need to generate INSERT
        // Read the deduplication state, then store the current row.
        RowData preRow = state.value();
        state.update(currentRow);
        // No previous row: emit the current row directly.
        if (preRow == null) {
            // the first row, send INSERT message
            currentRow.setRowKind(RowKind.INSERT);
            out.collect(currentRow);
        } else {
            // A previous row exists. If configured, emit it as UPDATE_BEFORE
            // (so that downstream can update the old data).
            if (generateUpdateBefore) {
                preRow.setRowKind(RowKind.UPDATE_BEFORE);
                out.collect(preRow);
            }
            // Then emit the current row.
            currentRow.setRowKind(RowKind.UPDATE_AFTER);
            out.collect(currentRow);
        }
    } else {
        // Neither UPDATE_BEFORE nor INSERT is needed: emit an update.
        // always send UPDATE_AFTER if INSERT is not needed
        currentRow.setRowKind(RowKind.UPDATE_AFTER);
        out.collect(currentRow);
    }
}
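To make the resulting changelog concrete, here is a small plain-Java sketch of the same branching. It is an illustration only: KeepLastRowSketch and dedupLastRow are hypothetical names, a HashMap replaces the keyed ValueState<RowData>, and rows are plain strings prefixed with the row-kind short codes +I (INSERT), -U (UPDATE_BEFORE) and +U (UPDATE_AFTER).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of keep-last-row deduplication and the changelog it produces. */
class KeepLastRowSketch {

    /** Each event is {key, value}. Returns the emitted messages, prefixed with a row-kind code. */
    static List<String> dedupLastRow(
            List<String[]> events, boolean generateUpdateBefore, boolean generateInsert) {
        // Hypothetical stand-in for ValueState<RowData>: the previous row per key.
        Map<String, String> state = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String[] event : events) {
            String key = event[0];
            String current = event[1];
            if (generateUpdateBefore || generateInsert) {
                // put() returns the old value, like state.value() followed by state.update().
                String previous = state.put(key, current);
                if (previous == null) {
                    out.add("+I " + current); // first row for this key
                } else {
                    if (generateUpdateBefore) {
                        out.add("-U " + previous); // retract the previous row
                    }
                    out.add("+U " + current);
                }
            } else {
                out.add("+U " + current); // state is not consulted at all
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> events = List.of(
                new String[] {"c1", "row-1"},
                new String[] {"c1", "row-2"},
                new String[] {"c2", "row-3"});
        // prints [+I row-1, -U row-1, +U row-2, +I row-3]
        System.out.println(dedupLastRow(events, true, true));
    }
}
```

Every incoming row produces output here, which is why the descending-order query appears to keep emitting the latest row.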

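Deduplication on event time

The event-time code differs from the processing-time code: keeping the first row and keeping the last row are handled by one class, distinguished by a boolean variable keepLastRow.

Event-time deduplication class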

public class RowTimeDeduplicateFunction
        extends DeduplicateFunctionBase<RowData, RowData, RowData, RowData> {

    private static final long serialVersionUID = 1L;

    private final boolean generateUpdateBefore;
    private final boolean generateInsert;
    private final int rowtimeIndex;
    private final boolean keepLastRow;

    public RowTimeDeduplicateFunction(
            InternalTypeInfo<RowData> typeInfo,
            long minRetentionTime,
            int rowtimeIndex,
            boolean generateUpdateBefore,
            boolean generateInsert,
            boolean keepLastRow) {
        super(typeInfo, null, minRetentionTime);
        // Whether to emit UPDATE_BEFORE for the previous row.
        this.generateUpdateBefore = generateUpdateBefore;
        // Whether to emit INSERT.
        this.generateInsert = generateInsert;
        // Index of the event-time column.
        this.rowtimeIndex = rowtimeIndex;
        // Keep the last row (true) or the first row (false).
        this.keepLastRow = keepLastRow;
    }

    @Override
    public void processElement(RowData input, Context ctx, Collector<RowData> out) throws Exception {
        deduplicateOnRowTime(
                state, input, out, generateUpdateBefore, generateInsert, rowtimeIndex, keepLastRow);
    }

    /**
     * Processes element to deduplicate on keys with row time semantic, sends current element if it is last
     * or first row, retracts previous element if needed.
     *
     * @param state state of function
     * @param currentRow latest row received by deduplicate function
     * @param out underlying collector
     * @param generateUpdateBefore flag to generate UPDATE_BEFORE message or not
     * @param generateInsert flag to generate INSERT message or not
     * @param rowtimeIndex the index of rowtime field
     * @param keepLastRow flag to keep last row or keep first row
     */
    public static void deduplicateOnRowTime(
            ValueState<RowData> state,
            RowData currentRow,
            Collector<RowData> out,
            boolean generateUpdateBefore,
            boolean generateInsert,
            int rowtimeIndex,
            boolean keepLastRow) throws Exception {
        checkInsertOnly(currentRow);
        RowData preRow = state.value();

        // Despite its name, isDuplicate returns true when the current row should
        // replace the stored one, so this branch emits the update.
        if (isDuplicate(preRow, currentRow, rowtimeIndex, keepLastRow)) {
            updateDeduplicateResult(
                    generateUpdateBefore, generateInsert, preRow, currentRow, out);
            // Write the current row into the state.
            state.update(currentRow);
        }
    }
}

Event-time duplicate check

static boolean isDuplicate(RowData preRow, RowData currentRow, int rowtimeIndex, boolean keepLastRow) {
    if (keepLastRow) {
        // Keep last row: the state is null, or the stored row's time <= the current row's time.
        return preRow == null
                || getRowtime(preRow, rowtimeIndex) <= getRowtime(currentRow, rowtimeIndex);
    } else {
        // Keep first row: the state is null, or the current row's time < the stored row's time.
        return preRow == null
                || getRowtime(currentRow, rowtimeIndex) < getRowtime(preRow, rowtimeIndex);
    }
}

// Only the event-time column is deserialized.
private static long getRowtime(RowData input, int rowtimeIndex) {
    return input.getLong(rowtimeIndex);
}
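Unlike processing time, event time can arrive out of order, so the comparison above decides which rows replace the stored one. The sketch below replays that comparison for a single key on a sequence of row times. RowTimeDedupSketch, shouldReplace and accepted are hypothetical names, and rows are reduced to their timestamp.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the event-time comparison, for a single key. */
class RowTimeDedupSketch {

    /** Mirrors the comparison in isDuplicate: true means the current row replaces the stored one. */
    static boolean shouldReplace(Long preTime, long currentTime, boolean keepLastRow) {
        if (keepLastRow) {
            return preTime == null || preTime <= currentTime;
        } else {
            return preTime == null || currentTime < preTime;
        }
    }

    /** Returns the row times that are accepted (emitted and written to state), in arrival order. */
    static List<Long> accepted(long[] rowTimes, boolean keepLastRow) {
        Long stored = null; // hypothetical stand-in for the row time held in state
        List<Long> out = new ArrayList<>();
        for (long t : rowTimes) {
            if (shouldReplace(stored, t, keepLastRow)) {
                stored = t;
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] rowTimes = {5, 3, 8, 8, 1}; // out-of-order arrival
        System.out.println(accepted(rowTimes, false)); // keep first row: prints [5, 3, 1]
        System.out.println(accepted(rowTimes, true));  // keep last row: prints [5, 8, 8]
    }
}
```

Note the asymmetry: with keep-last a row with an equal timestamp replaces the stored row, while with keep-first only a strictly earlier row does.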

DeduplicateFunctionHelper.updateDeduplicateResult

static void updateDeduplicateResult(
        boolean generateUpdateBefore,
        boolean generateInsert,
        RowData preRow,
        RowData currentRow,
        Collector<RowData> out) {

    // Either UPDATE_BEFORE for the previous row or INSERT is required.
    if (generateUpdateBefore || generateInsert) {
        // There is no previous row.
        if (preRow == null) {
            // the first row, send INSERT message
            currentRow.setRowKind(RowKind.INSERT);
            out.collect(currentRow);
        } else {
            // If the previous row has to be retracted
            if (generateUpdateBefore) {
                final RowKind preRowKind = preRow.getRowKind();
                // Emit the previous row as UPDATE_BEFORE, then restore its original kind.
                preRow.setRowKind(RowKind.UPDATE_BEFORE);
                out.collect(preRow);
                preRow.setRowKind(preRowKind);
            }
            // Emit the current row as UPDATE_AFTER.
            currentRow.setRowKind(RowKind.UPDATE_AFTER);
            out.collect(currentRow);
        }
    } else {
        // Emit the current row as UPDATE_AFTER.
        currentRow.setRowKind(RowKind.UPDATE_AFTER);
        out.collect(currentRow);
    }
}

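The code makes the deduplication logic clear. The point to watch is that the deduplication state has a TTL. The default TTL is 36000 s (worth checking against your version: the Flink configuration reference lists the default of table.exec.state.ttl as 0 ms, meaning idle state is never cleared). So by default, when keeping the first row and as long as the state has not expired, each key emits only one row right after the job starts, which gives the false impression that deduplication is global.
To change the state retention time, set the parameter table.exec.state.ttl=60s. In the code, see the member variable stateRetentionTime of DeduplicateFunctionBase.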
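The effect of the TTL on first-row deduplication can be sketched in plain Java. This is a simplified model under stated assumptions, not Flink's TTL implementation: TtlFirstRowSketch, Event and dedupFirstRow are hypothetical names, timestamps are passed in explicitly instead of read from a clock, the state is assumed to expire ttlMillis after it was written, and a TTL of 0 is taken to mean "never expire".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of keep-first-row deduplication with an expiring per-key state. */
class TtlFirstRowSketch {

    /** Hypothetical input record: arrival time in milliseconds, key, payload. */
    static final class Event {
        final long time;
        final String key;
        final String value;

        Event(long time, String key, String value) {
            this.time = time;
            this.key = key;
            this.value = value;
        }
    }

    /** Returns the emitted values; ttlMillis == 0 is modelled as "state never expires". */
    static List<String> dedupFirstRow(List<Event> events, long ttlMillis) {
        // Stand-in for the keyed state: the time at which the flag for a key was written.
        Map<String, Long> writtenAt = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (Event e : events) {
            Long written = writtenAt.get(e.key);
            boolean expired = written != null && ttlMillis > 0 && e.time - written >= ttlMillis;
            if (written != null && !expired) {
                continue; // state still alive: not the first row, drop it
            }
            writtenAt.put(e.key, e.time); // (re)create the state
            out.add(e.value);             // emit as a new first row
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event(0L, "c1", "row-1"),
                new Event(30_000L, "c1", "row-2"),
                new Event(61_000L, "c1", "row-3"),
                new Event(70_000L, "c1", "row-4"));
        System.out.println(dedupFirstRow(events, 60_000L)); // prints [row-1, row-3]
        System.out.println(dedupFirstRow(events, 0L));      // prints [row-1]
    }
}
```

With a 60 s TTL a key produces a new "first row" once its state has expired, while with no expiry the key emits exactly once, which is the globally deduplicated behaviour described above.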
