Flink - CoGroup

Posted fxjwind

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Flink - CoGroup相关的知识,希望对你有一定的参考价值。

使用方式,

dataStream.coGroup(otherStream)
    .where(0).equalTo(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply (new CoGroupFunction () {...});

 

可以看到coGroup只是产生CoGroupedStreams

    public <T2> CoGroupedStreams<T, T2> coGroup(DataStream<T2> otherStream) {
        return new CoGroupedStreams<>(this, otherStream);
    }

 

而where, equalTo只是添加keySelector,对于两个流需要分别指定

keySelector1,keySelector2

 

window设置双流的窗口,很容易理解

 

apply,

       /**
         * Completes the co-group operation with the user function that is executed
         * for windowed groups.
         *
         * <p>Note: This method‘s return type does not support setting an operator-specific parallelism.
         * Due to binary backwards compatibility, this cannot be altered. Use the
         * {@link #with(CoGroupFunction, TypeInformation)} method to set an operator-specific parallelism.
         */
        public <T> DataStream<T> apply(CoGroupFunction<T1, T2, T> function, TypeInformation<T> resultType) {
            //clean the closure
            function = input1.getExecutionEnvironment().clean(function);

            UnionTypeInfo<T1, T2> unionType = new UnionTypeInfo<>(input1.getType(), input2.getType());
            UnionKeySelector<T1, T2, KEY> unionKeySelector = new UnionKeySelector<>(keySelector1, keySelector2);

            DataStream<TaggedUnion<T1, T2>> taggedInput1 = input1 //将input1封装成TaggedUnion,很简单,就是赋值到one上
                    .map(new Input1Tagger<T1, T2>())
                    .setParallelism(input1.getParallelism())
                    .returns(unionType);
            DataStream<TaggedUnion<T1, T2>> taggedInput2 = input2 //将input2封装成TaggedUnion
                    .map(new Input2Tagger<T1, T2>())
                    .setParallelism(input2.getParallelism())
                    .returns(unionType);

            DataStream<TaggedUnion<T1, T2>> unionStream = taggedInput1.union(taggedInput2); //由于现在双流都是TaggedUnion类型,union成一个流,问题被简化

            // we explicitly create the keyed stream to manually pass the key type information in
            WindowedStream<TaggedUnion<T1, T2>, KEY, W> windowOp = //创建窗口
                    new KeyedStream<TaggedUnion<T1, T2>, KEY>(unionStream, unionKeySelector, keyType)
                    .window(windowAssigner);

            if (trigger != null) { //如果有trigger,evictor,设置上
                windowOp.trigger(trigger);
            }
            if (evictor != null) {
                windowOp.evictor(evictor);
            }

            return windowOp.apply(new CoGroupWindowFunction<T1, T2, T, KEY, W>(function), resultType); //调用window的apply
        }

关键理解,他要把两个流变成一个流,这样问题域就变得很简单了

最终调用到WindowedStream的apply,apply是需要保留window里面的所有原始数据的,和reduce不一样

apply的逻辑,是CoGroupWindowFunction

 

private static class CoGroupWindowFunction<T1, T2, T, KEY, W extends Window>
            extends WrappingFunction<CoGroupFunction<T1, T2, T>>
            implements WindowFunction<TaggedUnion<T1, T2>, T, KEY, W> {

        private static final long serialVersionUID = 1L;

        public CoGroupWindowFunction(CoGroupFunction<T1, T2, T> userFunction) {
            super(userFunction);
        }

        @Override
        public void apply(KEY key,
                W window,
                Iterable<TaggedUnion<T1, T2>> values,
                Collector<T> out) throws Exception {

            List<T1> oneValues = new ArrayList<>();
            List<T2> twoValues = new ArrayList<>();

            for (TaggedUnion<T1, T2> val: values) {
                if (val.isOne()) {
                    oneValues.add(val.getOne());
                } else {
                    twoValues.add(val.getTwo());
                }
            }
            wrappedFunction.coGroup(oneValues, twoValues, out);
        }
    }
}

逻辑也非常的简单,就是将该key所在window里面的value,放到oneValues, twoValues两个列表中

最终调用到用户定义的wrappedFunction.coGroup

 

DataStream.join就是用CoGroup实现的

            return input1.coGroup(input2)
                    .where(keySelector1)
                    .equalTo(keySelector2)
                    .window(windowAssigner)
                    .trigger(trigger)
                    .evictor(evictor)
                    .apply(new FlatJoinCoGroupFunction<>(function), resultType);

 

FlatJoinCoGroupFunction

private static class FlatJoinCoGroupFunction<T1, T2, T>
            extends WrappingFunction<FlatJoinFunction<T1, T2, T>>
            implements CoGroupFunction<T1, T2, T> {
        private static final long serialVersionUID = 1L;

        public FlatJoinCoGroupFunction(FlatJoinFunction<T1, T2, T> wrappedFunction) {
            super(wrappedFunction);
        }

        @Override
        public void coGroup(Iterable<T1> first, Iterable<T2> second, Collector<T> out) throws Exception {
            for (T1 val1: first) {
                for (T2 val2: second) {
                    wrappedFunction.join(val1, val2, out);
                }
            }
        }
    }

可以看出当前join是inner join,必须first和second都有的情况下,才会调到用户的join函数

以上是关于Flink - CoGroup的主要内容,如果未能解决你的问题,请参考以下文章

Flink 指定 keys 的几种方法

Flink 指定 keys 的几种方法

Flink 双流 Join 的3种操作示例

Flink双流join的3种方式及IntervalJoin源码分析

Spark join和cogroup算子

Pig : Cogroup 如何避免空白值