How the Reduce-Side Join Groups and Aggregates by Composite Key: The Principle Explained
Posted by bclshuai
1. Goal of the reduce-side join
Weather station dataset: station id and station name
StationId StationName
1~hangzhou
2~shanghai
3~beijing
Temperature record dataset
StationId TimeStamp Temperature
3~20200216~6
3~20200215~2
3~20200217~8
1~20200211~9
1~20200210~8
2~20200214~3
2~20200215~4
Goal: join the two datasets above, attaching each station's name to its temperature records by station id. The final output should be as follows (a sketch of the map functions that prepare records for such a join is given right after this listing):
1~hangzhou~20200211~9
1~hangzhou~20200210~8
2~shanghai~20200214~3
2~shanghai~20200215~4
3~beijing~20200216~6
3~beijing~20200215~2
3~beijing~20200217~8
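To make the setup concrete, here is a minimal sketch of the two map functions for such a reduce-side join. The class names (JoinMappers, StationMapper, TemperatureMapper) and the composite-key class TextPair are assumptions of this sketch, not code from the article; TextPair is taken to be a WritableComparable that holds (stationId, tag), offers a (String, String) constructor and a getFirst() accessor, and sorts by stationId first and tag second.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JoinMappers {

    // Station table mapper: "1~hangzhou" -> key (1, "0"), value "hangzhou".
    // The tag "0" makes the station name sort ahead of its temperature records.
    public static class StationMapper
            extends Mapper<LongWritable, Text, TextPair, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("~");
            context.write(new TextPair(fields[0], "0"), new Text(fields[1]));
        }
    }

    // Temperature record mapper: "3~20200216~6" -> key (3, "1"), value "20200216~6".
    public static class TemperatureMapper
            extends Mapper<LongWritable, Text, TextPair, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            int sep = line.indexOf('~');
            context.write(new TextPair(line.substring(0, sep), "1"),
                    new Text(line.substring(sep + 1)));
        }
    }
}

In the driver, the two mappers would typically be attached to their input paths with MultipleInputs.addInputPath(job, stationPath, TextInputFormat.class, JoinMappers.StationMapper.class) and the same call for the temperature path.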
2. Key question: how does reduce group and aggregate?
The map output is sorted by the first field of the composite key, stationId, in ascending order; records with the same stationId are then sorted by the second field, also ascending, so station-name records and temperature records end up interleaved. During the shuffle, map output goes through the partitioner on its way to the reducers: records with the same stationId are sent to the same reduce task, and within a reduce task all records sharing a stationId form one group. Assuming two reduce tasks and a partition function of stationId % 2, the partitioned result is shown below (a partitioner sketch follows the listing):
Partition 1
<1,0> hangzhou
<1,1> 20200211~9
<1,1> 20200210~8
<3,0> beijing
<3,1> 20200216~6
<3,1> 20200215~2
<3,1> 20200217~8
Partition 2
<2,0> shanghai
<2,1> 20200214~3
<2,1> 20200215~4
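A partitioner that reproduces this split might look like the sketch below. The class name StationPartitioner is hypothetical, and TextPair is the assumed composite-key class from the mapper sketch above.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitions on the stationId part of the composite key only, so a station's
// name record and all of its temperature records reach the same reducer.
public class StationPartitioner extends Partitioner<TextPair, Text> {
    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
        // With two reducers and numeric ids this reproduces the stationId % 2
        // split shown above; a hash of the id would be used for non-numeric keys.
        return Integer.parseInt(key.getFirst().toString()) % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(StationPartitioner.class).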
After partitioning, the data in each partition is grouped and aggregated by stationId (a grouping-comparator sketch follows the listing):
Partition 1
Group 1
<1,0> <hangzhou, 20200211~9, 20200210~8>
Group 2
<3,0> <beijing, 20200216~6, 20200215~2, 20200217~8>
Partition 2
<2,0> <shanghai, 20200214~3, 20200215~4>
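Grouping by stationId alone, while the full composite key still drives the sort order, is what a grouping comparator provides. A minimal sketch, again assuming the hypothetical TextPair key with a getFirst() accessor:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Compares only the stationId part of the composite key, so <1,0> and <1,1>
// fall into the same reduce() call even though the full keys differ.
public class StationGroupComparator extends WritableComparator {
    public StationGroupComparator() {
        super(TextPair.class, true); // true: create key instances for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        TextPair left = (TextPair) a;
        TextPair right = (TextPair) b;
        // Only the first field decides the group; the tag still affects the sort
        // order, so the station name is always the first value in each group.
        return left.getFirst().compareTo(right.getFirst());
    }
}

It would be registered with job.setGroupingComparatorClass(StationGroupComparator.class).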
3. Analysis of the underlying mechanism
The source code of the Reducer class is shown below:
//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//

package org.apache.hadoop.mapreduce;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.classification.InterfaceAudience.Public;
import org.apache.hadoop.classification.InterfaceStability.Stable;
import org.apache.hadoop.mapreduce.ReduceContext.ValueIterator;
import org.apache.hadoop.mapreduce.task.annotation.Checkpointable;

@Checkpointable
@Public
@Stable
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public Reducer() {
    }

    protected void setup(Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
    }

    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        Iterator i$ = values.iterator();

        while(i$.hasNext()) {
            VALUEIN value = i$.next();
            context.write(key, value);
        }
    }

    protected void cleanup(Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
    }

    public void run(Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        this.setup(context);

        try {
            while(context.nextKey()) {
                this.reduce(context.getCurrentKey(), context.getValues(), context);
                Iterator<VALUEIN> iter = context.getValues().iterator();
                if (iter instanceof ValueIterator) {
                    ((ValueIterator)iter).resetBackupStore();
                }
            }
        } finally {
            this.cleanup(context);
        }
    }

    public abstract class Context implements ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public Context() {
        }
    }
}
Debugging shows that after the shuffle the framework does not call our Reducer implementation's reduce() method directly for every record. Instead it executes the run() method in the listing above: run() calls context.nextKey() to iterate over the grouped keys and, for each key, passes the corresponding iterable of values to this.reduce(), which is the reduce() method implemented in our own code. This is where the per-key grouping behavior comes from.
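Within each call of our own reduce(), the join itself is then straightforward: thanks to the secondary sort on the tag, the station name is the first value of the group and the remaining values are "timestamp~temperature" records. A minimal sketch under the same TextPair assumption (the class name JoinReducer is mine):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<TextPair, Text, Text, NullWritable> {
    @Override
    protected void reduce(TextPair key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        String stationName = iter.next().toString(); // station name arrives first
        String stationId = key.getFirst().toString();
        while (iter.hasNext()) {
            // e.g. "1~hangzhou~20200211~9", matching the target output in section 1
            context.write(new Text(stationId + "~" + stationName + "~" + iter.next()),
                    NullWritable.get());
        }
    }
}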
A concrete example of reduce-side grouping and sorting can be found at the link below:
https://www.cnblogs.com/bclshuai/p/12319490.html