Hadoop Serialization and Writable Source Code Analysis

Posted 2020-06-12


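The concept of serialization
1. Serialization is the process of converting a structured object into a byte stream.
2. Deserialization is the reverse process: converting a byte stream back into a structured object.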

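Characteristics of Hadoop serialization
1. Properties of the serialization format
   - Compact: makes efficient use of storage space
   - Fast: little overhead when reading and writing data
   - Extensible: data written in an older format can still be read transparently
   - Interoperable: supports interaction across multiple languages
   Note: serialization in Hadoop 1.x satisfies only the compact and fast properties.

2. Serialization plays two main roles in a distributed environment: inter-process communication and persistent storage.
3. Communication between Hadoop nodes
[Figure: communication between Hadoop nodes]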

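Writable comes up here because the Writable interface is the common interface for serializable objects in Hadoop; the package org.apache.hadoop.io defines a large number of serializable types based on it.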

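import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
    // Serialize the fields of this object to out.
    void write(DataOutput out) throws IOException;
    // Deserialize the fields of this object from in.
    void readFields(DataInput in) throws IOException;
}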

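A class only needs to implement this interface to be serializable. The figure below shows the hierarchy of the Writable classes:

[Figure: Writable class hierarchy]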
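
To make the contract concrete, here is a minimal standalone sketch of a type that follows the same two-method pattern. It does not use Hadoop: SketchWritable is a local stand-in for org.apache.hadoop.io.Writable, and PageView, its fields and the toBytes()/fromBytes() helpers are hypothetical names chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in for org.apache.hadoop.io.Writable so the sketch compiles without Hadoop.
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical record type with two fields, used only for illustration.
class PageView implements SketchWritable {
    long timestamp;
    int statusCode;

    // No-arg constructor so an empty instance can be filled in by readFields().
    PageView() {}

    PageView(long timestamp, int statusCode) {
        this.timestamp = timestamp;
        this.statusCode = statusCode;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);  // 8 bytes
        out.writeInt(statusCode);  // 4 bytes
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in the same order in which they were written.
        timestamp = in.readLong();
        statusCode = in.readInt();
    }

    // Serialize any SketchWritable into a byte array.
    static byte[] toBytes(SketchWritable writable) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writable.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild a PageView from bytes produced by toBytes().
    static PageView fromBytes(byte[] bytes) {
        try {
            PageView view = new PageView();
            view.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return view;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With these definitions, a PageView serializes to 12 bytes (8 for the long plus 4 for the int), and fromBytes(toBytes(...)) should give back a copy with the same two field values.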

Let's go through it step by step, starting with IntWritable and LongWritable.

[Figure: class diagram of IntWritable and LongWritable]

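As the figure shows, the WritableComparable interface extends both Writable and Comparable, which adds comparison support. The basic types in the hierarchy, such as IntWritable, LongWritable and ByteWritable, all implement it. The readFields() methods of IntWritable and LongWritable read binary data directly from an input stream that implements DataInput and rebuild an int or a long, while write() converts the int or long value directly into binary form on the output stream. Both classes also contain Comparator inner classes (in the figure, IntWritable contains Comparator, and LongWritable contains Comparator and DecreasingComparator). These make it possible to compare values directly in the serialized byte stream. This is an optimization: no objects have to be created through deserialization before comparing.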

A rough sketch helps to place Comparator in this picture:

[Figure: sketch of where Comparator fits in]

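As the sketch shows, a Comparator is just a comparison utility. The relationship resembles that of Collection and Collections in the Java collections framework, where Collections is only a utility class with methods for processing collection data. WritableComparator does not inherit from WritableComparable; it merely holds WritableComparable objects as members.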

Here is the code of IntWritable:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class IntWritable implements WritableComparable {
    private int value;

    public IntWritable() {}
    public IntWritable(int value) { set(value); }

    /** Set the value of this IntWritable. */
    public void set(int value) { this.value = value; }
    /** Return the value of this IntWritable. */
    public int get() { return value; }

    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    /** Returns true iff <code>o</code> is a IntWritable with the same value. */
    public boolean equals(Object o) {
        if (!(o instanceof IntWritable))
            return false;
        IntWritable other = (IntWritable)o;
        return this.value == other.value;
    }
    public int hashCode() {
        return value;
    }

    /** Compares two IntWritables. */
    public int compareTo(Object o) {
        int thisValue = this.value;
        int thatValue = ((IntWritable)o).value;
        return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
    }
    public String toString() {
        return Integer.toString(value);
    }

    /** A Comparator optimized for IntWritable. */
    public static class Comparator extends WritableComparator {
        public Comparator() {
            super(IntWritable.class);
        }
        // Compares the serialized forms directly, without creating IntWritable objects.
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            int thisValue = readInt(b1, s1);
            int thatValue = readInt(b2, s2);
            return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
        }
    }

    static { // register this comparator
        WritableComparator.define(IntWritable.class, new Comparator());
    }
}

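The static block at the end calls the static method WritableComparator.define() to register the Comparator above, that is, to add it to the comparators member of WritableComparator, which is a static HashMap. This tells WritableComparator to return the registered comparator (IntWritable.Comparator in this case) when WritableComparator.get(IntWritable.class) is called. We can then use comparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) to compare b1 and b2 without deserializing them into objects. Comparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) relies on readInt(), which is inherited from WritableComparator and rebuilds the value of the IntWritable from the byte array by bit shifting.

Note: if no Comparator has been registered in comparators for the class being compared, a default Comparator is returned. Comparing b1 and b2 with the compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) method of this default Comparator does deserialize them into objects first; see the discussion of WritableComparator later on.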
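
The raw comparison can be sketched in isolation as follows. This is a standalone approximation rather than Hadoop's implementation: RawIntCompareSketch and its serialize() helper are hypothetical names, and the sketch assumes the 4-byte big-endian layout that DataOutput.writeInt() produces.

```java
import java.nio.ByteBuffer;

class RawIntCompareSketch {
    // Rebuild a big-endian int from four bytes by masking and shifting.
    static int readInt(byte[] bytes, int start) {
        return ((bytes[start] & 0xff) << 24)
             + ((bytes[start + 1] & 0xff) << 16)
             + ((bytes[start + 2] & 0xff) << 8)
             + (bytes[start + 3] & 0xff);
    }

    // Same parameter layout as the compare() in the text: buffer, start offset, length.
    // The lengths are unused here because an int always occupies four bytes.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int thisValue = readInt(b1, s1);
        int thatValue = readInt(b2, s2);
        return Integer.compare(thisValue, thatValue);
    }

    // Helper for the demo: the 4-byte big-endian form of an int
    // (ByteBuffer is big-endian by default).
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    public static void main(String[] args) {
        byte[] a = serialize(3);
        byte[] b = serialize(10);
        // Negative result: 3 sorts before 10, decided from the bytes alone.
        System.out.println(compare(a, 0, a.length, b, 0, b.length));
    }
}
```

No object is allocated per comparison, which is where the saving comes from when a sort compares the same serialized keys many times.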

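LongWritable works essentially the same way as IntWritable. The differences are that its value is a long and that it has an additional LongWritable.DecreasingComparator, which extends LongWritable.Comparator and returns the negated result of the parent's comparison. It is presumably meant for sorting in descending order.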

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class LongWritable implements WritableComparable {
    private long value;

    public LongWritable() {}
    public LongWritable(long value) { set(value); }

    /** Set the value of this LongWritable. */
    public void set(long value) { this.value = value; }
    /** Return the value of this LongWritable. */
    public long get() { return value; }

    public void readFields(DataInput in) throws IOException {
        value = in.readLong();
    }
    public void write(DataOutput out) throws IOException {
        out.writeLong(value);
    }

    /** Returns true iff <code>o</code> is a LongWritable with the same value. */
    public boolean equals(Object o) {
        if (!(o instanceof LongWritable))
            return false;
        LongWritable other = (LongWritable)o;
        return this.value == other.value;
    }
    public int hashCode() {
        return (int)value;
    }

    /** Compares two LongWritables. */
    public int compareTo(Object o) {
        long thisValue = this.value;
        long thatValue = ((LongWritable)o).value;
        return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
    }
    public String toString() {
        return Long.toString(value);
    }

    /** A Comparator optimized for LongWritable. */
    public static class Comparator extends WritableComparator {
        public Comparator() {
            super(LongWritable.class);
        }
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            long thisValue = readLong(b1, s1);
            long thatValue = readLong(b2, s2);
            return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
        }
    }

    /** A decreasing Comparator optimized for LongWritable. */
    public static class DecreasingComparator extends Comparator {
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b);
        }
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return -super.compare(b1, s1, l1, b2, s2, l2);
        }
    }

    static { // register default comparator
        WritableComparator.define(LongWritable.class, new Comparator());
    }
}
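
The effect of negating the comparison result can be illustrated with plain JDK classes. The sketch below is an assumption-laden stand-in, not Hadoop code: DecreasingOrderSketch, INCREASING, DECREASING and sortDescending() are hypothetical names, and each record is taken to be the 8-byte big-endian form that DataOutput.writeLong() writes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;

class DecreasingOrderSketch {
    // Decode an 8-byte big-endian long starting at the given offset.
    static long readLong(byte[] bytes, int start) {
        return ByteBuffer.wrap(bytes, start, 8).getLong();
    }

    // Ascending order on the serialized form.
    static final Comparator<byte[]> INCREASING =
        (a, b) -> Long.compare(readLong(a, 0), readLong(b, 0));

    // Descending order: negate the ascending result, as DecreasingComparator does.
    // Negation is safe here because the ascending result is never Integer.MIN_VALUE.
    static final Comparator<byte[]> DECREASING =
        (a, b) -> -INCREASING.compare(a, b);

    // Serialize the values, sort the serialized records, and decode the sorted result.
    static long[] sortDescending(long... values) {
        byte[][] records = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            records[i] = ByteBuffer.allocate(8).putLong(values[i]).array();
        }
        Arrays.sort(records, DECREASING);
        long[] sorted = new long[records.length];
        for (int i = 0; i < records.length; i++) {
            sorted[i] = readLong(records[i], 0);
        }
        return sorted;
    }
}
```

Calling sortDescending(3, -7, 42, 0) should return the values from largest to smallest.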

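ByteWritable, BooleanWritable, FloatWritable and DoubleWritable are all largely the same.

Next come VIntWritable and VLongWritable. The two classes are nearly identical, and VIntWritable encodes and decodes its value with the same methods as VLongWritable. The main difference is that VIntWritable holds an int value while VLongWritable holds a long value, which follows from their value ranges. Unlike the classes above, neither has a Comparator of its own.

Looking at VLongWritable is enough. Its source is as follows: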

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class VLongWritable implements WritableComparable {
    private long value;

    public VLongWritable() {}
    public VLongWritable(long value) { set(value); }

    /** Set the value of this LongWritable. */
    public void set(long value) { this.value = value; }
    /** Return the value of this LongWritable. */
    public long get() { return value; }

    public void readFields(DataInput in) throws IOException {
        value = WritableUtils.readVLong(in);
    }
    public void write(DataOutput out) throws IOException {
        WritableUtils.writeVLong(out, value);
    }

    /** Returns true iff <code>o</code> is a VLongWritable with the same value. */
    public boolean equals(Object o) {
        if (!(o instanceof VLongWritable))
            return false;
        VLongWritable other = (VLongWritable)o;
        return this.value == other.value;
    }
    public int hashCode() {
        return (int)value;
    }

    /** Compares two VLongWritables. */
    public int compareTo(Object o) {
        long thisValue = this.value;
        long thatValue = ((VLongWritable)o).value;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
    public String toString() {
        return Long.toString(value);
    }
}

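The source shows that encoding uses WritableUtils.writeVLong(). WritableUtils is a utility class for encoding and decoding; for now we only look at the parts relevant to VIntWritable and VLongWritable.

The value in VIntWritable is in fact also encoded with writeVLong():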

public static void writeVInt(DataOutput stream, int i) throws IOException {
    writeVLong(stream, i);
}

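First, the figure below compares the serialized sizes:

[Figure: comparison of serialized sizes]

A VIntWritable takes 1 to 5 bytes and a VLongWritable takes 1 to 9 bytes. A value in [-112, 127] is stored in a single byte that holds the value itself. A value outside this range needs more bytes: the first byte encodes the sign and the length, and the remaining bytes store the value.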
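
These size rules can be checked with a small standalone helper. The sketch below is not Hadoop code; the class and method names are made up for illustration, and the logic simply follows the rules stated above: one byte for values in [-112, 127], otherwise one length byte plus as many bytes as the magnitude needs.

```java
class VLongSizeSketch {
    // Number of bytes the variable-length encoding would use for the given value.
    static int encodedSize(long value) {
        if (value >= -112 && value <= 127) {
            return 1;               // the single byte holds the value itself
        }
        if (value < 0) {
            value = ~value;         // one's complement, so the magnitude is non-negative
        }
        int dataBytes = 0;
        while (value != 0) {        // count the bytes needed for the magnitude
            value >>= 8;
            dataBytes++;
        }
        return 1 + dataBytes;       // one length byte plus the value bytes
    }

    public static void main(String[] args) {
        long[] samples = {0, 127, 128, -112, -113, 65536, Integer.MAX_VALUE, Long.MAX_VALUE};
        for (long sample : samples) {
            System.out.println(sample + " -> " + encodedSize(sample) + " byte(s)");
        }
    }
}
```

Under these rules, encodedSize(127) is 1 and encodedSize(128) is 2, whereas a fixed-width int always takes 4 bytes and a long 8. The variable-length form only costs more at the extremes: Integer.MAX_VALUE needs 5 bytes and Long.MAX_VALUE needs 9.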

The figure below shows how writeVLong() operates:

[Figure: the writeVLong() procedure]

The source of WritableUtils.writeVLong() is as follows:

public static void writeVLong(DataOutput stream, long i) throws IOException {
    if (i >= -112 && i <= 127) {
        stream.writeByte((byte)i);
        return; // values from -112 to 127 take a single byte
    }
    int len = -112;
    if (i < 0) {
        i ^= -1L; // take one's complement: flips every bit (same as ~i, i.e. -i - 1), so i becomes non-negative
        len = -120;
    }
    long tmp = i; // at this point i is non-negative, in the range [0, 2^63 - 1]
    while (tmp != 0) { // count the bytes needed: the larger i is, the further len drops below its starting value
        tmp = tmp >> 8;
        len--;
    }
    // len now encodes the length as its offset from the starting value (-112 or -120)
    stream.writeByte((byte)len);
    len = (len < -120) ? -(len + 120) : -(len + 112); // number of value bytes, excluding the first (length) byte
    for (int idx = len; idx != 0; idx--) { // write i to the stream 8 bits at a time, most significant byte first
        int shiftbits = (idx - 1) * 8;
        long mask = 0xFFL << shiftbits;
        stream.writeByte((byte)((i & mask) >> shiftbits));
    }
}

Now that we know how the value is written, let's see how it is read back. This is simply the reverse process. The source of WritableUtils.readVLong() is as follows:

public static long readVLong(DataInput stream) throws IOException {
    byte firstByte = stream.readByte();
    int len = decodeVIntSize(firstByte); // total encoded length, including the first byte
    if (len == 1) {
        return firstByte; // single-byte case: the byte is the value itself
    }
    long i = 0;
    for (int idx = 0; idx < len-1; idx++) { // read the value bytes, most significant first
        byte b = stream.readByte();
        i = i << 8;
        i = i | (b & 0xFF);
    }
    return (isNegativeVInt(firstByte) ? (i ^ -1L) : i); // undo the one's complement for negative values
}

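This reads the first byte, which gives the length, and then reads the value from the input stream one byte at a time. The & 0xFF masks each byte so that a byte with its high bit set is not sign-extended when it is widened for the OR. For a negative value, the final ^ -1L flips every bit, sign bit included, to restore the original number.
WritableUtils.decodeVIntSize() returns the encoded length. Its source is as follows: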

public static int decodeVIntSize(byte value) {
    if (value >= -112) {
        return 1;             // single-byte value
    } else if (value < -120) {
        return -119 - value;  // negative multi-byte value
    }
    return -111 - value;      // non-negative multi-byte value
}

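This is simply the reverse of the procedure in the figure above. The constants -119 and -111 are used so that the result is the total encoded length, including the first byte, rather than the number of value bytes alone.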
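
The encode and decode paths can be exercised together without Hadoop. The sketch below re-implements the logic of the listings above on plain byte arrays; VLongRoundTripSketch, encode() and decode() are hypothetical names, and the code is an approximation for experimenting with the format rather than a replacement for WritableUtils.

```java
import java.io.ByteArrayOutputStream;

class VLongRoundTripSketch {
    // Encode a long with the same rules as writeVLong() in the text.
    static byte[] encode(long i) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        if (i >= -112 && i <= 127) {
            buffer.write((int) i);              // write(int) keeps only the low 8 bits
            return buffer.toByteArray();
        }
        int len = -112;
        if (i < 0) {
            i ^= -1L;                           // one's complement, now non-negative
            len = -120;
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;                              // one step per value byte
        }
        buffer.write(len);                      // first byte: sign and length marker
        int dataBytes = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = dataBytes; idx != 0; idx--) {
            buffer.write((int) (i >> ((idx - 1) * 8)));  // most significant byte first
        }
        return buffer.toByteArray();
    }

    // Decode a value produced by encode(), mirroring readVLong() and decodeVIntSize().
    static long decode(byte[] bytes) {
        byte first = bytes[0];
        int size = (first >= -112) ? 1 : (first < -120 ? -119 - first : -111 - first);
        if (size == 1) {
            return first;
        }
        long value = 0;
        for (int idx = 1; idx < size; idx++) {
            value = (value << 8) | (bytes[idx] & 0xFF);  // mask to avoid sign extension
        }
        boolean negative = first < -120;        // markers -121..-128 denote negative values
        return negative ? ~value : value;
    }
}
```

For example, encode(128) should produce two bytes, the marker -113 followed by 0x80, and decode() should map them back to 128. The same round trip holds at the extremes, where Long.MIN_VALUE and Long.MAX_VALUE each take nine bytes.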

Now back to WritableComparator, mentioned earlier. It implements the RawComparator interface, whose source is very simple:

import java.util.Comparator;

public interface RawComparator<T> extends Comparator<T> {
    // Compare two serialized objects: b1[s1..s1+l1) against b2[s2..s2+l2).
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

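WritableComparator is a factory for RawComparator instances (the Writable implementations register with it). It provides methods for decoding values from raw bytes, most of which are simple; the harder ones are readVInt() and readVLong(), which correspond to the decoding logic discussed above. WritableComparator also provides a default implementation of compare() that deserializes before comparing. If WritableComparator.get() finds no registered Comparator, it creates a new one, which is just an instance of WritableComparator itself. When you then call public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2), it uses the readFields() method of the Writable implementation being compared to read out the values.

For example, VIntWritable is not registered, so get() constructs a WritableComparator instance and sets up key1, key2, buffer and keyClass. When you call compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2), it uses VIntWritable.readFields() to read the values from the encoded byte[] and then compares them.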
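
The lookup-with-fallback behaviour described above can be approximated in a few lines. Everything in this sketch is a simplified stand-in: SketchKey, SketchRawComparator, ShortKey and ComparatorRegistrySketch are hypothetical names, and passing a Supplier to create key instances is an assumption made for brevity (the text describes the real class as working from keyClass instead).

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Local stand-ins for the Hadoop types, so the sketch compiles on its own.
interface SketchKey extends Comparable<SketchKey> {
    void readFields(DataInput in) throws IOException;
}

interface SketchRawComparator {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

// Hypothetical key type for which no raw comparator has been registered.
class ShortKey implements SketchKey {
    short value;

    public void readFields(DataInput in) throws IOException {
        value = in.readShort();
    }

    public int compareTo(SketchKey other) {
        return Short.compare(value, ((ShortKey) other).value);
    }
}

class ComparatorRegistrySketch {
    private static final Map<Class<?>, SketchRawComparator> COMPARATORS = new HashMap<>();

    // Register an optimized comparator for a key class.
    static synchronized void define(Class<?> keyClass, SketchRawComparator comparator) {
        COMPARATORS.put(keyClass, comparator);
    }

    // Return the registered comparator, or a fallback that deserializes both keys first.
    static synchronized SketchRawComparator get(Class<?> keyClass, Supplier<SketchKey> factory) {
        SketchRawComparator registered = COMPARATORS.get(keyClass);
        if (registered != null) {
            return registered;
        }
        return (b1, s1, l1, b2, s2, l2) -> {
            SketchKey key1 = factory.get();
            SketchKey key2 = factory.get();
            read(key1, b1, s1, l1);
            read(key2, b2, s2, l2);
            return key1.compareTo(key2);   // compare the deserialized objects
        };
    }

    // Fill a key from the given slice of a byte array via its readFields().
    private static void read(SketchKey key, byte[] bytes, int start, int length) {
        try {
            key.readFields(new DataInputStream(new ByteArrayInputStream(bytes, start, length)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The fallback allocates and fills two key objects per comparison, which is exactly the cost a registered raw comparator avoids.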

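Next are ArrayWritable and TwoDArrayWritable, which wrap a one-dimensional and a two-dimensional array respectively. As one would expect, each holds an array of Writable along with the element type, and serialization and deserialization delegate to the write() and readFields() methods of the wrapped Writable implementation.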

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.lang.reflect.Array;

public class TwoDArrayWritable implements Writable {
    private Class valueClass;
    private Writable[][] values;

    public TwoDArrayWritable(Class valueClass) {
        this.valueClass = valueClass;
    }
    public TwoDArrayWritable(Class valueClass, Writable[][] values) {
        this(valueClass);
        this.values = values;
    }

    public Object toArray() {
        int dimensions[] = {values.length, 0};
        Object result = Array.newInstance(valueClass, dimensions);
        for (int i = 0; i < values.length; i++) {
            Object resultRow = Array.newInstance(valueClass, values[i].length);
            Array.set(result, i, resultRow);
            for (int j = 0; j < values[i].length; j++) {
                Array.set(resultRow, j, values[i][j]);
            }
        }
        return result;
    }

    public void set(Writable[][] values) { this.values = values; }
    public Writable[][] get() { return values; }

    public void readFields(DataInput in) throws IOException {
        // construct matrix
        values = new Writable[in.readInt()][];
        for (int i = 0; i < values.length; i++) {
            values[i] = new Writable[in.readInt()];
        }
        // construct values
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < values[i].length; j++) {
                Writable value; // construct value
                try {
                    value = (Writable)valueClass.newInstance();
                } catch (InstantiationException e) {
                    throw new RuntimeException(e.toString());
                } catch (IllegalAccessException e) {
                    throw new RuntimeException(e.toString());
                }
                value.readFields(in); // read a value
                values[i][j] = value; // store it in values
            }
        }
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(values.length); // write values
        for (int i = 0; i < values.length; i++) {
            out.writeInt(values[i].length);
        }
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < values[i].length; j++) {
                values[i][j].write(out);
            }
        }
    }
}

That is all there is to it.
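
The wire layout used by the listing (row count, then every row length, then all elements in row order) can be tried out with primitive ints instead of Writable elements. The sketch below is a standalone illustration under that simplification; JaggedIntArraySketch and its two methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class JaggedIntArraySketch {
    // Layout: row count, then each row's length, then the elements row by row.
    static byte[] write(int[][] values) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(values.length);
            for (int[] row : values) {
                out.writeInt(row.length);
            }
            for (int[] row : values) {
                for (int element : row) {
                    out.writeInt(element);
                }
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild the array: allocate the rows first, then fill in the elements.
    static int[][] read(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            int[][] values = new int[in.readInt()][];
            for (int i = 0; i < values.length; i++) {
                values[i] = new int[in.readInt()];
            }
            for (int i = 0; i < values.length; i++) {
                for (int j = 0; j < values[i].length; j++) {
                    values[i][j] = in.readInt();
                }
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing all row lengths up front lets the reader allocate the whole matrix before it touches any element, and rows of different lengths are handled naturally. For the input {{1, 2}, {3}} the output is 24 bytes: 4 for the row count, 8 for the two row lengths and 12 for the three elements.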

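There are further implementations such as TupleWritable, AbstractMapWritable (MapWritable, SortedMapWritable), DBWritable, CompressedWritable, VersionedWritable and GenericWritable. They can be covered when the need arises; they follow much the same pattern and differ only in purpose.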
