Serialization and Writable Implementations

Posted 2020-06-12

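Introduction

Hadoop ships a large family of Writable implementations. This article briefly covers the ones most commonly used for serialization.

Java primitive types

Every Java primitive type except char has a corresponding Writable class, and most of them expose their value through get and set methods.

IntWritable and LongWritable also have variable-length counterparts, VIntWritable and VLongWritable.

Choosing between the fixed-length and variable-length forms is much like choosing between CHAR and VARCHAR in a database: fixed-length encodings have a predictable size, while variable-length encodings save space when most values are small.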
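To make the size trade-off concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class VarIntSketch and its methods are hypothetical names; the variable-length scheme follows my understanding of how Hadoop's variable-length integers are laid out, so treat it as an illustration rather than the library's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VarIntSketch {
    /** Fixed-length form: always 4 bytes, big-endian. */
    public static byte[] fixed(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            new DataOutputStream(out).writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Variable-length form (assumed scheme): 1 byte for small values, marker byte plus payload otherwise. */
    public static byte[] variable(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value >= -112 && value <= 127) {
            out.write((int) value);              // small values fit in a single byte
            return out.toByteArray();
        }
        int len = -112;                          // marker base for positive values
        if (value < 0) {
            value = ~value;                      // negatives are stored as their one's complement
            len = -120;                          // marker base for negative values
        }
        for (long tmp = value; tmp != 0; tmp >>= 8) {
            len--;                               // one step per payload byte
        }
        out.write(len);                          // first byte encodes sign and payload length
        int count = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = count; idx != 0; idx--) {
            out.write((int) (value >> ((idx - 1) * 8)) & 0xFF); // most significant byte first
        }
        return out.toByteArray();
    }

    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(fixed(163)));     // 000000a3
        System.out.println(hex(variable(163)));  // 8fa3
        System.out.println(hex(variable(1)));    // 01
    }
}
```

The fixed form always costs 4 bytes, while the variable form costs 1 byte for small values and grows only as the magnitude does.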

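Text type

Text stores its length as a variable-length int, so a Text value can hold at most 2 GB.

Text uses standard UTF-8 encoding, which makes it easy to exchange data with other text tools. Note, however, that this makes it behave quite differently from Java's String.

Differences in indexing

Text.charAt takes a byte offset and returns an int holding the Unicode code point at that position (or -1 if the offset is out of bounds), whereas String.charAt returns a Java char, that is, a UTF-16 code unit.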

@Test
public void testTextIndex() {
    Text text = new Text("hadoop");
    Assert.assertEquals(6, text.getLength());
    // Holds for this example; in general only the first getLength() bytes of getBytes() are valid.
    Assert.assertEquals(6, text.getBytes().length);
    Assert.assertEquals((int) 'd', text.charAt(2));
    Assert.assertEquals("Out of bounds", -1, text.charAt(100));
}

Text also has a find method, similar to String's indexOf. It returns a zero-based byte offset, or -1 if there is no match.

@Test
public void testTextFind() {
    Text text = new Text("hadoop");
    Assert.assertEquals("Find a substring", 2, text.find("do"));
    Assert.assertEquals("Find first 'o'", 3, text.find("o"));
    Assert.assertEquals("Find 'o' from position 4 or later", 4, text.find("o", 4));
    Assert.assertEquals("No match", -1, text.find("pig"));
}

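Unicode differences

Once a character needs more than one byte in UTF-8, the difference between Text and String becomes obvious: String counts in Java chars (UTF-16 code units), while Text counts in bytes.

Consider four Unicode characters that take 1 to 4 bytes in UTF-8: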

[Image omitted: table of the Unicode characters U+0041, U+00DF, U+6771 and U+10400]

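The four characters take 1, 2, 3 and 4 bytes in UTF-8 respectively. U+10400 takes two chars in a Java String (a surrogate pair), while each of the first three takes one char. The following tests show how String and Text differ.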

@Test
public void string() throws UnsupportedEncodingException {
    String str = "\u0041\u00DF\u6771\uD801\uDC00";
    // Lengths: 5 chars, but 1 + 2 + 3 + 4 = 10 bytes in UTF-8.
    Assert.assertEquals(5, str.length());
    Assert.assertEquals(10, str.getBytes("UTF-8").length);
    // indexOf works in char offsets.
    Assert.assertEquals(0, str.indexOf("\u0041"));
    Assert.assertEquals(1, str.indexOf("\u00DF"));
    Assert.assertEquals(2, str.indexOf("\u6771"));
    Assert.assertEquals(3, str.indexOf("\uD801\uDC00"));
    // charAt returns single chars, so the surrogate pair is split in two.
    Assert.assertEquals('\u0041', str.charAt(0));
    Assert.assertEquals('\u00DF', str.charAt(1));
    Assert.assertEquals('\u6771', str.charAt(2));
    Assert.assertEquals('\uD801', str.charAt(3));
    Assert.assertEquals('\uDC00', str.charAt(4));
    // codePointAt returns whole code points, indexed by char offset.
    Assert.assertEquals(0x0041, str.codePointAt(0));
    Assert.assertEquals(0x00DF, str.codePointAt(1));
    Assert.assertEquals(0x6771, str.codePointAt(2));
    Assert.assertEquals(0x10400, str.codePointAt(3));
}

@Test
public void text() {
    Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");
    // getLength is the UTF-8 byte count.
    Assert.assertEquals(10, text.getLength());
    // find works in byte offsets.
    Assert.assertEquals(0, text.find("\u0041"));
    Assert.assertEquals(1, text.find("\u00DF"));
    Assert.assertEquals(3, text.find("\u6771"));
    Assert.assertEquals(6, text.find("\uD801\uDC00"));
    // charAt returns whole code points, indexed by byte offset.
    Assert.assertEquals(0x0041, text.charAt(0));
    Assert.assertEquals(0x00DF, text.charAt(1));
    Assert.assertEquals(0x6771, text.charAt(3));
    Assert.assertEquals(0x10400, text.charAt(6));
}

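The comparison makes the differences clear:

1. String.length() returns the number of chars, while Text.getLength() returns the number of bytes.

2. String.indexOf() returns an offset in chars, while Text.find() returns an offset in bytes.

3. String.charAt() returns a Java char (a UTF-16 code unit), not necessarily a whole Unicode character.

4. String.codePointAt() is similar to Text.charAt(), but the former takes a char offset and the latter a byte offset.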
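The two offset systems can be compared side by side without Hadoop. The sketch below uses only the JDK; OffsetComparison is a hypothetical class name, and it simply computes for each code point the char offset a String would report and the UTF-8 byte offset a byte-oriented type such as Text would report.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetComparison {
    /** Returns one row per code point: {codePoint, charOffset, utf8ByteOffset}. */
    public static List<int[]> offsets(String s) {
        List<int[]> rows = new ArrayList<>();
        int byteOffset = 0;
        int charOffset = 0;
        while (charOffset < s.length()) {
            int cp = s.codePointAt(charOffset);
            rows.add(new int[] {cp, charOffset, byteOffset});
            // Advance the byte offset by the UTF-8 length of this code point.
            String one = new String(Character.toChars(cp));
            byteOffset += one.getBytes(StandardCharsets.UTF_8).length;
            // Advance the char offset by 1, or by 2 for a surrogate pair.
            charOffset += Character.charCount(cp);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (int[] row : offsets("\u0041\u00DF\u6771\uD801\uDC00")) {
            System.out.printf("U+%04X char=%d byte=%d%n", row[0], row[1], row[2]);
        }
    }
}
```

For the four sample characters the char offsets are 0, 1, 2, 3 and the byte offsets are 0, 1, 3, 6, matching the assertions in the two tests above.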

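Iterating over Text

Iterating over the Unicode characters in a Text is fairly involved, because the number of bytes per character varies and you cannot simply increment an index. Wrap the Text's bytes in a ByteBuffer, then call the static method Text.bytesToCodePoint repeatedly; each call returns the next code point and advances the buffer. Sample code: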

package com.sweetop.styhadoop;

import org.apache.hadoop.io.Text;

import java.nio.ByteBuffer;

public class TextIterator {
    public static void main(String[] args) {
        Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        // Wrap only the valid bytes: the backing array may be longer than getLength().
        ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(), 0, text.getLength());
        int cp;
        // bytesToCodePoint advances the buffer past each decoded character.
        while (buffer.hasRemaining() && (cp = Text.bytesToCodePoint(buffer)) != -1) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
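To show why a plain index increment does not work, here is a rough JDK-only sketch of the kind of decoding such a loop has to do. Utf8CodePoints and its methods are hypothetical names, the decoder assumes well-formed UTF-8, and it is not Hadoop's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf8CodePoints {
    /** Decodes one code point and advances the buffer; returns -1 on an invalid lead byte. */
    public static int next(ByteBuffer buf) {
        int lead = buf.get() & 0xFF;
        int extra;   // number of continuation bytes that follow
        int cp;      // payload bits taken from the lead byte
        if (lead < 0x80) {
            return lead;                                  // 1-byte sequence (ASCII)
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F;                  // 2-byte sequence
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F;                  // 3-byte sequence
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07;                  // 4-byte sequence
        } else {
            return -1;
        }
        for (int i = 0; i < extra; i++) {
            cp = (cp << 6) | (buf.get() & 0x3F);          // each continuation byte adds 6 bits
        }
        return cp;
    }

    /** Collects every code point of the string's UTF-8 encoding. */
    public static List<Integer> all(String s) {
        ByteBuffer buf = ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
        List<Integer> result = new ArrayList<>();
        while (buf.hasRemaining()) {
            int cp = next(buf);
            if (cp == -1) {
                break;
            }
            result.add(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : all("\u0041\u00DF\u6771\uD801\uDC00")) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

Each call consumes between one and four bytes, which is why the buffer position, not a loop counter, has to track progress.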

Modifying Text

Apart from NullWritable, which is immutable, Writables are mutable. You can reuse a Text instance by calling set to change its value.

@Test
public void testTextMutability() {
    Text text = new Text("hadoop");
    text.set("pig");
    Assert.assertEquals(3, text.getLength());
    // As asserted in the original article; in general getBytes() may return
    // a backing array longer than getLength().
    Assert.assertEquals(3, text.getBytes().length);
}

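Note: Text is a special case when reading the value: use toString(). Most other Writables provide get and set methods.

BytesWritable type

BytesWritable wraps an array of binary data. Its serialized form starts with a 4-byte integer giving the number of bytes (unlike Text, which starts with a variable-length int), followed by the bytes themselves. For example: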

// SerializeUtils is a helper that is not shown in this article.
@Test
public void testBytesWritableSerializedFormat() throws IOException {
    BytesWritable bytesWritable = new BytesWritable(new byte[]{3, 5});
    byte[] bytes = SerializeUtils.serialize(bytesWritable);
    // 00000002 is the 4-byte length, 03 05 are the two data bytes.
    Assert.assertEquals("000000020305", StringUtils.byteToHexString(bytes));
}
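The same layout can be reproduced with the JDK alone, which may help when inspecting serialized bytes by hand. This is a sketch under the assumption that the format is exactly "4-byte big-endian length, then raw bytes" as described above; LengthPrefixedBytes is a hypothetical class, not part of Hadoop.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class LengthPrefixedBytes {
    /** Writes a 4-byte big-endian length followed by the data itself. */
    public static byte[] encode(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(data.length);   // fixed 4-byte length prefix
            out.write(data);             // payload
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(new byte[] {3, 5})));  // 000000020305
    }
}
```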

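Like Text, BytesWritable is mutable and can be changed with set. getLength() returns the number of valid bytes, whereas getBytes().length returns the capacity of the backing array, which may be larger: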

bytesWritable.setCapacity(11);
bytesWritable.setSize(4);
Assert.assertEquals(4,bytesWritable.getLength());
Assert.assertEquals(11,bytesWritable.getBytes().length);

NullWritable type

NullWritable is a special Writable whose serialized form is zero bytes long; it acts purely as a placeholder. In MapReduce, declare a key or value as NullWritable when you do not need it.

package com.sweetop.styhadoop;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

public class TestNullWritable {
    public static void main(String[] args) throws IOException {
        // NullWritable is a singleton; there is no public constructor.
        NullWritable nullWritable = NullWritable.get();
        // Prints an empty line, because nothing is written during serialization.
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(nullWritable)));
    }
}

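Note: obtain the NullWritable instance with NullWritable.get().

ObjectWritable type

ObjectWritable is a general-purpose wrapper for Java primitives, String, enum, Writable, null, or arrays of any of these. It is useful when a field can hold more than one type. The drawback is its size: even for a single letter it must also store the type that was wrapped. For example: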

package com.sweetop.styhadoop;

import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

public class TestObjectWritable {
    public static void main(String[] args) throws IOException {
        Text text = new Text("\u0041");
        // Wrap the single-letter Text and print its serialized form as hex.
        ObjectWritable objectWritable = new ObjectWritable(text);
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(objectWritable)));
    }
}

We stored a single letter, yet the serialized result is:

00196f72672e6170616368652e6861646f6f702e696f2e5465787400196f72672e6170616368652e6861646f6f702e696f2e546578740141

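That wastes a lot of space: the fully qualified class name is written twice before the payload. ObjectWritable is not recommended here; prefer GenericWritable.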
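To see where the bytes go, the hex dump above can be picked apart with the JDK alone. The following is only a sketch: ObjectWritableDump is a hypothetical class, and it assumes each class name is stored as a 2-byte length followed by ASCII bytes, which is what the dump suggests and which DataInputStream.readUTF can read.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ObjectWritableDump {
    // Hex for "org.apache.hadoop.io.Text" (25 bytes), as it appears in the dump.
    static final String NAME = "6f72672e6170616368652e6861646f6f702e696f2e54657874";
    // The full dump from the article: length + name, length + name, then the Text payload.
    public static final String HEX = "0019" + NAME + "0019" + NAME + "0141";

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Returns {first name, second name, number of payload bytes left}. */
    public static String[] decode(String hex) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fromHex(hex)))) {
            String first = in.readUTF();     // 2-byte length, then the name
            String second = in.readUTF();    // the same name again
            int payload = in.available();    // what remains is the wrapped value
            return new String[] {first, second, String.valueOf(payload)};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] parts = decode(HEX);
        System.out.println(parts[0]);                       // org.apache.hadoop.io.Text
        System.out.println(parts[1]);                       // org.apache.hadoop.io.Text
        System.out.println(parts[2] + " payload bytes");    // 2 payload bytes
    }
}
```

Of the 56 bytes in the dump, 54 are type information and only 2 (the length byte 01 and the letter 41) are the Text value itself.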

GenericWritable type

To use GenericWritable, subclass it and override getTypes to specify which types it supports. For example:

package com.sweetop.styhadoop;

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

class MyWritable extends GenericWritable {
    // The position of a class in this array is the index written when serializing.
    @SuppressWarnings("unchecked")
    public static final Class<? extends Writable>[] CLASSES =
            (Class<? extends Writable>[]) new Class[] { Text.class };

    MyWritable(Writable writable) {
        set(writable);
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return CLASSES;
    }
}

Test class:

package com.sweetop.styhadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

public class TestGenericWritable {
    public static void main(String[] args) throws IOException {
        Text text = new Text("\u0041\u0071");
        MyWritable myWritable = new MyWritable(text);
        // Print the plain Text first, then the same Text wrapped in MyWritable.
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(text)));
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(myWritable)));
    }
}

The output is:

024171
00024171

GenericWritable serializes only the index of the type in the types array ahead of the value, which saves a lot of space compared with ObjectWritable. It is therefore the recommended choice.
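The idea of tagging a value with a one-byte index into a fixed type table can be sketched without Hadoop. Everything below is hypothetical (the class TypeIndexSketch, its type table, and its payload encoding); in particular the single length byte for strings stands in for a variable-length int and only works for short strings.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class TypeIndexSketch {
    // The position in this table is the tag written before each value.
    static final Class<?>[] TYPES = {String.class, Integer.class};

    static int indexOf(Class<?> type) {
        for (int i = 0; i < TYPES.length; i++) {
            if (TYPES[i] == type) {
                return i;
            }
        }
        throw new IllegalArgumentException("unregistered type: " + type.getName());
    }

    /** Writes one tag byte, then a payload whose format depends on the type. */
    public static byte[] encode(Object value) {
        int index = indexOf(value.getClass());
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeByte(index);                         // 1 byte instead of a class name
            if (value instanceof String) {
                byte[] utf8 = ((String) value).getBytes(StandardCharsets.UTF_8);
                if (utf8.length > 127) {
                    throw new IllegalArgumentException("sketch handles short strings only");
                }
                out.writeByte(utf8.length);               // simplified length prefix
                out.write(utf8);
            } else {
                out.writeInt((Integer) value);            // fixed 4-byte int
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("Aq")));   // 00024171
        System.out.println(hex(encode(7)));      // 0100000007
    }
}
```

The trade-off is that writer and reader must agree on the same type table in the same order, since the index alone carries no class name.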

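Writable collection types

ArrayWritable and TwoDArrayWritable

ArrayWritable and TwoDArrayWritable are Writable types for arrays and two-dimensional arrays (arrays of arrays). The element type can be specified in two ways: pass it to the constructor, or subclass ArrayWritable and fix the type there. The same applies to TwoDArrayWritable.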

package com.sweetop.styhadoop;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

import java.io.IOException;

public class TestArrayWritable {
    public static void main(String[] args) throws IOException {
        // The element type is passed to the constructor.
        ArrayWritable arrayWritable = new ArrayWritable(Text.class);
        arrayWritable.set(new Writable[]{new Text("\u0071"), new Text("\u0041")});
        System.out.println(StringUtils.byteToHexString(SerializeUtils.serialize(arrayWritable)));
    }
}

Output:

0000000201710141

As shown, ArrayWritable begins with a 4-byte integer giving the array length, followed by the elements one after another.
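Reading the dump back makes the layout easy to verify. The sketch below is JDK-only and hypothetical (ArrayLayoutDecoder is not a Hadoop class); it assumes each element is a short string whose length fits in a single byte, which holds for this output but not for long Text values.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ArrayLayoutDecoder {
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Reads a 4-byte element count, then that many length-prefixed UTF-8 strings. */
    public static List<String> decode(String hex) {
        List<String> items = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fromHex(hex)))) {
            int count = in.readInt();                 // 00000002 in the dump above
            for (int i = 0; i < count; i++) {
                int length = in.readByte();           // single-byte length (short elements only)
                byte[] utf8 = new byte[length];
                in.readFully(utf8);
                items.add(new String(utf8, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(decode("0000000201710141"));  // [q, A]
    }
}
```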

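ArrayPrimitiveWritable is similar, but wraps arrays of Java primitives; it does not need a subclass to fix the element type.

MapWritable and SortedMapWritable

MapWritable corresponds to Map and SortedMapWritable to SortedMap. The serialized form starts with 4 bytes holding the number of entries, and each element is preceded by one byte holding an index into the array of types.

There are no Writable types for Set or List, but they can be emulated. Use MapWritable for a Set and SortedMapWritable for a SortedSet by setting the values to NullWritable, which takes up no space. A List whose elements share one type can be an ArrayWritable; a List with mixed types can wrap each element in a GenericWritable and store those in an ArrayWritable. Alternatively, a MapWritable can act as a List, with the index as the key and the list element as the value.

Note: other types such as DoubleWritable are straightforward and are not covered here.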

Source: http://blog.csdn.net/lastsweetop/article/details/9249411
Code download: https://github.com/lastsweetop/styhadoop
