Multiple rows insertion in HBase using MapReduce

Posted 2023-04-17

【Title】: Multiple rows insertion in HBase using MapReduce 【Published】: 2016-06-21 10:50:01 【Question】:

I want to insert N rows into an HBase table in batch from each mapper. I currently know of two ways to do this:

1) put(List<Put> puts) on an HTable instance, together with the autoFlush setting

2) context.write(rowKey, put)

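Which one is better?

In the first way, context.write() is not required, because the hTable.put(putsList) method puts the data into the table directly. My mapper class extends Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>, so which classes should I use for KEYOUT and VALUEOUT?

In the second way, I have to call context.write(rowKey, put) N times. Is there any way to pass a list of Put operations to context.write()?

Is there any other way to do this with MapReduce?

Thanks in advance.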

【Comments】:

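Why a single mapper, and why not multiple mappers? How are you specifying the number of mappers? Even if you specify it, that is only a suggestion to the framework, and there is no guarantee that the number of mappers will be one. You can change the number of mappers with setNumMapTasks or conf.set('mapred.map.tasks','numberofmappersyouwanttoset'), but that is only a hint in the configuration, and there is no guarantee that this many mapper instances will be created. It also depends on the input splits. See ***.com/questions/37239944/…

Please see my detailed answer, and feel free to ask questions.

For your first way, the mapper could look like "public class HBasePutOrDeleteMapper extends TableMapper". I don't know where you got the example from. (Quoting the question: "My mapper class is extending Class Mapper, so what classes should I use for KEYOUT and VALUEOUT".)

【Answer 1】: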

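I prefer the second option, where batching comes naturally for MapReduce (no list of Puts is needed). For the details, see my second point.

1) Your first option, List<Put>, is typically used in a standalone HBase Java client. Internally it is governed by hbase.client.write.buffer, which is set as shown below in one of your configuration XML files: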

<property>
         <name>hbase.client.write.buffer</name>
         <value>20971520</value> <!-- 20971520 bytes = 20 MB -->
 </property>

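The default value is 2 MB (2097152 bytes). Once the buffer fills up, all buffered Puts are flushed and actually written to your table. This is the same mechanism as the BufferedMutator explained in #2.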
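To make the flush behaviour concrete, here is a minimal sketch of a size-bounded write buffer in plain Java. It is only a simplified model and not HBase code: the class WriteBufferSketch and all of its methods are hypothetical, rows are represented by raw byte arrays, and a "flush" merely counts the batch. It assumes a flush is triggered once the queued bytes exceed the configured limit.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified model of a size-bounded client write buffer.
 * NOT HBase code: class and method names are hypothetical.
 */
final class WriteBufferSketch {
    private final long maxBytes;                 // plays the role of hbase.client.write.buffer
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int flushCount = 0;
    private int rowsWritten = 0;

    WriteBufferSketch(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Queue one serialized row; flush once the queued size exceeds the limit. */
    void add(byte[] row) {
        pending.add(row);
        pendingBytes += row.length;
        if (pendingBytes > maxBytes) {
            flush();
        }
    }

    /** Send everything queued so far as one batch (here it is only counted). */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        rowsWritten += pending.size();
        flushCount++;
        pending.clear();
        pendingBytes = 0;
    }

    int flushCount() { return flushCount; }
    int rowsWritten() { return rowsWritten; }
    int pendingRows() { return pending.size(); }

    public static void main(String[] args) {
        WriteBufferSketch buffer = new WriteBufferSketch(2L * 1024 * 1024); // 2 MB limit
        byte[] row = new byte[1024];                                         // 1 KB per row
        for (int i = 0; i < 5000; i++) {
            buffer.add(row);
        }
        buffer.flush(); // final flush, as a client would do when closing
        System.out.println(buffer.flushCount() + " batches, " + buffer.rowsWritten() + " rows");
    }
}
```

With a 2 MB limit and 1 KB rows, main prints "3 batches, 5000 rows": two size-triggered flushes of 2049 rows each, plus the final explicit flush of the remaining 902 rows. A larger limit means fewer, larger batches at the cost of more client memory.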

2) Regarding the second option, look at the TableOutputFormat documentation:

org.apache.hadoop.hbase.mapreduce
Class TableOutputFormat<KEY>

java.lang.Object
org.apache.hadoop.mapreduce.OutputFormat<KEY,Mutation>
org.apache.hadoop.hbase.mapreduce.TableOutputFormat<KEY>
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable

@InterfaceAudience.Public
@InterfaceStability.Stable
public class TableOutputFormat<KEY>
extends org.apache.hadoop.mapreduce.OutputFormat<KEY,Mutation>
implements org.apache.hadoop.conf.Configurable
Convert Map/Reduce output and write it to an HBase table. The KEY is ignored while the output value must be either a Put or a Delete instance.

The same thing can be seen in the code, shown below.

/**
     * Writes a key/value pair into the table.
     *
     * @param key  The key.
     * @param value  The value.
     * @throws IOException When writing fails.
     * @see RecordWriter#write(Object, Object)
     */
    @Override
    public void write(KEY key, Mutation value)
    throws IOException {
      if (!(value instanceof Put) && !(value instanceof Delete)) {
        throw new IOException("Pass a Delete or a Put");
      }
      mutator.mutate(value);
    }

Conclusion: a context.write(rowkey, putlist) call is not possible with this API.

However, the BufferedMutator documentation (see mutator.mutate in the code above) says:

Map/reduce jobs benefit from batching, but have no natural flush point. BufferedMutator receives the puts from the M/R job and will batch puts based on some heuristic, such as the accumulated size of the puts, and submit batches of puts asynchronously so that the M/R logic can continue without interruption.

So, as said above, batching happens naturally for you (through BufferedMutator).
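As an illustration of the second option, here is a hedged sketch of a mapper that emits several Puts per input record by calling context.write() once per Put. It also suggests one possible answer to the KEYOUT/VALUEOUT question: ImmutableBytesWritable and Put. The class name MultiPutMapper, the helper buildPuts, the column family "cf", the qualifier "value", the row-key scheme and the comma-separated input format are all assumptions made for the example. It needs the Hadoop MapReduce and HBase client libraries (HBase 1.0 or later for Put.addColumn) on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Emits one Put per comma-separated field of each input line. */
class MultiPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    // Hypothetical column family and qualifier, chosen for illustration only.
    private static final byte[] FAMILY = Bytes.toBytes("cf");
    private static final byte[] QUALIFIER = Bytes.toBytes("value");

    /** Build N Puts from one input record; row keys are "<offset>-<index>". */
    static List<Put> buildPuts(long offset, String line) {
        List<Put> puts = new ArrayList<>();
        String[] fields = line.split(",");
        for (int i = 0; i < fields.length; i++) {
            Put put = new Put(Bytes.toBytes(offset + "-" + i));
            put.addColumn(FAMILY, QUALIFIER, Bytes.toBytes(fields[i]));
            puts.add(put);
        }
        return puts;
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One context.write() per Put. TableOutputFormat ignores the key and
        // passes each Put to its BufferedMutator, which does the batching.
        for (Put put : buildPuts(key.get(), value.toString())) {
            context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
    }
}
```

In the driver, a map-only job of this kind is commonly wired up with TableMapReduceUtil.initTableReducerJob("my_table", null, job) followed by job.setNumReduceTasks(0), where "my_table" is a placeholder table name. No explicit list or manual flush is needed in the mapper.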
