airbnb 开源reAir 工具 用法及源码解析

Posted jiangxiaoxian

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了airbnb 开源reAir 工具 用法及源码解析相关的知识,希望对你有一定的参考价值。

reAir 有批量复制与增量复制功能 今天我们先来看看批量复制功能

批量复制使用方式:

cd reair
./gradlew shadowjar -p main -x test
# 如果是本地table-list 一样要加file:/// ; 如果直接写  --table-list ~/reair/local_table_list ,此文件必须在hdfs上!
hadoop jar main/build/libs/airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.batch.hive.MetastoreReplicationJob --config-files my_config_file.xml --table-list file:///reair/local_table_list

1.table_list 内容

db_name1.table_name1
db_name1.table_name2
db_name2.table_name3
...
  1. my_config_file.xml 配置
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>airbnb.reair.clusters.src.name</name>
    <value>ns8</value>
    <comment>
      Name of the source cluster. It can be an arbitrary string and is used in
      logs, tags, etc.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.metastore.url</name>
    <value>thrift://192.168.200.140:9083</value>
    <comment>Source metastore Thrift URL.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.hdfs.root</name>
    <value>hdfs://ns8/</value>
    <comment>Source cluster HDFS root. Note trailing slash.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.hdfs.tmp</name>
    <value>hdfs://ns8/tmp</value>
    <comment>
      Directory for temporary files on the source cluster.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.name</name>
    <value>ns1</value>
    <comment>
      Name of the destination cluster. It can be an arbitrary string and is used in
      logs, tags, etc.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.metastore.url</name>
    <value>thrift://dev04:9083</value>
    <comment>Destination metastore Thrift URL.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.hdfs.root</name>
    <value>hdfs://ns1/</value>
    <comment>Destination cluster HDFS root. Note trailing slash.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.hdfs.tmp</name>
    <value>hdfs://ns1/tmp</value>
    <comment>
      Directory for temporary files on the source cluster. Table / partition
      data is copied to this location before it is moved to the final location,
      so it should be on the same filesystem as the final location.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.batch.output.dir</name>
    <value>/tmp/replica</value>
    <comment>
      This configuration must be provided. It gives location to store each stage
      MR job output.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.batch.metastore.blacklist</name>
    <value>testdb:test.*,tmp_.*:.*</value>
    <comment>
      Comma separated regex blacklist. dbname_regex:tablename_regex,...
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.metastore.parallelism</name>
    <value>5</value>
    <comment>
      The parallelism to use for jobs requiring metastore calls. This translates to the number of
      mappers or reducers in the relevant jobs.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.copy.parallelism</name>
    <value>5</value>
    <comment>
      The parallelism to use for jobs that copy files. This translates to the number of reducers
      in the relevant jobs.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.overwrite.newer</name>
    <value>true</value>
    <comment>
      Whether the batch job will overwrite newer tables/partitions on the destination. Default is true.
    </comment>
  </property>
 
  <property>
    <name>mapreduce.map.speculative</name>
    <value>false</value>
    <comment>
      Speculative execution is currently not supported for batch replication.
    </comment>
  </property>

  <property>
    <name>mapreduce.reduce.speculative</name>
    <value>false</value>
    <comment>
      Speculative execution is currently not supported for batch replication.
    </comment>
  </property>

</configuration>

一、批量复制

批量复制有三个步骤(stage)

1.读取用户配置的table-list(及从src元数据获得对应表的分区),shuffle 到各个reduce中,reduce读取 metastore及集群信息做好拷贝文件的映射关系写到hdfs中。
2.遍历第一个mr生成作业列表,根据路径shuffle到不同reduce,执行复制。
3.处理hive metastore 提交逻辑只用map

官方图:

技术分享图片

以上是关于airbnb 开源reAir 工具 用法及源码解析的主要内容,如果未能解决你的问题,请参考以下文章

音视频开发之旅(62) -Lottie 源码分析之json解析

音视频开发之旅(62) -Lottie 源码分析之json解析

音视频开发之旅(62) -Lottie 源码分析之json解析

AndFix Bug 热修复框架原理及源码解析

Vue-Router 源码解析 router-link组件的用法及原理

原Android热更新开源项目Tinker源码解析系列之二:资源文件热更新