Migrating Data Across Hadoop Clusters (Consolidated Edition)

Posted by 牧梦者

1. What is DistCp

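  DistCp (distributed copy) is a tool for large-scale copying within and between clusters. It uses Map/Reduce to handle file distribution, error handling and recovery, and report generation. It takes a list of files and directories as the input to the map tasks, and each task copies a portion of the files in the source list. Because it is built on Map/Reduce, the tool has some peculiarities in both semantics and execution.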

1.1 Notes on using DistCp

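  1. DistCp tries to split the data to be copied evenly, so that each map copies roughly the same number of bytes. However, because a file is the smallest unit of copying, raising the configured number of simultaneous copiers (that is, maps) does not necessarily raise the number of copies actually running at once, or the total throughput.

  2. If -m is not specified, DistCp tries to schedule min(total_bytes / bytes.per.map, 20 * num_task_trackers) maps, where bytes.per.map defaults to 256MB.

  3. For long-running or regularly scheduled jobs, tune the number of maps to the sizes of the source and destination clusters, the amount of data to copy, and the available bandwidth.

  4. To copy between different Hadoop versions, use HftpFileSystem. It is a read-only file system, so DistCp must run on the destination cluster (more precisely, on TaskTrackers that can write to the destination cluster). The source is given as hftp://<dfs.http.address>/<path>, where dfs.http.address defaults to <namenode>:50070. Note that HFTP was removed in Hadoop 3.x; on those versions webhdfs:// is the usual read path for cross-version copies.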
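
To make the formula in note 2 concrete, here is a minimal sketch that evaluates it for a given data size and cluster size. The function name estimate_maps and the clamp to a minimum of one map are assumptions added for illustration; they are not part of Hadoop, and the real scheduler may differ between DistCp versions.

```shell
# estimate_maps is a hypothetical helper (not part of Hadoop). It evaluates
#   min(total_bytes / bytes.per.map, 20 * num_task_trackers)
# using integer arithmetic.
estimate_maps() {
  total_bytes="$1"
  num_task_trackers="$2"
  bytes_per_map="${3:-$((256 * 1024 * 1024))}"   # 256MB default quoted in note 2

  by_size=$(( total_bytes / bytes_per_map ))
  by_nodes=$(( 20 * num_task_trackers ))

  if [ "$by_size" -lt "$by_nodes" ]; then
    maps="$by_size"
  else
    maps="$by_nodes"
  fi

  # Assumption: a copy needs at least one map, so clamp the result to 1.
  if [ "$maps" -lt 1 ]; then
    maps=1
  fi
  echo "$maps"
}

# Example: 100 GiB of data on a cluster with 3 TaskTrackers.
estimate_maps $((100 * 1024 * 1024 * 1024)) 3
```

With 100 GiB of data the size term gives 400 maps, but three TaskTrackers cap the result at 60, so adding data beyond that point no longer adds parallelism unless -m is set explicitly.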

2. Hadoop DistCp command-line usage

[root@node105 ~]# hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                       Reuse existing data in target files and
                               append new data to them if possible
 -async                        Should distcp execution be blocking
 -atomic                       Commit all changes or none
 -bandwidth <arg>              Specify bandwidth per map in MB
 -blocksperchunk <arg>         If set to a positive value, fileswith more
                               blocks than this value will be split into
                               chunks of <blocksperchunk> blocks to be
                               transferred in parallel, and reassembled on
                               the destination. By default,
                               <blocksperchunk> is 0 and the files will be
                               transmitted in their entirety without
                               splitting. This switch is only applicable
                               when the source file system implements
                               getBlockLocations method and the target
                               file system implements concat method
 -copybuffersize <arg>         Size of the copy buffer to use. By default
                               <copybuffersize> is 8192B.
 -delete                       Delete from target, files missing in source
 -diff <arg>                   Use snapshot diff report to identify the
                               difference between source and target
 -f <arg>                      List of files that need to be copied
 -filelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n
 -filters <arg>                The path to a file containing a list of
                               strings for paths to be excluded from the
                               copy.
 -i                            Ignore failures during copy
 -log <arg>                    Folder on DFS where distcp execution logs
                               are saved
 -m <arg>                      Max number of concurrent maps to use for
                               copy
 -mapredSslConf <arg>          Configuration for ssl config file, to use
                               with hftps://. Must be in the classpath.
 -numListstatusThreads <arg>   Number of threads to use for building file
                               listing (max 40).
 -overwrite                    Choose to overwrite target files
                               unconditionally, even if they exist.
 -p <arg>                      preserve status (rbugpcaxt)(replication,
                               block-size, user, group, permission,
                               checksum-type, ACL, XATTR, timestamps). If
                               -p is specified with no <arg>, then
                               preserves replication, block size, user,
                               group, permission, checksum type and
                               timestamps. raw.* xattrs are preserved when
                               both the source and destination paths are
                               in the /.reserved/raw hierarchy (HDFS
                               only). raw.* xattrpreservation is
                               independent of the -p flag. Refer to the
                               DistCp documentation for more details.
 -rdiff <arg>                  Use target snapshot diff report to identify
                               changes made on target
 -sizelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n bytes
 -skipcrccheck                 Whether to skip CRC checks between source
                               and target paths.
 -strategy <arg>               Copy strategy to use. Default is dividing
                               work based on file sizes
 -tmp <arg>                    Intermediate work path to be used for
                               atomic commit
 -update                       Update target, copying only missingfiles or
                               directories
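
As an illustration of how several of these options combine, the sketch below assembles an incremental sync command and only prints it (a dry run). The helper name build_sync_cmd and the default values for -m and -bandwidth are assumptions chosen for the example; the host names are the ones used in the test case later in this post.

```shell
# build_sync_cmd is a hypothetical helper: it prints a DistCp command
# without running it, so the options can be reviewed first.
#   -update     copy only files that are missing or differ on the target
#   -delete     remove target files that no longer exist in the source
#   -pugp       preserve user, group and permission
#   -m          maximum number of concurrent maps
#   -bandwidth  bandwidth per map, in MB
build_sync_cmd() {
  src="$1"
  dst="$2"
  maps="${3:-20}"          # assumed default, tune per cluster
  bandwidth_mb="${4:-10}"  # assumed default, tune per network
  echo "hadoop distcp -update -delete -pugp -m ${maps} -bandwidth ${bandwidth_mb} ${src} ${dst}"
}

build_sync_cmd hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data/23
```

Keep in mind that with -update the contents of the source directory are synchronized into the target directory itself, and -delete removes anything on the target that is absent from the source, so it is worth reviewing the printed command before running it.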

3. Test case

  1. Check the files that are going to be migrated:

[root@calculation101 ~]# hdfs dfs -du -h /test/2018/10/

  2. Create the test directory on the new cluster:

[hdfs@node105 root]$ 
[hdfs@node105 root]$ hdfs dfs -mkdir -p /yangjianqiu/data/
[hdfs@node105 root]$ 
[hdfs@node105 root]$ hdfs dfs -chown -R root:root  /yangjianqiu/data/  
[hdfs@node105 root]$ 
[hdfs@node105 root]$ exit 
exit
[root@node105 ~]# 
[root@node105 ~]# hdfs dfs -ls /yangjianqiu
Found 1 items
drwxr-xr-x   - root root          0 2018-10-29 03:29 /yangjianqiu/data

  3. Start the migration, writing the output to a log file and recording the elapsed time:

[root@node105 ~]# mkdir /yangjianqiu
[root@node105 ~]# nohup time hadoop distcp hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data >> /yangjianqiu/distcp.log 2>&1 &
[1] 11125
[root@node105 ~]# jobs
[1]+ Running nohup time hadoop distcp hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data >> /yangjianqiu/distcp.log 2>&1 &
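
Once the background job has finished, one way to sanity-check the copy is to compare the file count and total size on both sides. The sketch below is not from the original test; counts_match is a hypothetical helper that compares the second and third columns (FILE_COUNT and CONTENT_SIZE) of two `hdfs dfs -count` output lines. The destination path in the usage comment assumes the copied directory landed under /yangjianqiu/data/23.

```shell
# counts_match is a hypothetical helper: it succeeds when two lines of
# `hdfs dfs -count` output (DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME)
# report the same file count and the same content size.
counts_match() {
  src_sig=$(printf '%s\n' "$1" | awk '{print $2 ":" $3}')
  dst_sig=$(printf '%s\n' "$2" | awk '{print $2 ":" $3}')
  # Reject empty input, which would otherwise compare ":" with ":".
  [ "$src_sig" != ":" ] && [ "$src_sig" = "$dst_sig" ]
}

# Usage on a live cluster (the destination path is an assumption):
#   src=$(hdfs dfs -count hdfs://calculation101:8020/test/2018/10/23)
#   dst=$(hdfs dfs -count hdfs://node105:8020/yangjianqiu/data/23)
#   if counts_match "$src" "$dst"; then echo "counts match"; else echo "mismatch"; fi
```

Matching counts and sizes do not prove the bytes are identical, but DistCp already compares checksums during the copy unless -skipcrccheck is given, so this check mainly catches missing or partially copied files.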

4. Calling the DistCp interface from an application

Summary

