How to back up Hadoop data
Editor's note: this article was compiled by the editors of 小常识网 (cha138.com); it mainly covers how to back up Hadoop data and will hopefully be of some reference value.
Answer A: Hadoop already stores three copies of its data by default, so normally no extra backup is needed. What is the backup for in your case?
Answer B: What Hadoop replicates are data blocks: each file is split into blocks of up to 128 MB (the final block may be smaller), and each block is replicated across the worker nodes, with no node holding two replicas of the same block.
If the replication factor is 3, every block of the file is copied to three different worker nodes. When a worker node fails, the failure is reported back to the NameNode, which re-replicates the blocks that node held onto other nodes, so the replica count stays at the configured value.
This is what keeps data from being lost and is the essence of Hadoop's fault tolerance: a worker node is allowed to fail, because the other nodes still hold replicas of its blocks. The replica count itself is configurable, as sketched below.
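For concreteness, here are the standard HDFS commands for inspecting and changing replication (the path /data/file.txt is just a placeholder; the cluster-wide default is the dfs.replication property in hdfs-site.xml):

hdfs dfs -ls /data/file.txt            # the second column of the listing is the current replica count
hdfs dfs -setrep -w 4 /data/file.txt   # raise this file's replica count to 4 and wait for completion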
Answer C:
You can search for: methods for scheduled and real-time backup and recovery of Hadoop file system data.
Hadoop is a distributed system infrastructure built around the distributed file system HDFS (Hadoop Distributed File System); enterprises running applications over very large data sets (large data sets), i.e. typical big data workloads, generally use it.
Through the UCache disaster-recovery cloud platform you can perform scheduled and real-time backup and recovery of HDFS data online; the scheduled Hadoop backup task can also auto-discover data sources, so you simply tick the files you want backed up.
Notes:
When the cluster has a single NameNode, the cluster must be running while the HDFS backup is taken. When the cluster has multiple NameNodes, one must be active and one standby for normal operation; if both NameNodes are standby, you have to run the appropriate admin command to bring the cluster back to a normal state, roughly as sketched below.
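A sketch using placeholder NameNode IDs nn1/nn2 (the real service IDs come from dfs.ha.namenodes.* in hdfs-site.xml):

hdfs haadmin -getServiceState nn1      # prints "active" or "standby"
hdfs haadmin -getServiceState nn2
hdfs haadmin -transitionToActive nn1   # manually promote one NameNode to active
                                       # (with automatic failover enabled this needs --forcemanual)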
Detailed step-by-step instructions with screenshots are in my article on Baidu Baijiahao.
Elasticsearch backup and recovery (4): using ES-Hadoop to write ES index data into HDFS
For background, see the previous post: Elasticsearch backup and recovery (3): using ES-Hadoop to write HDFS data into Elasticsearch.
The project follows the tweets2HdfsMapper example from the book 《Elasticsearch集成Hadoop最佳实践》 (Elasticsearch-Hadoop integration best practices).
Project source: https://gitee.com/constfafa/ESToHDFS.git
Development process:
1. First, inspect the index in Kibana:
"hits": [
  {
    "_index": "xxx-words",
    "_type": "history",
    "_id": "zankHWUBk5wX4tbY-gpZ",
    "_score": 1,
    "_source": {
      "word": "abc",
      "createTime": "2018-08-09 16:56:00",
      "userId": "263",
      "datetime": "2018-08-09T16:56:00Z"
    }
  },
  {
    "_index": "xxx-words",
    "_type": "history",
    "_id": "zqntHWUBk5wX4tbYFAqy",
    "_score": 1,
    "_source": {
      "word": "bcd",
      "createTime": "2018-08-09 16:59:00",
      "userId": "263",
      "datetime": "2018-08-09T16:59:00Z"
    }
  }
]
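2. Package the map-only job that scrolls the index and dumps each document to HDFS. A minimal sketch of what the driver and mapper might look like (class names, the output path, and the ES address here are my assumptions, not the actual project code; the real implementation is in the Gitee repository above):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.elasticsearch.hadoop.mr.EsInputFormat;

/** Map-only job: read documents from ES via ES-Hadoop and write them to HDFS as text. */
public class History2Hdfs {

    public static class Es2HdfsMapper extends MapReduceBase
            implements Mapper<Text, Writable, NullWritable, Text> {
        @Override
        public void map(Text docId, Writable doc,
                        OutputCollector<NullWritable, Text> out,
                        Reporter reporter) throws IOException {
            // The key is the document _id; the value is the _source as a writable map.
            out.collect(NullWritable.get(), new Text(doc.toString()));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(History2Hdfs.class);
        job.setJobName("history2hdfs");
        job.setInputFormat(EsInputFormat.class);       // splits correspond to ES shard scrolls
        job.setOutputFormat(TextOutputFormat.class);
        job.setMapperClass(Es2HdfsMapper.class);
        job.setNumReduceTasks(0);                      // map-only job
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.set("es.nodes", "192.168.211.104:9200");   // assumed ES address
        job.set("es.resource", "xxx-words/history");   // index/type to read
        FileOutputFormat.setOutputPath(job, new Path("/backup/xxx-words")); // assumed HDFS path
        JobClient.runJob(job);
    }
}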
After that, simply run hadoop jar history2hdfs-job.jar.
The run looks like this:
[[email protected] jar]# hadoop jar history2hdfs-job.jar
18/06/07 04:04:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/06/07 04:04:42 INFO client.RMProxy: Connecting to ResourceManager at /192.168.211.104:8032
18/06/07 04:04:48 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/06/07 04:04:55 INFO util.Version: Elasticsearch Hadoop v6.2.3 [039a45c5a1]
18/06/07 04:04:58 INFO mr.EsInputFormat: Reading from [hzeg-history-words/history]
18/06/07 04:04:58 INFO mr.EsInputFormat: Created [2] splits
18/06/07 04:05:00 INFO mapreduce.JobSubmitter: number of splits:2
18/06/07 04:05:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528305729734_0007
18/06/07 04:05:05 INFO impl.YarnClientImpl: Submitted application application_1528305729734_0007
18/06/07 04:05:06 INFO mapreduce.Job: The url to track the job: http://docker02:8088/proxy/application_1528305729734_0007/
18/06/07 04:05:06 INFO mapreduce.Job: Running job: job_1528305729734_0007
18/06/07 04:09:31 INFO mapreduce.Job: Job job_1528305729734_0007 running in uber mode : false
18/06/07 04:09:42 INFO mapreduce.Job:  map 0% reduce 0%
18/06/07 04:15:36 INFO mapreduce.Job:  map 100% reduce 0%
18/06/07 04:17:26 INFO mapreduce.Job: Job job_1528305729734_0007 completed successfully
18/06/07 04:17:56 INFO mapreduce.Job: Counters: 47
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=230906
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=74694
        HDFS: Number of bytes written=222
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters
        Launched map tasks=2
        Rack-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=791952
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=791952
        Total vcore-seconds taken by all map tasks=791952
        Total megabyte-seconds taken by all map tasks=810958848
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=74694
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=8106
        CPU time spent (ms)=20240
        Physical memory (bytes) snapshot=198356992
        Virtual memory (bytes) snapshot=4128448512
        Total committed heap usage (bytes)=32157696
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=222
    Elasticsearch Hadoop Counters
        Bulk Retries=0
        Bulk Retries Total Time(ms)=0
        Bulk Total=0
        Bulk Total Time(ms)=0
        Bytes Accepted=0
        Bytes Received=1102
        Bytes Retried=0
        Bytes Sent=296
        Documents Accepted=0
        Documents Received=0
        Documents Retried=0
        Documents Sent=0
        Network Retries=0
        Network Total Time(ms)=5973
        Node Retries=0
        Scroll Total=1
        Scroll Total Time(ms)=666
Afterwards it still complains that the job history server cannot be found, but this does not affect the result.
Then check that the files and data are correct.
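The original post shows this check as screenshots; commands along these lines would do the same (the output path matches the one assumed in the driver sketch above):

hdfs dfs -ls /backup/xxx-words               # two part files, one per map task
hdfs dfs -cat /backup/xxx-words/part-00000   # the exported documents as text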
As you can see, the index data did end up in HDFS.
In actual use I ran into the following problems.
1. In a Spring Boot Gradle project, I added the following dependencies to use this feature:
compile group: 'org.apache.hadoop', name: 'hadoop-core', version: '1.2.1'
compile group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.7.2'
compile group: 'org.elasticsearch', name: 'elasticsearch-hadoop', version: '6.2.4'
after which Spring Boot failed to start with the error "A child container failed during start".
Inspecting the Gradle dependency tree:
For hadoop-core:1.2.1:
+--- org.apache.hadoop:hadoop-core:1.2.1
|    +--- commons-cli:commons-cli:1.2
|    +--- xmlenc:xmlenc:0.52
|    +--- com.sun.jersey:jersey-core:1.8 -> 1.9
|    +--- com.sun.jersey:jersey-json:1.8
|    |    +--- org.codehaus.jettison:jettison:1.1
|    |    |    \--- stax:stax-api:1.0.1
|    |    +--- com.sun.xml.bind:jaxb-impl:2.2.3-1
|    |    |    \--- javax.xml.bind:jaxb-api:2.2.2 -> 2.3.0
|    |    +--- org.codehaus.jackson:jackson-core-asl:1.7.1 -> 1.9.13
|    |    +--- org.codehaus.jackson:jackson-mapper-asl:1.7.1 -> 1.9.13
|    |    |    \--- org.codehaus.jackson:jackson-core-asl:1.9.13
|    |    +--- org.codehaus.jackson:jackson-jaxrs:1.7.1
|    |    |    +--- org.codehaus.jackson:jackson-core-asl:1.7.1 -> 1.9.13
|    |    |    \--- org.codehaus.jackson:jackson-mapper-asl:1.7.1 -> 1.9.13 (*)
|    |    +--- org.codehaus.jackson:jackson-xc:1.7.1
|    |    |    +--- org.codehaus.jackson:jackson-core-asl:1.7.1 -> 1.9.13
|    |    |    \--- org.codehaus.jackson:jackson-mapper-asl:1.7.1 -> 1.9.13 (*)
|    |    \--- com.sun.jersey:jersey-core:1.8 -> 1.9
|    +--- com.sun.jersey:jersey-server:1.8 -> 1.9
|    |    +--- asm:asm:3.1
|    |    \--- com.sun.jersey:jersey-core:1.9
|    +--- commons-io:commons-io:2.1 -> 2.4
|    +--- commons-httpclient:commons-httpclient:3.0.1 -> 3.1
|    |    +--- commons-logging:commons-logging:1.0.4 -> 1.2
|    |    \--- commons-codec:commons-codec:1.2 -> 1.11
|    +--- commons-codec:commons-codec:1.4 -> 1.11
|    +--- org.apache.commons:commons-math:2.1
|    +--- commons-configuration:commons-configuration:1.6
|    |    +--- commons-collections:commons-collections:3.2.1
|    |    +--- commons-lang:commons-lang:2.4 -> 2.6
|    |    +--- commons-logging:commons-logging:1.1.1 -> 1.2
|    |    +--- commons-digester:commons-digester:1.8
|    |    |    +--- commons-beanutils:commons-beanutils:1.7.0 (*)
|    |    |    \--- commons-logging:commons-logging:1.1 -> 1.2
|    |    \--- commons-beanutils:commons-beanutils-core:1.8.0
|    |         \--- commons-logging:commons-logging:1.1.1 -> 1.2
|    +--- commons-net:commons-net:1.4.1
|    |    \--- oro:oro:2.0.8
|    +--- org.mortbay.jetty:jetty:6.1.26
|    |    +--- org.mortbay.jetty:jetty-util:6.1.26
|    |    \--- org.mortbay.jetty:servlet-api:2.5-20081211
|    +--- org.mortbay.jetty:jetty-util:6.1.26
|    +--- tomcat:jasper-runtime:5.5.12
|    +--- tomcat:jasper-compiler:5.5.12
|    +--- org.mortbay.jetty:jsp-api-2.1:6.1.14
|    |    \--- org.mortbay.jetty:servlet-api-2.5:6.1.14
|    +--- org.mortbay.jetty:jsp-2.1:6.1.14
|    |    +--- org.eclipse.jdt:core:3.1.1
|    |    +--- org.mortbay.jetty:jsp-api-2.1:6.1.14 (*)
|    |    \--- ant:ant:1.6.5
|    +--- commons-el:commons-el:1.0
|    |    \--- commons-logging:commons-logging:1.0.3 -> 1.2
|    +--- net.java.dev.jets3t:jets3t:0.6.1
|    |    +--- commons-codec:commons-codec:1.3 -> 1.11
|    |    +--- commons-logging:commons-logging:1.1.1 -> 1.2
|    |    \--- commons-httpclient:commons-httpclient:3.1 (*)
|    +--- hsqldb:hsqldb:1.8.0.10
|    +--- oro:oro:2.0.8
|    +--- org.eclipse.jdt:core:3.1.1
|    \--- org.codehaus.jackson:jackson-mapper-asl:1.8.8 -> 1.9.13 (*)
And for hadoop-hdfs:2.7.1:
+--- org.apache.hadoop:hadoop-hdfs:2.7.1
|    +--- com.google.guava:guava:11.0.2 -> 18.0
|    +--- org.mortbay.jetty:jetty:6.1.26 (*)
|    +--- org.mortbay.jetty:jetty-util:6.1.26
|    +--- com.sun.jersey:jersey-core:1.9
|    +--- com.sun.jersey:jersey-server:1.9 (*)
|    +--- commons-cli:commons-cli:1.2
|    +--- commons-codec:commons-codec:1.4 -> 1.11
|    +--- commons-io:commons-io:2.4
|    +--- commons-lang:commons-lang:2.6
|    +--- commons-logging:commons-logging:1.1.3 -> 1.2
|    +--- commons-daemon:commons-daemon:1.0.13
|    +--- log4j:log4j:1.2.17
|    +--- com.google.protobuf:protobuf-java:2.5.0
|    +--- javax.servlet:servlet-api:2.5
|    +--- org.codehaus.jackson:jackson-core-asl:1.9.13
|    +--- org.codehaus.jackson:jackson-mapper-asl:1.9.13 (*)
|    +--- xmlenc:xmlenc:0.52
|    +--- io.netty:netty-all:4.0.23.Final -> 4.1.22.Final
|    +--- xerces:xercesImpl:2.9.1
|    |    \--- xml-apis:xml-apis:1.3.04 -> 1.4.01
|    +--- org.apache.htrace:htrace-core:3.1.0-incubating
|    \--- org.fusesource.leveldbjni:leveldbjni-all:1.8
Cause:
Both of these pull in servlet-api 2.5, whereas Spring Boot requires Servlet 3 or higher.
Solution:
Exclude servlet-api 2.5.
Configure this in build.gradle:
configurations {
    all*.exclude group: 'javax.servlet'
}
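This blanket exclude removes javax.servlet from every configuration, which is the simple route here because hadoop-core also drags Servlet 2.5 in through several org.mortbay.jetty modules. A narrower, per-dependency variant would look roughly like this (a sketch, not the configuration the post actually used):

compile(group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.7.2') {
    // drop only the Servlet 2.5 API that clashes with Spring Boot's embedded container
    exclude group: 'javax.servlet', module: 'servlet-api'
}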
References:
java - SpringBoot Catalina LifeCycle Exception - Stack Overflow
解决gradle管理依赖中 出现servlet-api.jar冲突的问题。 - CSDN博客
2. The Hadoop HDFS version in use is 2.7.2; when the job wrote ES data into HDFS, it failed with:
ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
Cause:
hadoop-core 1.2.1 is too old to talk to Hadoop 2.7.2: its IPC client speaks protocol version 4, while the 2.7.2 server speaks version 9.
The fix is to change the dependencies to:
compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.7.2'
compile group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '2.7.2'
compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '2.7.2'
compile group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.7.2'
compile group: 'org.elasticsearch', name: 'elasticsearch-hadoop', version: '6.2.4'
Reference:
intellij的maven工程"Server IPC version 9 cannot communicate with client version"错误的解决办法