当实际的活动名称节点关闭时,HDFS HA 集群备用节点不会变为活动状态
Posted
技术标签:
【中文标题】当实际的活动名称节点关闭时,HDFS HA 集群备用节点不会变为活动状态【英文标题】:HDFS HA cluster Standby node doesn't become Active when the actual active namenode shutdown 【发布时间】:2017-02-24 09:04:32 【问题描述】:我已将 HDFS 配置为 HA 模式。我有一个“活动”节点和一个“备用”节点。我已经启动了 ZKFC。 如果我停止活动节点的 zkfc,则备用节点会更改状态并设置为“活动”节点。 问题是当我关闭启动 zkfc 的 Active 服务器和一台“Active”服务器和一台“Standby”服务器时,Standby 服务器不会改变他的状态,始终保持 Standby 状态。
我的 core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://auto-ha</value>
</property>
</configuration>
我的 hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual address the RPC server will bind to. If this optional address is
set, it overrides only the hostname portion of dfs.namenode.rpc-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node listen on all interfaces by
setting it to 0.0.0.0.
</description>
</property>
<property>
<name>dfs.namenode.servicerpc-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual address the service RPC server will bind to. If this optional address is
set, it overrides only the hostname portion of dfs.namenode.servicerpc-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node listen on all interfaces by
setting it to 0.0.0.0.
</description>
</property>
<property>
<name>dfs.namenode.http-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual adress the HTTP server will bind to. If this optional address
is set, it overrides only the hostname portion of dfs.namenode.http-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node HTTP server listen on all
interfaces by setting it to 0.0.0.0.
</description>
</property>
<property>
<name>dfs.namenode.https-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual adress the HTTPS server will bind to. If this optional address
is set, it overrides only the hostname portion of dfs.namenode.https-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node HTTPS server listen on all
interfaces by setting it to 0.0.0.0.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///hdfs/data</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>auto-ha</value>
</property>
<property>
<name>dfs.ha.namenodes.auto-ha</name>
<value>nn01,nn02</value>
</property>
<property>
<name>dfs.namenode.rpc-address.auto-ha.nn01</name>
<value>master1:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.auto-ha.nn01</name>
<value>master1:50070</value>
</property>
<property>
<name>dfs.namenode.rpc-address.auto-ha.nn02</name>
<value>master2:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.auto-ha.nn02</name>
<value>master2:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://master1:8485;master2:8485;master3:8485/auto-ha</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/hdfs/journalnode</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/ikerlan/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled.auto-ha</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>master1:2181,master2:2181,master3:2181</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.auto-ha</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
</configuration>
我检查了日志,问题是在尝试使用 Fence 时,我遇到了下一个失败:
2017-02-24 12:46:29,389 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master2/172.16.8.232:8020. Already tried 0 time$
2017-02-24 12:46:49,399 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at master2/172.16.8.232:8020 $
org.apache.hadoop.net.ConnectTimeoutException: Call From master1/172.16.8.231 to master2:8020 failed on socket timeout exception: org.$
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTran$
at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch :$
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
【问题讨论】:
【参考方案1】:我刚刚添加了下一个属性,现在它工作正常:
HDFS_SITE.XML
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>
核心站点.XML
<property>
<name>hs.zookeeper.quorum</name>
<value>master1:2181,master2:2181,master3:2181</value>
</property>
问题是无法使用 sshfence 连接,因此使用 shell(/bin/true) 可以正常工作。
【讨论】:
以上是关于当实际的活动名称节点关闭时,HDFS HA 集群备用节点不会变为活动状态的主要内容,如果未能解决你的问题,请参考以下文章