Implementing High Availability with corosync + pacemaker + crmsh
Contents
1. Introduction and Environment Overview
2. Deploying the High-Availability Environment
3. Using the crmsh Interface
4. Case Study
5. Summary
1. Introduction and Environment Overview
The previous post covered the theoretical background of high availability. This post walks through installing and deploying the corosync + pacemaker HA stack and demonstrates failover with a concrete case. corosync provides the cluster messaging layer, carrying heartbeat and cluster transaction information; pacemaker works at the resource allocation layer as the cluster resource manager; and crmsh is the command-line interface used to configure resources. Before getting into the main topic, here is a look at the common open-source HA stacks and the system environment used for this setup.
Common open-source HA stacks:
heartbeat v1 + haresources
heartbeat v2 + crm
heartbeat v3 + cluster-glue + pacemaker
corosync + cluster-glue + pacemaker
cman + rgmanager
keepalived + script
System environment used for this test:
[root@nod1 tomcat]# cat /etc/issue
CentOS release 6.4 (Final)
Kernel \r on an \m

[root@nod1 tomcat]# uname -r
2.6.32-358.el6.x86_64
Both nodes run the same operating system.
2. Deploying the High-Availability Environment
[root@nod1 ~]# yum -y install pacemaker corosync
# Install pacemaker and corosync with yum (a working yum repository is required).
# Note: both nodes must be installed.
[root@nod1 ~]# rpm -ql corosync
/etc/corosync
/etc/corosync/corosync.conf.example          # template for the main configuration file
/etc/corosync/corosync.conf.example.udpu
/etc/corosync/service.d
/etc/corosync/uidgid.d
/etc/dbus-1/system.d/corosync-signals.conf
/etc/rc.d/init.d/corosync
/etc/rc.d/init.d/corosync-notifyd
/etc/sysconfig/corosync-notifyd
/usr/bin/corosync-blackbox
/usr/libexec/lcrso
/usr/libexec/lcrso/coroparse.lcrso
/usr/libexec/lcrso/objdb.lcrso
/usr/libexec/lcrso/quorum_testquorum.lcrso
/usr/libexec/lcrso/quorum_votequorum.lcrso
/usr/libexec/lcrso/service_cfg.lcrso
/usr/libexec/lcrso/service_confdb.lcrso
/usr/libexec/lcrso/service_cpg.lcrso
/usr/libexec/lcrso/service_evs.lcrso
/usr/libexec/lcrso/service_pload.lcrso
/usr/libexec/lcrso/vsf_quorum.lcrso
/usr/libexec/lcrso/vsf_ykd.lcrso
/usr/sbin/corosync
/usr/sbin/corosync-cfgtool
/usr/sbin/corosync-cpgtool
/usr/sbin/corosync-fplay
/usr/sbin/corosync-keygen
# corosync-keygen generates the authkey used for node authentication. The key is derived from the
# kernel entropy pool, so if there is not enough randomness the command appears to hang; keep
# pressing keys until enough entropy has been gathered and the authkey file is written.
/usr/sbin/corosync-notifyd
/usr/sbin/corosync-objctl
/usr/sbin/corosync-pload
/usr/sbin/corosync-quorumtool
/usr/share/doc/corosync-1.4.7
/usr/share/doc/corosync-1.4.7/LICENSE
/usr/share/doc/corosync-1.4.7/SECURITY
/usr/share/man/man5/corosync.conf.5.gz
/usr/share/man/man8/confdb_keys.8.gz
/usr/share/man/man8/corosync-blackbox.8.gz
/usr/share/man/man8/corosync-cfgtool.8.gz
/usr/share/man/man8/corosync-cpgtool.8.gz
/usr/share/man/man8/corosync-fplay.8.gz
/usr/share/man/man8/corosync-keygen.8.gz
/usr/share/man/man8/corosync-notifyd.8.gz
/usr/share/man/man8/corosync-objctl.8.gz
/usr/share/man/man8/corosync-pload.8.gz
/usr/share/man/man8/corosync-quorumtool.8.gz
/usr/share/man/man8/corosync.8.gz
/usr/share/man/man8/corosync_overview.8.gz
/usr/share/snmp/mibs/COROSYNC-MIB.txt
/var/lib/corosync
/var/log/cluster
Generate the authentication key shared by the cluster nodes:
[root@nod1 ~]# corosync-keygen       # generate the authentication key
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Press keys on your keyboard to generate entropy (bits = 80).
# If the entropy pool runs low the command sits here; you can open another terminal and continue
# with the rest of the configuration in the meantime.
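If corosync-keygen seems to hang, you can check how much entropy the kernel currently has available. This is an optional check I am adding here, not part of the original procedure, and the value shown is only illustrative:
[root@nod1 ~]# cat /proc/sys/kernel/random/entropy_avail    # bits of entropy currently in the pool
183
# A value in the low hundreds or less means corosync-keygen will wait; typing on the keyboard or
# generating disk I/O in another terminal feeds the pool.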
Create the corosync configuration file from the shipped template:
[root@nod1 ~]# cd /etc/corosync [root@nod1 corosync]# cp corosync.conf.example corosync.conf [root@nod1 corosync]# ls corosync.conf.example service.d corosync.conf corosync.conf.example.udpu uidgid.d [root@nod1 corosync]# vim corosync.conf # Please read the corosync.conf.5 manual page compatibility: whitetank #表示兼容whitetank版本,其实是corosync 0.8之前的版本 totem { #定义集群环境下各corosync间通讯机制 version: 2 # secauth: Enable mutual node authentication. If you choose to # enable this ("on"), then do remember to create a shared # secret with "corosync-keygen". #secauth: off secauth: on #表示基于authkey的方式来验证各节点 threads: 0 #启动的线程数,0表示不启动线程机制,默认即可 # interface: define at least one interface to communicate # over. If you define more than one interface stanza, you must # also set rrp_mode. interface { #定义哪个接口来传递心跳信息和集群事务信息 # Rings must be consecutively numbered, starting at 0. ringnumber: 0 #表示心跳信息发出后能够在网络中转几圈,保持默认值即可 # This is normally the *network* address of the # interface to bind to. This ensures that you can use # identical instances of this configuration file # across all your cluster nodes, without having to # modify this option. bindnetaddr: 192.168.0.0 #绑定的网络地址 # However, if you have multiple physical network # interfaces configured for the same subnet, then the # network address alone is not sufficient to identify # the interface Corosync should bind to. In that case, # configure the *host* address of the interface # instead: # bindnetaddr: 192.168.1.1 # When selecting a multicast address, consider RFC # 2365 (which, among other things, specifies that # 239.255.x.x addresses are left to the discretion of # the network administrator). Do not reuse multicast # addresses across multiple Corosync clusters sharing # the same network. mcastaddr: 239.255.21.111 #监听的多播地址,不要使用默认 # Corosync uses the port you specify here for UDP # messaging, and also the immediately preceding # port. Thus if you set this to 5405, Corosync sends # messages over UDP ports 5405 and 5404. mcastport: 5405 #corosync间传递信息使用的端口,默认即可 # Time-to-live for cluster communication packets. The # number of hops (routers) that this ring will allow # itself to pass. Note that multicast routing must be # specifically enabled on most network routers. ttl: 1 #包的生存周期,保持默认即可 } } logging { # Log the source file and line where messages are being # generated. When in doubt, leave off. Potentially useful for # debugging. fileline: off # Log to standard error. When in doubt, set to no. Useful when # running in the foreground (when invoking "corosync -f") to_stderr: no # Log to a log file. When set to "no", the "logfile" option # must not be set. to_logfile: yes logfile: /var/log/cluster/corosync.log # Log to the system log daemon. When in doubt, set to yes. to_syslog: no #关闭日志发往syslog # Log debug messages (very verbose). When in doubt, leave off. debug: off # Log messages with time stamps. When in doubt, set to on # (unless you are only logging to syslog, where double # timestamps can be annoying). timestamp: on #打印日志时是否记录时间戳,会消耗较多的cpu资源 logger_subsys { subsys: AMF debug: off } } #新增加以下内容 service { ver: 0 name: pacemaker #表示以插件化方式启用pacemaker } aisexec { #运行openaix时所使用的用户及组,默认时也是采用root,可以不定义 user: root group: root }
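bindnetaddr must be the network address of the interface that carries cluster traffic. As a quick sanity check (not part of the original walkthrough, and assuming the initscripts ipcalc tool is installed), the network address can be derived from the node's own IP:
[root@nod1 corosync]# ip -4 addr show eth0 | grep inet
    inet 192.168.0.201/24 brd 192.168.0.255 scope global eth0
[root@nod1 corosync]# ipcalc -n 192.168.0.201 255.255.255.0    # network address for this /24
NETWORK=192.168.0.0                                            # this is the value to use for bindnetaddr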
Once corosync-keygen has completed successfully, the authkey file appears in /etc/corosync/:
[root@nod1 corosync]# ls
authkey  corosync.conf  corosync.conf.example  corosync.conf.example.udpu  service.d  uidgid.d
[root@nod1 corosync]# scp authkey corosync.conf nod2.test.com:/etc/corosync/
# Copy the key and the configuration file to the other node.
[root@nod1 corosync]# service corosync start
# Start the service; do not forget to start corosync on the other node as well.
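Before starting the service it is worth confirming that both nodes really hold the same key and that its permissions survived the copy. This is an optional check I am adding here; the file size and timestamp shown are only illustrative:
[root@nod1 corosync]# ls -l /etc/corosync/authkey
-r-------- 1 root root 128 Jul 19 21:40 /etc/corosync/authkey          # corosync-keygen creates the key with mode 0400
[root@nod1 corosync]# md5sum /etc/corosync/authkey
[root@nod1 corosync]# ssh nod2.test.com md5sum /etc/corosync/authkey   # the two checksums must match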
Verify that corosync started correctly; in a cluster these checks should be run on every node.
Check whether the cluster engine started:
[root@nod1 corosync]# grep -e "Corosync Cluster Engine" /var/log/cluster/corosync.log
Jul 19 21:45:48 corosync [MAIN  ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
[root@nod1 corosync]# grep -e "configuration file" /var/log/cluster/corosync.log
# Check that the configuration file was loaded successfully
Jul 19 21:45:48 corosync [MAIN  ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
Check whether the TOTEM interface we defined was brought up:
[root@nod1 corosync]# grep "TOTEM" /var/log/cluster/corosync.log
Jul 19 21:45:48 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Jul 19 21:45:48 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul 19 21:45:48 corosync [TOTEM ] The network interface [192.168.0.201] is now up.
Jul 19 21:45:48 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Check whether any errors were logged at startup:
[root@nod1 corosync]# grep "ERROR" /var/log/cluster/corosync.log
Jul 19 21:45:48 corosync [pcmk  ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Jul 19 21:45:48 corosync [pcmk  ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
# These errors can be ignored: they only warn that pacemaker is running as a corosync plugin,
# a setup that later versions no longer support.
Check whether pacemaker started correctly:
[root@nod1 corosync]# grep "pcmk_startup" /var/log/cluster/corosync.log
Jul 19 21:45:48 corosync [pcmk  ] info: pcmk_startup: CRM: Initialized
Jul 19 21:45:48 corosync [pcmk  ] Logging: Initialized pcmk_startup
Jul 19 21:45:48 corosync [pcmk  ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
Jul 19 21:45:48 corosync [pcmk  ] info: pcmk_startup: Service: 9
Jul 19 21:45:48 corosync [pcmk  ] info: pcmk_startup: Local hostname: nod1.test.com
3. Using the crmsh Interface
Pacemaker has two configuration front ends, crmsh and pcs; this article uses crmsh.
crmsh depends on the pssh package, so both packages must be installed on every cluster node; they can be downloaded from http://crmsh.github.io/
[root@nod1 ~]# ls
crmsh-2.1-1.6.x86_64.rpm  pssh-2.3.1-2.el6.x86_64.rpm
[root@nod1 ~]# yum install crmsh-2.1-1.6.x86_64.rpm pssh-2.3.1-2.el6.x86_64.rpm
The crm command of crmsh works in two modes: in command (one-shot) mode it executes a single command and prints the result to the shell's standard output; in interactive mode it opens its own shell. The many examples below illustrate both.
Using the crm command:
[root@nod1 ~]# crm          # running crm with no arguments enters the interactive mode
crm(live)#
crm(live)# help             # show the help text and the subcommands crm supports
Commonly used crmsh subcommands (a short command-mode sketch follows the list):
status: show the cluster status
configure: configure the cluster
node: manage node state
ra: work with resource agents
resource: manage resources, e.g. stop a resource or clean up its current state (such as error records)
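For reference, each of these subcommands can also be run in command (one-shot) mode straight from the shell. A brief sketch; the node name is simply the one used in this setup:
[root@nod1 ~]# crm status                        # same as "status" inside the interactive shell
[root@nod1 ~]# crm configure show                # dump the current configuration
[root@nod1 ~]# crm ra classes                    # list the resource agent classes
[root@nod1 ~]# crm node standby nod2.test.com    # put a node into standby without entering the shell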
First, check the cluster status:
[root@nod1 ~]# crm
crm(live)# status
Last updated: Tue Jul 21 21:21:35 2015
Last change: Sun Jul 19 23:01:34 2015
Stack: classic openais (with plugin)
# The stack line shows that pacemaker is driven by corosync (from openais) in plugin mode.
Current DC: nod1.test.com - partition with quorum
# DC stands for Designated Coordinator: nod1.test.com is the node coordinating cluster transactions,
# and "partition with quorum" means the current partition holds the majority of votes.
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes        # two nodes configured, two votes expected
0 Resources configured                      # no cluster resources configured yet
Online: [ nod1.test.com nod2.test.com ]     # both nodes are online
Look at the cluster's default configuration:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# show
# "show" prints the current cluster configuration; "show xml" prints it in XML format.
node nod1.test.com
node nod2.test.com
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=true \
        no-quorum-policy=stop \
        last-lrm-refresh=1436887216
crm(live)configure# verify          # verify checks the configuration for errors
   error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
   error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
   error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
# The errors say that no STONITH device has been defined, which a corosync+pacemaker cluster
# requires by default; the check can be disabled, as shown below.
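The same validation can also be run outside of crmsh with pacemaker's crm_verify tool against the live CIB; this is an optional cross-check, not part of the original steps:
[root@nod1 ~]# crm_verify -L -V    # -L checks the live cluster configuration, -V adds verbose output
# At this point it reports the same STONITH-related errors as "verify" above; once
# stonith-enabled=false has been committed it runs silently.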
Use the property subcommand to set cluster-wide options:
[root@nod1 ~]# crm
crm(live)configure# property
# crmsh supports tab completion: type "property" and hit Tab twice to list the available options
batch-limit=                 maintenance-mode=            remove-after-stop=
cluster-delay=               migration-limit=             shutdown-escalation=
cluster-recheck-interval=    no-quorum-policy=            start-failure-is-fatal=
crmd-transition-delay=       node-action-limit=           startup-fencing=
dc-deadtime=                 node-health-green=           stonith-action=
default-action-timeout=      node-health-red=             stonith-enabled=
default-resource-stickiness= node-health-strategy=        stonith-timeout=
election-timeout=            node-health-yellow=          stop-all-resources=
enable-acl=                  pe-error-series-max=         stop-orphan-actions=
enable-startup-probes=       pe-input-series-max=         stop-orphan-resources=
is-managed-default=          pe-warn-series-max=          symmetric-cluster=
load-threshold=              placement-strategy=
crm(live)configure# property stonith-enabled=false
# Disable STONITH support; otherwise a STONITH device would have to be defined before the
# cluster will run resources.
crm(live)configure# show
node nod1.test.com
node nod2.test.com
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false \             # now false
        no-quorum-policy=stop \
        last-lrm-refresh=1436887216
crm(live)configure# verify          # verification no longer reports errors
crm(live)configure# commit          # commit the configuration
Configuring cluster resources
Details about a resource are looked up under ra (resource agent); for example, before defining a virtual IP resource:
[root@nod1 ~]# crm
crm(live)# ra
crm(live)ra# classes        # list the classes of resource agents
lsb
ocf / heartbeat pacemaker
service
stonith
crm(live)ra# list ocf       # list the resource agents in the ocf class; IPaddr, which manages IP addresses, is among them
CTDB            ClusterMon      Delay           Dummy           Filesystem      HealthCPU
HealthSMART     IPaddr          IPaddr2         IPsrcaddr       LVM             MailTo
Route           SendArp         Squid           Stateful        SysInfo         SystemHealth
VirtualDomain   Xinetd          apache          conntrackd      controld        db2
dhcpd           ethmonitor      exportfs        iSCSILogicalUnit                mysql
named           nfsnotify       nfsserver       pgsql           ping            pingd
postfix         remote          rsyncd          symlink         tomcat
crm(live)ra# meta ocf:IPaddr
# "meta" prints the detailed description of a resource agent, i.e. its usage help.
Define a primitive resource with the primitive command:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# primitive webip ocf:IPaddr params ip=192.168.0.100
crm(live)configure# verify
crm(live)configure# commit          # as soon as the commit succeeds the resource goes live
crm(live)configure# cd ..
crm(live)# status
Last updated: Tue Jul 21 22:14:43 2015
Last change: Tue Jul 21 22:12:44 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
1 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 webip  (ocf::heartbeat:IPaddr):        Started nod1.test.com
# This is the resource we just defined, started on nod1.test.com.
[root@nod1 ~]# ip add show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:0c:29:07:89:fe brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.201/24 brd 192.168.0.255 scope global eth0
    inet 192.168.0.100/24 brd 192.168.0.255 scope global secondary eth0
    inet6 fe80::20c:29ff:fe07:89fe/64 scope link
       valid_lft forever preferred_lft forever
# The IP we defined is now configured on eth0.
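As a side note, the same virtual IP could also be defined with the ocf:heartbeat:IPaddr2 agent, which lets you state the netmask and interface explicitly. The sketch below is only an illustration (the resource name webip2 and nic=eth0 are my assumptions); it is an alternative to the webip definition above, not something to commit alongside it:
crm(live)configure# primitive webip2 ocf:heartbeat:IPaddr2 \
        params ip=192.168.0.100 cidr_netmask=24 nic=eth0 \
        op monitor interval=30s timeout=20s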
Define the nginx service as a resource:
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# primitive nginx lsb:nginx
# The nginx service is handled by a resource agent in the lsb class; the first "nginx" after
# primitive is the name given to the cluster resource.
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Last updated: Tue Jul 21 22:25:00 2015
Last change: Tue Jul 21 22:24:58 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 webip  (ocf::heartbeat:IPaddr):        Started nod1.test.com
 nginx  (lsb:nginx):                    Started nod2.test.com
# The nginx resource was started on nod2.test.com, which confirms that an HA cluster tries to
# spread resources across the nodes. In practice, however, we want webip and nginx to run on the
# same node.
To keep several resources on the same node, either put them into a group or define a colocation constraint (a colocation/order sketch follows the group example below):
[root@nod1 ~]# crm
crm(live)# configure
crm(live)configure# group webservice webip nginx
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd ..
crm(live)# status
Last updated: Tue Jul 21 22:30:19 2015
Last change: Tue Jul 21 22:30:17 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     nginx      (lsb:nginx):                    Started nod1.test.com
# Both resources now run on nod1.test.com.
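For completeness, the colocation approach mentioned above would look roughly like the sketch below (the constraint names are my own). It keeps nginx on the node that holds webip and brings the IP up first, and would be used instead of, not in addition to, the group:
crm(live)configure# colocation nginx_with_webip inf: nginx webip    # nginx must run where webip runs
crm(live)configure# order webip_before_nginx inf: webip nginx       # bring up the IP before starting nginx
crm(live)configure# verify
crm(live)configure# commit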
Next, verify that the resources can fail over to the other node:
[root@nod1 ~]# crm node standby      # put the current node into standby
[root@nod1 ~]# crm status
Last updated: Tue Jul 21 22:37:14 2015
Last change: Tue Jul 21 22:37:09 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Node nod1.test.com: standby
Online: [ nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     nginx      (lsb:nginx):                    Started nod2.test.com
# The resources in the webservice group have moved to nod2.test.com.
Bring nod1.test.com back online and see whether the resources move back:
[root@nod1 ~]# crm node online       # bring the current node back online
You have new mail in /var/spool/mail/root
[root@nod1 ~]# crm status
Last updated: Tue Jul 21 22:38:37 2015
Last change: Tue Jul 21 22:38:33 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     nginx      (lsb:nginx):                    Started nod2.test.com
# The webservice group did not move back to nod1.test.com because no location preference has
# been defined for the group.
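If the group should prefer nod1.test.com and move back once that node returns, a location constraint expresses the preference. A minimal sketch (the constraint name and the score of 100 are arbitrary choices of mine):
crm(live)configure# location webservice_prefers_nod1 webservice 100: nod1.test.com
crm(live)configure# commit
# Whether resources actually move back also depends on resource stickiness (see
# default-resource-stickiness in the property list above); a stickiness higher than the location
# score keeps them where they are.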
If the corosync service on nod2.test.com is now stopped, will the resources in the webservice group move to nod1.test.com? Let's test:
[root@nod2 ~]# service corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:.                  [ OK ]
You have new mail in /var/spool/mail/root

Check the current cluster status on nod1.test.com:
[root@nod1 ~]# crm status
Last updated: Tue Jul 21 22:43:27 2015
Last change: Tue Jul 21 22:38:33 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com ]
OFFLINE: [ nod2.test.com ]
The output shows that the resources were not taken over. Why? Look closely at "Current DC: nod1.test.com - partition WITHOUT quorum": the current partition does not hold a majority of votes, so the node will not run resources and nothing fails over to it. There is more than one way to solve this: add a ping node (a sketch appears after the failover tests below), add a quorum disk, use an odd number of cluster nodes, or simply tell the cluster to ignore the loss of quorum. The last option is the simplest and is configured as follows:
[root@nod2 ~]# service corosync start        # start corosync on nod2.test.com again
Starting Corosync Cluster Engine (corosync): [ OK ]

[root@nod1 ~]# crm
crm(live)# status
Last updated: Tue Jul 21 22:50:08 2015
Last change: Tue Jul 21 22:38:33 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     nginx      (lsb:nginx):                    Started nod1.test.com
crm(live)configure# property no              # hit Tab twice to list the options starting with "no"
no-quorum-policy=      node-health-green=     node-health-strategy=
node-action-limit=     node-health-red=       node-health-yellow=
crm(live)configure# property no-quorum-policy=    # type "no-quorum-policy=" and hit Tab twice for help
no-quorum-policy (enum, [stop]): What to do when the cluster does not have quorum
    What to do when the cluster does not have quorum  Allowed values: stop, freeze, ignore, suicide
crm(live)configure# property no-quorum-policy=ignore     # set the value to "ignore"
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# show             # show the current configuration
node nod1.test.com \
        attributes standby=off
node nod2.test.com
primitive nginx lsb:nginx
primitive webip IPaddr \
        params ip=192.168.0.100
group webservice webip nginx
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        last-lrm-refresh=1436887216
crm(live)# status
Last updated: Tue Jul 21 22:54:00 2015
Last change: Tue Jul 21 22:51:10 2015
Stack: classic openais (with plugin)
Current DC: nod1.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     nginx      (lsb:nginx):                    Started nod1.test.com
# The resources are currently running on nod1.test.com.
Now stop corosync on nod1.test.com and see whether the resources move to nod2.test.com:
[root@nod1 ~]# service corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ]
Waiting for corosync services to unload:.                  [ OK ]

[root@nod2 ~]# crm            # open the crm management interface on nod2.test.com
crm(live)# status
Last updated: Tue Jul 21 22:56:52 2015
Last change: Tue Jul 21 22:52:25 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod2.test.com ]
OFFLINE: [ nod1.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     nginx      (lsb:nginx):                    Started nod2.test.com
# The resources have moved to nod2.test.com. In a two-node HA cluster, therefore, set
# "no-quorum-policy=ignore" so that a node keeps running resources even when it holds no more
# than half of the votes.
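Coming back to the quorum discussion, the ping-node alternative mentioned earlier would be configured roughly as follows. This is only a hedged sketch that is not part of the setup used in this article: the resource and constraint names, the gateway address 192.168.0.1, and the use of the agent's default "pingd" attribute name are all assumptions on my part:
crm(live)configure# primitive p_ping ocf:pacemaker:ping \
        params host_list="192.168.0.1" multiplier=100 \
        op monitor interval=20s timeout=60s
crm(live)configure# clone cl_ping p_ping             # run the ping resource on every node
crm(live)configure# location l_connected webservice \
        rule -inf: not_defined pingd or pingd lte 0  # keep webservice off nodes that cannot reach the gateway
crm(live)configure# verify
crm(live)configure# commit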
If instead we kill the nginx process on nod2.test.com, will the cluster move the resources to nod1.test.com? Let's test:
[root@nod1 ~]# service corosync start        # start corosync on nod1.test.com again
Starting Corosync Cluster Engine (corosync): [ OK ]
[root@nod1 ~]# crm status
Last updated: Wed Jul 22 22:22:56 2015
Last change: Wed Jul 22 22:19:55 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     nginx      (lsb:nginx):                    Started nod2.test.com

Now switch to nod2.test.com and kill the nginx processes:
[root@nod2 ~]# pgrep nginx
1798
1799
[root@nod2 ~]# killall nginx         # kill the nginx processes
[root@nod2 ~]# pgrep nginx           # no output means the nginx processes are gone
[root@nod2 ~]# crm status
Last updated: Wed Jul 22 22:26:09 2015
Last change: Wed Jul 22 22:19:55 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com
     nginx      (lsb:nginx):                    Started nod2.test.com
The status output shows that the cluster still believes the resources are running on nod2.test.com, which is not acceptable in production. The cluster therefore needs to monitor the resources we define: if a resource disappears, the cluster first tries to restart it, and only if that fails does it move the resource elsewhere. The following steps show how to define resource monitoring.
[root@nod2 ~]# service nginx start           # start the nginx that was killed above
Starting nginx:                              [ OK ]
Resource monitoring is declared together with the resource itself in the primitive statement, so first delete the resources defined earlier and recreate them:
[root@nod1 ~]# crm
crm(live)# resource
crm(live)resource# show
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started
     nginx      (lsb:nginx):                    Started
# The resource level shows the state of the configured resources; both are currently Started.
crm(live)resource# stop webservice
# Stop every resource in the webservice group; a resource must be stopped before it can be deleted.
crm(live)resource# show
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Stopped
     nginx      (lsb:nginx):                    Stopped
crm(live)resource# cd ..
crm(live)# configure
crm(live)configure# edit     # "edit" opens the resource configuration in the vi editor, as shown below
node nod1.test.com \
        attributes standby=on
node nod2.test.com
primitive nginx lsb:nginx            # resource we defined; delete this line
primitive webip IPaddr \             # resource we defined; delete this and the following line
        params ip=192.168.0.100
group webservice webip nginx \       # resource we defined; delete this and the following line
        meta target-role=Stopped
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        last-lrm-refresh=1436887216
#vim:set syntax=pcmk
Delete our resource definitions in the editor window, then save and quit; what remains should look like this:
node nod1.test.com \
        attributes standby=on
node nod2.test.com
property cib-bootstrap-options: \
        dc-version=1.1.11-97629de \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        last-lrm-refresh=1436887216
#vim:set syntax=pcmk

crm(live)configure# verify           # check the syntax
crm(live)configure# commit           # commit the configuration
crm(live)resource# cd                # back to the top level
crm(live)# status                    # check the cluster status
Last updated: Wed Jul 22 21:33:07 2015
Last change: Wed Jul 22 21:31:45 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
0 Resources configured
Online: [ nod1.test.com nod2.test.com ]
The status output confirms that the resources have been deleted. Now redefine them with monitoring:
crm(live)configure# primitive webip ocf:IPaddr params ip=192.168.0.100 op monitor timeout=20s interval=60s
crm(live)configure# primitive webserver lsb:nginx op monitor timeout=20s interval=60s
crm(live)configure# group webservice webip webserver
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Wed Jul 22 22:29:59 2015
Last change: Wed Jul 22 22:28:01 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     webserver  (lsb:nginx):                    Started nod1.test.com
The monitored resources are now in place; the meaning of the monitor parameters used above can be looked up with a command such as "crm(live)ra# meta ocf:IPaddr" (a hedged variant with additional operation options follows below). Now kill nginx on nod1.test.com and watch what happens:
[root@nod1 ~]# pgrep nginx
3056
3063
[root@nod1 ~]# killall nginx
[root@nod1 ~]# pgrep nginx
[root@nod1 ~]# pgrep nginx
[root@nod1 ~]# pgrep nginx           # a few dozen seconds later nginx has been restarted
3337
3338
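The op keyword accepts more than just monitor: start and stop timeouts and an on-fail policy can be set the same way. Below is a hedged variant of the webserver definition; the 120s timeouts and the on-fail=restart choice are values I picked for illustration, not something taken from the original article:
crm(live)configure# primitive webserver lsb:nginx \
        op monitor interval=60s timeout=20s on-fail=restart \
        op start timeout=120s interval=0 \
        op stop timeout=120s interval=0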
Look at the cluster status again:
[root@nod1 ~]# crm status
Last updated: Wed Jul 22 22:33:29 2015
Last change: Wed Jul 22 22:28:01 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod1.test.com
     webserver  (lsb:nginx):                    Started nod1.test.com
Failed actions:
    webserver_monitor_60000 on nod1.test.com 'not running' (7): call=23, status=complete,
    last-rc-change='Wed Jul 22 22:32:02 2015', queued=0ms, exec=0ms
# The failed action records that the webserver resource was found not running.
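The entry under "Failed actions" stays in the status output until it is cleared. Once the cause is understood, the record can be removed with the resource cleanup subcommand; this is an optional follow-up step not shown in the original article:
[root@nod1 ~]# crm resource cleanup webserver    # clear the fail count and failed-action record for webserver
[root@nod1 ~]# crm status                        # the "Failed actions" section is gone afterwards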
What happens if nginx is killed and cannot be started again? To test this, kill nginx and then immediately break its configuration file by appending some garbage so that it fails the syntax check and cannot start:
[root@nod1 ~]# killall nginx
[root@nod1 ~]# echo "test" >> /etc/nginx/nginx.conf
[root@nod1 ~]# nginx -t
nginx: [emerg] unexpected end of file, expecting ";" or "}" in /etc/nginx/nginx.conf:44
nginx: configuration file /etc/nginx/nginx.conf test failed
[root@nod1 ~]# crm status
Last updated: Wed Jul 22 22:37:42 2015
Last change: Wed Jul 22 22:28:01 2015
Stack: classic openais (with plugin)
Current DC: nod2.test.com - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
2 Resources configured
Online: [ nod1.test.com nod2.test.com ]
 Resource Group: webservice
     webip      (ocf::heartbeat:IPaddr):        Started nod2.test.com       # the resources have been moved to nod2.test.com
     webserver  (lsb:nginx):                    Started nod2.test.com
Failed actions:
    webserver_start_0 on nod1.test.com 'unknown error' (1): call=30, status=complete,
    last-rc-change='Wed Jul 22 22:37:02 2015', queued=0ms, exec=70ms
# The failed attempt to start webserver on nod1.test.com is also recorded here.