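I. What Heartbeat Is
Heartbeat is part of the Linux-HA project and is widely regarded as one of the most successful open-source HA projects to date. Linux-HA stands for High-Availability Linux. The project's goal is to provide, through the joint effort of community developers, a clustering solution that improves the reliability, availability, and serviceability (RAS) of Linux; in practice it implements a high-availability cluster system. The heartbeat service and cluster communication are the two key components of an HA cluster, and in the Heartbeat project both are implemented by the heartbeat module.
Linux-HA official sites: http://www.linux-ha.org http://hg.linux-ha.org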

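II. Heartbeat Versions and Components
Note: Heartbeat has three major versions: v1.x, v2.x, and v3.x. The v1.x and v2.x releases have a very simple structure, with all modules bundled in heartbeat itself. Starting with v3, the Heartbeat project was split into several separate projects that are developed independently.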

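1. Components of Heartbeat v1.x and v2.x
heartbeat: inter-node communication and health-check module

ha-logd: cluster event logging service

CCM (Consensus Cluster Membership): cluster membership consistency module

LRM (Local Resource Manager): local resource management module

Stonith Daemon: removes a faulty node from the cluster or reboots it

CRM (Cluster Resource Manager): cluster resource management module

Cluster policy engine: the policy engine of the cluster

Cluster transition engine: the transition engine of the cluster (also called the policy execution engine)

Differences between v1.x and v2.x:

Heartbeat v1.x manages cluster resources through haresources. Heartbeat v2.x keeps haresources for compatibility with v1.x and adds a more powerful cluster resource manager, crm. crm can be managed in two ways: from the command line with crmsh, or through the graphical interface hb_gui.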

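2. Components of Heartbeat v3.x
Heartbeat:

The original messaging layer, split out as the heartbeat project. The new heartbeat is only responsible for maintaining information about the cluster nodes and the communication between them.

Cluster Glue:

An intermediate layer that ties heartbeat and pacemaker together. It mainly contains two parts: LRM and STONITH.

Resource Agent:

A collection of scripts that start and stop services and monitor their state. The LRM calls these scripts to start, stop, and monitor the various resources.

Pacemaker:

The CRM (cluster resource manager), the control center that manages the whole HA setup. Clients configure, manage, and monitor the entire cluster through pacemaker.

Pacemaker offers several management interfaces:
(1) Command-line tools: crmsh, pcs

(2) Graphical tools: pygui, hawk, LCMC, pcs

Details on the official site: http://clusterlabs.org/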

[Figure: Pacemaker's internal components and their relationships to the other modules]

[Figure: relationships among the internal components of Heartbeat v3.x]

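III. Differences Between Heartbeat Versions

Compared with v1.x, Heartbeat v2.x and later change the feature set as follows:

1. All existing features are retained.

2. Resources are monitored automatically.

By default the state of each resource is checked every 2 minutes. If a resource is found not running, Heartbeat tries to start it; if it has still not started after 60 seconds, the resource is moved to another node. Both times can be changed.

3. Resource groups can be monitored independently.

For example, with apache running on node1 and mysql running on node2, Heartbeat can monitor the services on both hosts at the same time.

4. System load is monitored as well.

Resources can be moved automatically to the node with the lower load.

5. A new resource manager, crm, is added.

With crm, heartbeat can be managed through a graphical interface, hb_gui.

Compared with v2.x, the main change in Heartbeat 3 is the split into separate projects. The last official STABLE release of the 2.x line is 2.1.4, and the first official release of Heartbeat 3 is 3.0.2. Heartbeat 3.x splits the former Heartbeat 2.x into several sub-projects by module, but the HA principles are essentially the same as in Heartbeat 2.x and the configuration is largely unchanged.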

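IV. Common Heartbeat Cluster Topologies

1. Active/passive topology

Note: in active/passive mode, one server is active for a given service while the other server stands by for that service.

[Figure: active/passive topology]

2. Active/active topology

Note: active/active mode has two variants. In the first, two different services run on the two servers, each server being active for one service and standby for the other. In the second, both nodes serve the same service at the same time and share the same data files.

[Figure: active/active topology]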
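
The first active/active variant can be expressed in an haresources file by giving each service its own resource group with a different preferred node. The sketch below only prints such a file. The second service address 10.109.134.208 and the service script names httpd and mysqld are assumptions made for illustration, not values taken from a real deployment.

```shell
# Hypothetical two-group layout: httpd normally runs on node1, mysqld on node2.
# The same file has to be installed on both nodes.
print_active_active_haresources() {
    cat <<'EOF'
node1.jack.com IPaddr::10.109.134.205/22/eth0 httpd
node2.jack.com IPaddr::10.109.134.208/22/eth0 mysqld
EOF
}

print_active_active_haresources
```

With this layout, if one node fails the surviving node is expected to take over the other group's address and service in addition to its own.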

----------------------------------------------------------------------------------

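Heartbeat is a widely used open-source high-availability cluster system for Linux. Its two main HA components are the heartbeat service and resource takeover. This part briefly describes how to install heartbeat 2.1.4 on Linux and how to configure heartbeat's three key configuration files.
For the concepts behind the heartbeat cluster components, see: HeartBeat Cluster Components Overview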

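V. Installing heartbeat

### Prepare the installation files
### Heartbeat v2 is no longer updated; its final release is 2.1.4.
### Packages for Linux 6 can be downloaded from the following link:
### Packages for the Linux 5 series can be downloaded here:
### https://dl.fedoraproject.org/pub/epel/5/x86_64/repoview/letter_h.group.html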
# rpm -Uvh PyXML-0.8.4-19.el6.x86_64.rpm
# rpm -Uvh perl-MailTools-2.04-4.el6.noarch.rpm
# rpm -Uvh perl-TimeDate-1.16-11.1.el6.noarch.rpm
# rpm -Uvh libnet-1.1.6-7.el6.x86_64.rpm
# rpm -Uvh ipvsadm-1.26-2.el6.x86_64.rpm
# rpm -Uvh lm_sensors-libs-3.1.1-17.el6.x86_64.rpm
# rpm -Uvh net-snmp-libs.x86_64.rpm

# rpm -Uvh heartbeat-pils-2.1.4-12.el6.x86_64.rpm
# rpm -Uvh heartbeat-stonith-2.1.4-12.el6.x86_64.rpm
# rpm -Uvh heartbeat-2.1.4-12.el6.x86_64.rpm

### The next two packages are optional: the Heartbeat development package and ldirectord (for LVS)
# rpm -Uvh heartbeat-devel-2.1.4-12.el6.x86_64.rpm       
# rpm -Uvh heartbeat-ldirectord-2.1.4-12.el6.x86_64.rpm  

### Verify the installed packages
# rpm -qa |grep -i heartbeat
heartbeat-2.1.4-12.el6.x86_64
heartbeat-pils-2.1.4-12.el6.x86_64
heartbeat-stonith-2.1.4-12.el6.x86_64
heartbeat-ldirectord-2.1.4-12.el6.x86_64
heartbeat-devel-2.1.4-12.el6.x86_64

### Copy the sample configuration files to /etc/ha.d and edit them as needed
# cp /usr/share/doc/heartbeat-2.1.4/ha.cf /etc/ha.d/
# cp /usr/share/doc/heartbeat-2.1.4/haresources /etc/ha.d/
# cp /usr/share/doc/heartbeat-2.1.4/authkeys /etc/ha.d/
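
The three copy commands above can be wrapped in a small helper that refuses to overwrite files that already exist. This is only a sketch: the function name copy_ha_samples and its two directory arguments are assumptions introduced here, and the paths to pass depend on where the package installed its samples.

```shell
# Copy ha.cf, haresources and authkeys from a sample directory into a
# configuration directory, skipping files that are already present.
copy_ha_samples() (
    doc_dir=$1    # e.g. /usr/share/doc/heartbeat-2.1.4
    conf_dir=$2   # e.g. /etc/ha.d
    for f in ha.cf haresources authkeys; do
        if [ ! -f "$doc_dir/$f" ]; then
            echo "missing sample: $doc_dir/$f" >&2
            exit 1
        fi
        if [ -e "$conf_dir/$f" ]; then
            echo "skip (exists): $conf_dir/$f"
        else
            cp "$doc_dir/$f" "$conf_dir/$f"
            echo "copied: $f"
        fi
    done
    # authkeys must not be readable by other users
    chmod 600 "$conf_dir/authkeys"
)
```

A call such as copy_ha_samples /usr/share/doc/heartbeat-2.1.4 /etc/ha.d would then replace the three cp commands.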

VI. Configuring heartbeat

heartbeat is configured mainly through three files: ha.cf, authkeys, and haresources.

1. ha.cf

Annotated sample file
[root@node1 ha.d]# more ha.cf
#
# There are lots of options in this file. All you have to have is a set
# of nodes listed {"node ...} one of {serial, bcast, mcast, or ucast},
# and a value for "auto_failback".
#
# ATTENTION: As the configuration file is read line by line,
# THE ORDER OF DIRECTIVE MATTERS!
#
# In particular, make sure that the udpport, serial baud rate
# etc. are set before the heartbeat media are defined!
# debug and log file directives go into effect when they
# are encountered.
#
# In other words, set the port number and baud rate before defining the
# NICs or serial ports that carry the heartbeat.
#
# All will be fine if you keep them ordered as in this example.
#
# Note on logging:
# If all of debugfile, logfile and logfacility are not defined,
# logging is the same as use_logd yes. In other case, they are
# respectively effective. if detering the logging to syslog,
# logfacility must be "none".
#
# File to write debug messages to
#debugfile /var/log/ha-debug
#
#
# File to write other messages to
#
logfile /var/log/heartbeat.log
#
#
# Facility to use for syslog()/logger
# Note: enabling this together with logfile is usually not recommended.
#logfacility local0
#
#
# A note on specifying "how long" times below...
#
# The default time unit is seconds
# 10 means ten seconds
#
# You can also specify them in milliseconds
# 1500ms means 1.5 seconds
#
#
# keepalive: how long between heartbeats?
keepalive 2
#
# deadtime: how long-to-declare-host-dead?
#
# If you set this too low you will get the problematic
# split-brain (or cluster partition) problem.
# See the FAQ for how to use warntime to tune deadtime.
#deadtime 30
#
# warntime: how long before issuing "late heartbeat" warning?
# See the FAQ for how to use warntime to tune deadtime.
#
#warntime 10
#
#
# Very first dead time (initdead)
#
# On some machines/OSes, etc. the network takes a while to come up
# and start working right after you've been rebooted. As a result
# we have a separate dead time for when things first come up.
# It should be at least twice the normal dead time.
#
#initdead 120
#
#
# What UDP port to use for bcast/ucast communication?
#
#udpport 694
#
# Baud rate for serial ports...
#
#baud 19200
#
# serial serialportname ...
#serial /dev/ttyS0 # Linux
#serial /dev/cuaa0 # FreeBSD
#serial /dev/cuad0 # FreeBSD 6.x
#serial /dev/cua/a # Solaris
#
#
# What interfaces to broadcast heartbeats over?
#
#bcast eth0 # Linux
bcast eth0 # Linux
#bcast eth1 eth2 # Linux
#bcast le0 # Solaris
#bcast le1 le2 # Solaris
#
# Set up a multicast heartbeat medium
# mcast [dev] [mcast group] [port] [ttl] [loop]
#
# [dev] device to send/rcv heartbeats on
# [mcast group] multicast group to join (class D multicast address
# 224.0.0.0 - 239.255.255.255)
# [port] udp port to sendto/rcvfrom (set this value to the
# same value as "udpport" above)
# [ttl] the ttl value for outbound heartbeats. this effects
# how far the multicast packet will propagate. (0-255)
# Must be greater than zero.
# [loop] toggles loopback for outbound multicast heartbeats.
# if enabled, an outbound packet will be looped back and
# received by the interface it was sent on. (0 or 1)
# Set this value to zero.
#
#mcast eth0 225.0.0.1 694 1 0
#
# Set up a unicast / udp heartbeat medium
# ucast [dev] [peer-ip-addr]
#
# [dev] device to send/rcv heartbeats on
# [peer-ip-addr] IP address of peer to send packets to
#
#
#ucast eth0 192.168.1.2
#
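# Broadcast, unicast, and multicast each have advantages and drawbacks.
# Unicast is mostly used with two nodes, but then the two nodes cannot
# share an identical ha.cf, because the peer IP address differs.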
#
#
# About boolean values...
#
# Any of the following case-insensitive values will work for true:
# true, on, yes, y, 1
# Any of the following case-insensitive values will work for false:
# false, off, no, n, 0
#
#
#
#
# auto_failback: determines whether a resource will
# automatically fail back to its "primary" node, or remain
# on whatever node is serving it until that node fails, or
# an administrator intervenes.
#
# The possible values for auto_failback are:
# on - enable automatic failbacks
# off - disable automatic failbacks
# legacy - enable automatic failbacks in systems
# where all nodes do not yet support
# the auto_failback option.
#
# auto_failback "on" and "off" are backwards compatible with the old
# "nice_failback on" setting.
#
# See the FAQ for information on how to convert
# from "legacy" to "on" without a flash cut.
# (i.e., using a "rolling upgrade" process)
#
# The default value for auto_failback is "legacy", which
# will issue a warning at startup. So, make sure you put
# an auto_failback directive in your ha.cf file.
# (note: auto_failback can be any boolean or "legacy")
#
auto_failback on
#
#
# Basic STONITH support
# Using this directive assumes that there is one stonith
# device in the cluster. Parameters to this device are
# read from a configuration file. The format of this line is:
#
# stonith <stonith_type> <configfile>
#
# NOTE: it is up to you to maintain this file on each node in the
# cluster!
#
#stonith baytech /etc/ha.d/conf/stonith.baytech
#
# STONITH support
# You can configure multiple stonith devices using this directive.
# The format of the line is:
# stonith_host <hostfrom> <stonith_type> <params...>
# <hostfrom> is the machine the stonith device is attached
# to or * to mean it is accessible from any host.
# <stonith_type> is the type of stonith device (a list of
# supported drives is in /usr/lib/stonith.)
# <params...> are driver specific parameters. To see the
# format for a particular device, run:
# stonith -l -t <stonith_type>
#
#
# Note that if you put your stonith device access information in
# here, and you make this file publically readable, you're asking
# for a denial of service attack ;-)
#
# To get a list of supported stonith devices, run
# stonith -L
# For detailed information on which stonith devices are supported
# and their detailed configuration options, run this command:
# stonith -h
#
#stonith_host * baytech 10.0.0.3 mylogin mysecretpassword
#stonith_host ken3 rps10 /dev/ttyS1 kathy 0
#stonith_host kathy rps10 /dev/ttyS1 ken3 0
#
# Watchdog is the watchdog timer. If our own heart doesn't beat for
# a minute, then our machine will reboot.
# NOTE: If you are using the software watchdog, you very likely
# wish to load the module with the parameter "nowayout=0" or
# compile it without CONFIG_WATCHDOG_NOWAYOUT set. Otherwise even
# an orderly shutdown of heartbeat will trigger a reboot, which is
# very likely NOT what you want.
#
#watchdog /dev/watchdog
#
# Tell what machines are in the cluster
# node nodename ... -- must match uname -n
#
#node ken3
#node kathy
node node1.jack.com
node node2.jack.com
#
# Less common options...
#
# Treats 10.10.10.254 as a psuedo-cluster-member
# Used together with ipfail below...
# note: don't use a cluster node as ping node
# Typically the gateway is used as the ping node.
# The ping node counts as a quorum vote when the cluster is re-formed.
#
#ping 10.10.10.254
ping 10.109.132.1
#
# Treats 10.10.10.254 and 10.10.10.253 as a psuedo-cluster-member
# called group1. If either 10.10.10.254 or 10.10.10.253 are up
# then group1 is up
# Used together with ipfail below...
#
#ping_group group1 10.10.10.254 10.10.10.253
#
# HBA ping derective for Fiber Channel
# Treats fc-card-name as psudo-cluster-member
# used with ipfail below ...
#
# You can obtain HBAAPI from http://hbaapi.sourceforge.net. You need
# to get the library specific to your HBA directly from the vender
# To install HBAAPI stuff, all You need to do is to compile the common
# part you obtained from the sourceforge. This will produce libHBAAPI.so
# which you need to copy to /usr/lib. You need also copy hbaapi.h to
# /usr/include.
#
# The fc-card-name is the name obtained from the hbaapitest program
# that is part of the hbaapi package. Running hbaapitest will produce
# a verbose output. One of the first line is similar to:
# Apapter number 0 is named: qlogic-qla2200-0
# Here fc-card-name is qlogic-qla2200-0.
#
#hbaping fc-card-name
#
# Processes started and stopped with heartbeat. Restarted unless
# they exit with rc=100
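# These entries name processes that heartbeat starts, supervises, and
# restarts if they fail.
# ipfail detects and handles network failures; it relies on the ping nodes
# defined by the ping directive to test network connectivity.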
#
#respawn userid /path/name/to/run
#respawn hacluster /usr/lib/heartbeat/ipfail
#
# Access control for client api
# default is no access
#
#apiauth client-name gid=gidlist uid=uidlist
#apiauth ipfail gid=haclient uid=hacluster

######################################
#
# Unusual options.
#
######################################
#
# hopfudge maximum hop count minus number of nodes in config
#hopfudge 1
#
# deadping - dead time for ping nodes
#deadping 30
#
# hbgenmethod - Heartbeat generation number creation method
# Normally these are stored on disk and incremented as needed.
#hbgenmethod time
#
# realtime - enable/disable realtime execution (high priority, etc.)
# defaults to on
#realtime off
#
# debug - set debug level
# defaults to zero
#debug 1
#
# API Authentication - replaces the fifo-permissions-based system of the past
#
# You can put a uid list and/or a gid list.
# If you put both, then a process is authorized if it qualifies under either
# the uid list, or under the gid list.
#
# The groupname "default" has special meaning. If it is specified, then
# this will be used for authorizing groupless clients, and any client groups
# not otherwise specified.
#
# There is a subtle exception to this. "default" will never be used in the
# following cases (actual default auth directives noted in brackets)
# ipfail (uid=HA_CCMUSER)
# ccm (uid=HA_CCMUSER)
# ping (gid=HA_APIGROUP)
# cl_status (gid=HA_APIGROUP)
#
# This is done to avoid creating a gaping security hole and matches the most
# likely desired configuration.
#
#apiauth ipfail uid=hacluster
#apiauth ccm uid=hacluster
#apiauth cms uid=hacluster
#apiauth ping gid=haclient uid=alanr,root
#apiauth default gid=haclient

# message format in the wire, it can be classic or netstring,
# default: classic
#msgfmt classic/netstring

# Do we use logging daemon?
# If logging daemon is used, logfile/debugfile/logfacility in this file
# are not meaningful any longer. You should check the config file for logging
# daemon (the default is /etc/logd.cf)
# more infomartion can be fould in http://www.linux-ha.org/ha_2ecf_2fUseLogdDirective
# Setting use_logd to "yes" is recommended
#
# use_logd yes/no
#
# the interval we reconnect to logging daemon if the previous connection failed
# default: 60 seconds
#conn_logd_time 60
#
#
# Configure compression module
# It could be zlib or bz2, depending on whether u have the corresponding
# library in the system.
#compression bz2
#
# Confiugre compression threshold
# This value determines the threshold to compress a message,
# e.g. if the threshold is 1, then any message with size greater than 1 KB
# will be compressed, the default is 2 (KB)
#compression_threshold 2

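This is heartbeat's main configuration file. It covers roughly the following:

log level and log file locations;

heartbeat interval, warning time, dead time, initial dead time, and so on;

heartbeat communication method, IP addresses, port number, serial device, baud rate, and so on;

node names, fencing method, and so on.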
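
As a rough illustration of how these settings fit together, the sketch below prints a minimal ha.cf for the two lab nodes and rejects timing values that contradict the sample file's guidance. The function name render_ha_cf is hypothetical. The rule that initdead should be at least twice deadtime comes from the sample comments; the rule that warntime should be smaller than deadtime is an assumption added here as a sanity check. Only whole seconds are handled, not the ms suffix.

```shell
# Print a minimal ha.cf. Order matters: udpport is set before the bcast medium.
# Arguments are whole seconds: keepalive warntime deadtime initdead
render_ha_cf() (
    keepalive=$1; warntime=$2; deadtime=$3; initdead=$4
    # Assumption: the "late heartbeat" warning should fire before the node is declared dead.
    if [ "$warntime" -ge "$deadtime" ]; then
        echo "warntime must be smaller than deadtime" >&2
        exit 1
    fi
    # From the sample file: initdead should be at least twice the normal dead time.
    if [ "$initdead" -lt $((deadtime * 2)) ]; then
        echo "initdead must be at least 2 x deadtime" >&2
        exit 1
    fi
    cat <<EOF
logfile /var/log/heartbeat.log
keepalive $keepalive
warntime $warntime
deadtime $deadtime
initdead $initdead
udpport 694
bcast eth0
auto_failback on
node node1.jack.com
node node2.jack.com
ping 10.109.132.1
EOF
)

render_ha_cf 2 10 30 120
```

The values 2, 10, 30, and 120 are the ones that appear (mostly commented out) in the sample file; redirecting the output to /etc/ha.d/ha.cf is left to the reader.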

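2. authkeys: authentication settings

This file configures heartbeat's authentication. Three methods are available: crc, md5, and sha1.
Security increases in that order, and so does the system resource usage.
crc is the least secure and suits only physically secure networks; sha1 provides the strongest authentication and uses the most resources.
The authkeys file must have permission 600 (that is, -rw-------). Set it with: chmod 600 authkeys

The file uses the following format:
auth <number>
<number> <authmethod> [<authkey>]
Examples:
    auth 1
    1 sha1 key-for-sha1
    The key key-for-sha1 can be any string; the number on the auth line must match the number on the method line.

    auth 2
    2 crc
    The crc method takes no key.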
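
A minimal sketch of generating such a file with a random sha1 key and the required permissions follows. The function name write_authkeys and the choice of 20 random bytes are assumptions made for illustration; any sufficiently random string works as the key.

```shell
# Write an authkeys file that uses method id 1 (sha1) with a random key,
# then restrict it to mode 600 as heartbeat requires.
write_authkeys() (
    target=$1   # e.g. /etc/ha.d/authkeys
    # 20 random bytes rendered as 40 hexadecimal characters
    key=$(head -c 20 /dev/urandom | od -An -tx1 | tr -d ' \n')
    umask 077   # create the file without group/other permissions
    printf 'auth 1\n1 sha1 %s\n' "$key" > "$target"
    chmod 600 "$target"
)
```

Running write_authkeys /etc/ha.d/authkeys on one node and copying the result to the other keeps the key identical on both sides.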

Annotated sample file
[root@node1 ha.d]# more authkeys
#
#       Authentication file.  Must be mode 600
#
#
#       Must have exactly one auth directive at the front.
#       auth    send authentication using this method-id
#
#       Then, list the method and key that go with that method-id
#
#       Available methods: crc sha1, md5.  Crc doesn't need/want a key.
#
#       You normally only have one authentication method-id listed in this file
#
#       Put more than one to make a smooth transition when changing auth
#       methods and/or keys.
#
#
#       sha1 is believed to be the "best", md5 next best.
#
#       crc adds no security, except from packet corruption.
#               Use only on physically secure networks.
#
#auth 1
#1 crc
#2 sha1 HI!
#3 md5 Hello!

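3. haresources: resource configuration

The haresources file specifies the preferred node, the cluster IP, netmask, broadcast address, and the services to start.
Each line uses the following format:
    node-name network-config resource-group
node-name: the name of a cluster node; it must match a host name set by a node directive in ha.cf.
network-config: network settings, including the cluster IP, netmask, and broadcast address.
resource-group: the cluster services managed by heartbeat, that is, the services Heartbeat starts and stops.
Every service that heartbeat takes over must be available as a script that accepts start and stop, placed in /etc/init.d/
or /etc/ha.d/resource.d/. Heartbeat (TE) looks up the script by name in these directories and runs it to start or stop the service.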

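Example:
node1 IPaddr::192.168.21.10/24/eth0 Filesystem::/dev/sdb2::/webdata::ext3 httpd tomcat

node1:
    The node name.

IPaddr::192.168.21.10/24/eth0
    IPaddr is a script shipped with heartbeat, located in /etc/ha.d/resource.d.
    Heartbeat runs /etc/ha.d/resource.d/IPaddr 192.168.21.10/24/eth0 start,
    which brings up the virtual address 192.168.21.10 with netmask 255.255.255.0.
    This is the address on which Heartbeat offers the service to clients; it is bound to the network interface eth0.

Filesystem::/dev/sdb2::/webdata::ext3
    Filesystem is a script shipped with heartbeat, located in /etc/ha.d/resource.d.
    It mounts the shared disk partition, equivalent to running mount -t ext3 /dev/sdb2 /webdata on the command line.

httpd tomcat
    Starts httpd and then tomcat, in that order.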
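
To make the :: convention concrete, the sketch below turns one resource entry into the script name followed by its arguments. The helper name resource_to_command is hypothetical; heartbeat does this splitting internally and then appends start or stop.

```shell
# Turn one haresources resource entry into "script arg1 arg2 ...".
# Only the "::" delimiter is replaced, so a single ":" (as in an NFS
# export such as host:/path) is left untouched.
resource_to_command() {
    printf '%s\n' "$1" | sed 's/::/ /g'
}

resource_to_command "Filesystem::/dev/sdb2::/webdata::ext3"
# prints: Filesystem /dev/sdb2 /webdata ext3
resource_to_command "IPaddr::192.168.21.10/24/eth0"
# prints: IPaddr 192.168.21.10/24/eth0
```

Appending start to the printed line gives the command shape described in the example, for instance IPaddr 192.168.21.10/24/eth0 start.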

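Note: with several network interfaces on different subnets, the VIP is normally bound as an alias on the interface that is in the same subnet as the VIP.
For example, with eth0 at 172.16.100.6, eth1 at 192.168.0.6, and the VIP 172.16.100.5, the VIP is bound to eth0 because the two addresses are in the same subnet. The interface is selected by /usr/lib64/heartbeat/findif.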
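
The same-subnet reasoning in this note can be reproduced with a little shell arithmetic. The sketch below is not how findif is implemented (it works from the routing table); it only illustrates the idea. The /24 prefix is an assumption, since the note does not state a netmask, and the function names are hypothetical.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
    IFS=.
    set -- $1                      # split the address into four fields
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if two addresses share the same network for a given prefix length.
same_subnet() (
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    mask=$(( (4294967295 << (32 - $3)) & 4294967295 ))
    [ $(( a & mask )) -eq $(( b & mask )) ]
)

vip=172.16.100.5
for entry in eth0:172.16.100.6 eth1:192.168.0.6; do
    dev=${entry%%:*}
    addr=${entry#*:}
    if same_subnet "$addr" "$vip" 24; then
        echo "$vip -> $dev"
    fi
done
# prints: 172.16.100.5 -> eth0
```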

Annotated sample file
[root@node1 ha.d]# more haresources
#
#       This is a list of resources that move from machine to machine as
#       nodes go down and come up in the cluster.  Do not include
#       "administrative" or fixed IP addresses in this file.
#
# <VERY IMPORTANT NOTE>
#       The haresources files MUST BE IDENTICAL on all nodes of the cluster.
#
#       The node names listed in front of the resource group information
#       is the name of the preferred node to run the service.  It is
#       not necessarily the name of the current machine.  If you are running
#       auto_failback ON (or legacy), then these services will be started
#       up on the preferred nodes - any time they're up.
#
#       If you are running with auto_failback OFF, then the node information
#       will be used in the case of a simultaneous start-up, or when using
#       the hb_standby {foreign,local} command.
#
#       BUT FOR ALL OF THESE CASES, the haresources files MUST BE IDENTICAL.
#       If your files are different then almost certainly something
#       won't work right.
# </VERY IMPORTANT NOTE>
#
#
#       We refer to this file when we're coming up, and when a machine is being
#       taken over after going down.
#
#       You need to make this right for your installation, then install it in
#       /etc/ha.d
#
#       Each logical line in the file constitutes a "resource group".
#       A resource group is a list of resources which move together from
#       one node to another - in the order listed.  It is assumed that there
#       is no relationship between different resource groups.  These
#       resource in a resource group are started left-to-right, and stopped
#       right-to-left.  Long lists of resources can be continued from line
#       to line by ending the lines with backslashes ("\").
#
#       These resources in this file are either IP addresses, or the name
#       of scripts to run to "start" or "stop" the given resource.
#
#       The format is like this:
#
#node-name resource1 resource2 ... resourceN
#
#
#       If the resource name contains an :: in the middle of it, the
#       part after the :: is passed to the resource script as an argument.
#       Multiple arguments are separated by the :: delimeter
#
#       In the case of IP addresses, the resource script name IPaddr is
#       implied.
#
#       For example, the IP address 135.9.8.7 could also be represented
#       as IPaddr::135.9.8.7
#
#       THIS IS IMPORTANT!!     vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#
#       The given IP address is directed to an interface which has a route
#       to the given address.  This means you have to have a net route
#       set up outside of the High-Availability structure.  We don't set it
#       up here -- we key off of it.
#
#       The broadcast address for the IP alias that is created to support
#       an IP address defaults to the highest address on the subnet.
#
#       The netmask for the IP alias that is created defaults to the same
#       netmask as the route that it selected in in the step above.
#
#       The base interface for the IPalias that is created defaults to the
#       same netmask as the route that it selected in in the step above.
#
#       If you want to specify that this IP address is to be brought up
#       on a subnet with a netmask of 255.255.255.0, you would specify
#       this as IPaddr::135.9.8.7/24 .
#
#       If you wished to tell it that the broadcast address for this subnet
#       was 135.9.8.210, then you would specify that this way:
#               IPaddr::135.9.8.7/24/135.9.8.210
#
#       If you wished to tell it that the interface to add the address to
#       is eth0, then you would need to specify it this way:
#               IPaddr::135.9.8.7/24/eth0
#
#       And this way to specify both the broadcast address and the
#       interface:
#               IPaddr::135.9.8.7/24/eth0/135.9.8.210
#
#       The IP addresses you list in this file are called "service" addresses,
#       since they're they're the publicly advertised addresses that clients
#       use to get at highly available services.
#
#       For a hot/standby (non load-sharing) 2-node system with only
#       a single service address,
#       you will probably only put one system name and one IP address in here.
#       The name you give the address to is the name of the default "hot"
#       system.
#
#       Where the nodename is the name of the node which "normally" owns the
#       resource.  If this machine is up, it will always have the resource
#       it is shown as owning.
#
#       The string you put in for nodename must match the uname -n name
#       of your machine.  Depending on how you have it administered, it could
#       be a short name or a FQDN.
#
#-------------------------------------------------------------------
#
#       Simple case: One service address, default subnet and netmask
#               No servers that go up and down with the IP address
#
#just.linux-ha.org      135.9.216.110
#
#-------------------------------------------------------------------
#
#       Assuming the adminstrative addresses are on the same subnet...
#       A little more complex case: One service address, default subnet
#       and netmask, and you want to start and stop http when you get
#       the IP address...
#
#just.linux-ha.org      135.9.216.110 http
#-------------------------------------------------------------------
#
#       A little more complex case: Three service addresses, default subnet
#       and netmask, and you want to start and stop http when you get
#       the IP address...
#
#just.linux-ha.org      135.9.216.110 135.9.215.111 135.9.216.112 httpd
#-------------------------------------------------------------------
#
#       One service address, with the subnet, interface and bcast addr
#       explicitly defined.
#
#just.linux-ha.org      135.9.216.3/28/eth0/135.9.216.12 httpd
#
#-------------------------------------------------------------------
#
#       An example where a shared filesystem is to be used.
#       Note that multiple aguments are passed to this script using
#       the delimiter '::' to separate each argument.
#
#node1  10.0.0.170 Filesystem::/dev/sda1::/data1::ext2
#
#       Regarding the node-names in this file:
#
#       They must match the names of the nodes listed in ha.cf, which in turn
#       must match the `uname -n` of some node in the cluster.  So they aren't
#       virtual in any sense of the word.
#

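VII. Other Settings Needed by the Cluster (details omitted)

a. Configure host name resolution.
b. Configure mutual SSH trust (passwordless login) between the nodes.
c. Configure the highly available services (such as httpd and mysqld) and disable their automatic start at boot.
d. If shared storage is needed, configure the storage system as well.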

VIII. Lab

Environment:

node1.jack.com 10.109.134.206 Linux 6.0, 64-bit

node2.jack.com 10.109.134.207 Linux 6.0, 64-bit

1. Configure the IP addresses.

2. Set the host name:

[root@node1 ~]# hostname node1.jack.com
[root@node1 ~]# vim /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=node1.jack.com

3. Verify with uname -n.

4. Set up mutual SSH trust between the two hosts:

[root@node1 ~]# ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''    # empty passphrase
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
79:66:5d:25:de:69:9e:76:97:42:47:32:f8:b1:14:e7 root@node1.jack.com
The key's randomart image is:
+--[ RSA 2048]----+
| .+oo.|
| . +*+.|
| o.=E.|
| . ..++ o|
| S + .. =o|
| + o o|
| |
+-----------------+

[root@node1 ~]# ls .ssh/id_rsa
.ssh/id_rsa

[root@node1 ~]# ssh-copy-id -i .ssh/id_rsa.pub root@10.109.134.207
The authenticity of host '10.109.134.207 (10.109.134.207)' can't be established.
RSA key fingerprint is da:6b:6d:ad:80:20:3b:5e:c4:32:28:15:5e:4b:a0:40.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.109.134.207' (RSA) to the list of known hosts.
root@10.109.134.207's password:
Now try logging into the machine, with "ssh 'root@10.109.134.207'", and check in:
.ssh/authorized_keys
to make sure we haven't added extra keys that you weren't expecting.

[root@node1 ~]# ifconfig
eth1 Link encap:Ethernet HWaddr 00:50:56:B9:7A:17
inet addr:10.109.134.206 Bcast:10.109.135.255 Mask:255.255.252.0
...

[root@node1 ~]# ssh 10.109.134.207 'ifconfig'
eth1 Link encap:Ethernet HWaddr 00:50:56:B9:22:FA
inet addr:10.109.134.207 Bcast:10.109.135.255 Mask:255.255.252.0
...

5. Host name resolution:

[root@node1 ~]# vim /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.109.134.206 node1.jack.com node1
10.109.134.207 node2.jack.com node2

[root@node1 ~]# scp /etc/hosts node2:/etc/
hosts 100% 236 0.2KB/s 00:00

[root@node1 ~]# ping node2
PING node2.jack.com (10.109.134.207) 56(84) bytes of data.
64 bytes from node2.jack.com (10.109.134.207): icmp_seq=1 ttl=64 time=0.558 ms

[root@node1 ~]# iptables -F    # flush all rules
[root@node1 ~]# iptables -L    # list the rules currently in effect
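
Because the names in ha.cf and haresources must resolve on both nodes, it can help to check the hosts file before going further. The helper below is a sketch; the name hosts_has_names is hypothetical, and it only looks for exact name matches on non-comment lines.

```shell
# Succeed only if every given name appears as a host name or alias
# on a non-comment line of the hosts file passed as first argument.
hosts_has_names() (
    file=$1
    shift
    for name in "$@"; do
        awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$file" || { echo "not in $file: $name" >&2; exit 1; }
    done
)
```

For the lab setup, hosts_has_names /etc/hosts node1.jack.com node2.jack.com node1 node2 should succeed on both nodes once the file has been copied.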

6. Time synchronization:

# service ntpd stop
# ntpdate 10.109.131.132
# chkconfig ntpd off
# service ntpd status
ntpd is stopped
# chkconfig --list |grep ntpd
ntpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off

# crontab -e
*/5 * * * * /sbin/ntpdate 10.109.131.132 &> /dev/null

# scp /var/spool/cron/root node1:/var/spool/cron/    # copy the crontab to the other host
root 100% 54 0.1KB/s 00:00

At this point, take an image of the system.

7. Install the HA software:

[root@node1 ha]# yum --nogpgcheck localinstall perl-MailTools-2.04-4.el6.noarch.rpm
[root@node1 ha]# yum --nogpgcheck localinstall libnet-1.1.6-7.el6.x86_64.rpm
[root@node1 ha]# yum --nogpgcheck localinstall heartbeat-pils-2.1.4-12.el6.x86_64.rpm
[root@node1 ha]# yum --nogpgcheck localinstall heartbeat-stonith-2.1.4-12.el6.x86_64.rpm
[root@node1 ha]# yum --nogpgcheck localinstall heartbeat-2.1.4-12.el6.x86_64.rpm
[root@node1 ha]# yum --nogpgcheck localinstall heartbeat-gui-2.1.4-12.el6.x86_64.rpm

heartbeat-debuginfo-2.1.4-12.el6.x86_64.rpm
heartbeat-devel-2.1.4-12.el6.x86_64.rpm
heartbeat-ldirectord-2.1.4-12.el6.x86_64.rpm

[root@node1 ha]# scp perl-MailTools-2.04-4.el6.noarch.rpm heartbeat-2.1.4-12.el6.x86_64.rpm heartbeat-pils-2.1.4-12.el6.x86_64.rpm heartbeat-stonith-2.1.4-12.el6.x86_64.rpm libnet-1.1.6-7.el6.x86_64.rpm heartbeat-gui-2.1.4-12.el6.x86_64.rpm node2:/root/ha/

[root@node2 ha]# yum --nogpgcheck localinstall *.rpm

8. Configure and start the heartbeat service with haresources (the alternative is CRM, which comes with a graphical interface)

[root@node1 ha]# cd /etc/ha.d/
[root@node1 ha.d]# cp /usr/share/doc/heartbeat-2.1.4/{authkeys,ha.cf,haresources} ./

[root@node1 ~]# dd if=/dev/random count=1 bs=512 | md5sum    # generate a random string
0+1 records in
0+1 records out
16 bytes (16 B) copied, 5.1656e-05 s, 310 kB/s
5a7d268d0bd83ba4c60ff641565ad7b8 -

[root@node1 ha.d]# vim authkeys
# use the method with id 1; its key is the random string generated above
auth 1
1 md5 5a7d268d0bd83ba4c60ff641565ad7b8

[root@node1 ha.d]# vim ha.cf
logfile /var/log/heartbeat.log
bcast eth0 # Linux
node node1.jack.com
node node2.jack.com
ping 10.109.132.1

[root@node1 ha.d]# vim haresources

[root@node1 ha.d]# ls /usr/lib64/heartbeat/    # many heartbeat management commands live here
...
clmtest findif ipctest pingd utillib.sh
crm_commands.py ha_config ipctransientclient plugins

[root@node1 ~]# yum -y install httpd
[root@node1 ~]# echo "<h1>node1.jack.com</h1>" >> /var/www/html/index.html
[root@node1 ~]# service httpd start    # after starting httpd, confirm that web access works
[root@node1 ~]# service httpd stop
Stopping httpd: [ OK ]
[root@node1 ~]# chkconfig httpd off    # httpd must not start automatically at boot

[root@node1 ~]# vim /etc/ha.d/haresources
...
# add the following line
node1.jack.com IPaddr::10.109.134.205/22/eth0 httpd

[root@node1 ha.d]# scp -p authkeys haresources ha.cf node2:/etc/ha.d/
authkeys 100% 691 0.7KB/s 00:00
haresources 100% 5957 5.8KB/s 00:00
ha.cf 100% 10KB 10.4KB/s 00:00

[root@node1 ha.d]# service heartbeat start
Starting High-Availability services:
2017/09/07_15:42:31 INFO: Resource is stopped
Done.

[root@node1 ha.d]# ssh node2 'service heartbeat start'    # start heartbeat on the other node remotely
Starting High-Availability services:
2017/09/07_15:43:23 INFO: Resource is stopped
Done.
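
Once heartbeat is running on both nodes, one simple check is whether the service address from haresources has come up on the preferred node. The sketch below reads the output of ip -o -4 addr show from standard input; the helper name vip_is_up is hypothetical, and 10.109.134.205 is the VIP configured in this lab.

```shell
# Read "ip -o -4 addr show" style output on stdin and succeed if the
# given IPv4 address is configured on any interface.
vip_is_up() {
    grep -qF "inet $1/"
}

# Typical use on a cluster node (not executed here):
#   ip -o -4 addr show | vip_is_up 10.109.134.205 && echo "VIP is on this node"
```

Stopping heartbeat on the node that holds the VIP and repeating the check on the other node is a quick way to confirm that takeover works.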

9. Set up the NFS share

# cat /etc/exports    # this host's IP: 10.109.134.204
/web/htdocs 10.109.132.0/22(ro)

# mkdir /web/htdocs
mkdir: cannot create directory `/web/htdocs': No such file or directory
# mkdir /web/htdocs -pv
mkdir: created directory `/web'
mkdir: created directory `/web/htdocs'
# vim /etc/exports
# service nfs restart

# setenforce 0
# mount 10.109.134.204:/web/htdocs /mnt
# cat /mnt/index.html
nfs server

# setenforce 0

[root@node1 ha.d]# vim haresources    # this host's IP: 10.109.134.206
...
node1.jack.com IPaddr::10.109.134.205/22/eth0 Filesystem::10.109.134.204:/web/htdocs::/var/www/html::nfs httpd

# mount    # this host's IP: 10.109.134.204
...
nfsd on /proc/fs/nfsd type nfsd (rw)

---end---

This article comes from the "风过无痕" blog; please keep this source link: http://wangfx.blog.51cto.com/1697877/1963712
