This document is a mini how-to for getting heartbeat working between two individually working LVS boxes.  It is certainly not intended to be an all-encompassing document detailing everything imaginable.  What it is intended to deliver is the 'essential steps' to getting LVS-HA functional.  You definitely should have two individually functioning boxes before even attempting this.  (Yes, go back and test your setup with each box to ensure it works!)

Another important note: I have only tested this setup with the Ultramonkey RPMs.  I don't know whether it will work with other setups, so I wouldn't trust this document unless you use the same RPMs.  (I would be interested in knowing if the HA features are the same for all 'heartbeat' setups.)

PS. - apologies if this document is RedHat-biased; I'm running on VALinux boxes that are RedHat configured.

--
--
1.) Fix the (possible) ethernet alias issue.

By now you've set up a dummy alias device on each LVS box (most likely eth0:0).  This alias device is unnecessary and potentially problematic in the HA setup.  The reason is that the heartbeat software (/etc/ha.d/resource.d/) actually creates a new eth0:0 device on the active box.  If you have an eth0:0 (or whatever) alias configured for your VIP on the standby director box, you might get a "VSbox2 kernel: Uh Oh, MAC address 00:02:B3:03:9A:13 claims to have our IP address (vip.ip.goes.here) (duplicate IP conflict likely)" error!  Not good...

If I were you I'd move your alias script out of your /etc/sysconfig/network-scripts/ directory and restart networking to clear out that alias.
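
As a rough sketch of that step: the helper below parks an alias script in a backup directory instead of deleting it.  The function name park_alias, the file name ifcfg-eth0:0 and the backup path /root/alias-backup are all assumptions for illustration; check what your alias script is actually called before moving anything.

```shell
#!/bin/sh
# Sketch only: move an interface alias script out of network-scripts
# so the VIP alias is no longer brought up with networking.
# Usage: park_alias SRC_DIR ALIAS_FILE BACKUP_DIR   (all names are assumptions)
park_alias() {
    src_dir=$1
    alias_file=$2
    backup_dir=$3
    if [ ! -f "$src_dir/$alias_file" ]; then
        echo "no $alias_file in $src_dir, nothing to do"
        return 0
    fi
    # Keep a copy around rather than deleting the script outright.
    mkdir -p "$backup_dir" && mv "$src_dir/$alias_file" "$backup_dir/"
}

# Possible use on a RedHat-style box (as root, on each director with the alias):
#   park_alias /etc/sysconfig/network-scripts ifcfg-eth0:0 /root/alias-backup
#   /etc/rc.d/init.d/network restart
#   ifconfig -a        # eth0:0 should no longer be listed
```

Keeping the script in a backup directory makes it easy to put back if you ever need to run a director stand-alone again.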

2.) Configure the /etc/ha.d/ files.

a.) authkeys

From what I have read, authkeys MUST have its permissions set to 600 or 400.  Be sure this is the case.  authkeys should contain something like:
auth 2
#1 crc
2 sha1 passwordhere
#3 md5 Hello!

Since this file must be the same on both machines, set it up on one box and scp or ftp it over to the other.
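
A minimal sketch of the permission check, assuming GNU stat is available (its -c %a option prints the octal mode).  The function name authkeys_mode_ok is made up for this example.

```shell
#!/bin/sh
# Sketch only: succeed if the given file has mode 600 or 400.
# Assumes GNU stat; the function name is hypothetical.
authkeys_mode_ok() {
    mode=$(stat -c %a "$1") || return 2
    case "$mode" in
        600|400) return 0 ;;
        *) echo "$1 has mode $mode, expected 600 or 400" >&2
           return 1 ;;
    esac
}

# Possible use (vs2.foo.com stands in for your other director):
#   chmod 600 /etc/ha.d/authkeys
#   authkeys_mode_ok /etc/ha.d/authkeys && \
#       scp -p /etc/ha.d/authkeys root@vs2.foo.com:/etc/ha.d/
```

scp -p carries the file mode across, but it does no harm to run the same check again on the receiving box.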

b.) haresources

haresources is hard to understand until you have a working setup.  The example config shows things like:
#just.linux-ha.org	135.9.216.110 http
when something like the following would be clearer:
primary.director.box.goes.here shared.resources.address.here http
#vs1.foo.com	vip.foo.com http	# <-- put actual IP down instead of vip.foo.com
vs1.foo.com IPaddr::10.10.10.10 ldirectord::ldirectord.cf		# <-- if you use ldirector like this


It's important to note that the box listed in the first field is considered the 'primary' director box and usually takes control in the event of uncertainty.  (Definitely look at nice_failback in ha.cf if you're interested in this topic.)

c.) ha.cf

ha.cf is the high-availability configuration file.  Yep, looks like the meat of the subject!  I'll just post my config, which assumes you use ttyS0 and eth0 for your links to the other director.

#       File to write debug messages to
debugfile /var/log/ha-debug
#       File to write other messages to
logfile /var/log/ha-log
#       Facility to use for syslog()/logger 
logfacility     local0
#       keepalive: how many seconds between heartbeats
keepalive 1
#       deadtime: seconds-to-declare-host-dead
deadtime 20
#       hopfudge maximum hop count minus number of nodes in config
#hopfudge 1
#       serial  serialportname ...
serial  /dev/ttyS0
#       Only for serial ports.  It applies to both PPP/UDP and "raw" ports
#       This means run PPP over ports ttyS1 and ttyS2
#       Their respective IP addresses are as listed.
#       Note that I enforce that these are local addresses.  Other addresses
#       are almost certainly a mistake.
#ppp-udp        /dev/ttyS1 10.0.0.1 /dev/ttyS2 10.0.0.2
#       Baud rate for both serial and ppp-udp ports...
baud    19200
#       What UDP port to use for udp or ppp-udp communication?
udpport 1001
#       What interfaces to heartbeat over?
udp     eth0
#       Watchdog is the watchdog timer.  If our own heart doesn't beat for
#       a minute, then our machine will reboot.
#watchdog /dev/watchdog
#       Nice_failback sets the behavior when performing a failback:
#
#       - if it's on, when the primary node starts or comes back from any
#         failure and the cluster is already active, i.e. the secondary
#         server performed a failover, the primary stays quiet, acting as a
#         secondary.  This way some operations like syncing disks can be
#         easily done.
#       - if it's off (default), the primary node will always be the primary,
#         whenever it's powered on.
nice_failback off		# <-- might want to turn this on after you get things working
#       Tell what machines are in the cluster
#       node    nodename ...    -- must match uname -n
node    vs1.foo.com	# <-- must match uname -n !
node    vs2.foo.com	# <-- must match uname -n !
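
Since the node names must match uname -n, a quick check along these lines may save some head-scratching.  This is only a sketch; the function name node_listed is made up, and it simply looks for the name on an uncommented "node" line.

```shell
#!/bin/sh
# Sketch only: succeed if NAME appears on a "node" line of the given ha.cf.
# Usage: node_listed HA_CF NAME   (function name is hypothetical)
node_listed() {
    awk -v name="$2" '
        $1 == "node" {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break        # ignore a trailing comment
                if ($i == name) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Possible use on each director:
#   node_listed /etc/ha.d/ha.cf "$(uname -n)" || echo "uname -n is not listed in ha.cf"
```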

4.) Stop ldirectord from starting, ensure heartbeat starts on reboot.

/etc/rc.d/init.d/ldirectord stop
/usr/sbin/chkconfig --level 2345 ldirectord off
/usr/sbin/chkconfig --level 345 heartbeat on # <-- run on whatever init levels you want

5.) Now the critical part: starting heartbeat and verifying functionality!  At this point ldirectord should NOT be running on either box.  If you type ipvsadm -L on either box you should get:
[root@vs1 ha.d]# ipvsadm -L
IP Virtual Server version 0.9.11 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn

Now start up heartbeat.  tail /var/log/messages and /var/log/ha-log for important log information.  My /var/log/messages looks like:

Apr 24 13:12:38 vs1 heartbeat[2070]: Configuration validated. Starting heartbeat.
Apr 24 13:12:39 vs1 heartbeat[2075]: Starting serial heartbeat on tty /dev/ttyS0
Apr 24 13:12:39 vs1 heartbeat[2075]: UDP heartbeat started on port 1001 interface eth0
Apr 24 13:12:39 vs1 heartbeat[2077]: node vs1.internal.smartbasket.com -- link eth0: status up
Apr 24 13:12:39 vs1 heartbeat[2077]: node stage-monitor -- link /dev/ttyS0: status up
Apr 24 13:12:39 vs1 heartbeat[2077]: node stage-monitor -- link eth0: status up
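
If the log is busy, something like the following can pull out just the link reports.  It is a sketch that assumes the "node X -- link Y: status up" wording shown in the log lines above; the function name links_up is made up, and other heartbeat versions may word the message differently.

```shell
#!/bin/sh
# Sketch only: print "node link" for every link heartbeat reported as up.
# Assumes the "node X -- link Y: status up" message format; name is hypothetical.
links_up() {
    sed -n 's/.*node \([^ ]*\) -- link \([^ ]*\): status up.*/\1 \2/p' "$1"
}

# Possible use:
#   links_up /var/log/messages
```

You would hope to see one line per heartbeat link (serial and ethernet) for the other director.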

And a quick check of ifconfig on the primary director shows the alias interface (eth0:0) appears.  Note that eth0:0 is *NOT* present when heartbeat isn't running.

[root@vs1 ha.d]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:02:B3:06:B6:45  
          inet addr:10.0.1.5  Bcast:10.0.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:106550 errors:0 dropped:0 overruns:0 frame:0
          TX packets:75338 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          Interrupt:10 Base address:0xd000 

eth0:0    Link encap:Ethernet  HWaddr 00:02:B3:06:B6:45  
          inet addr:10.0.1.10  Bcast:10.0.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:10 Base address:0xd000 
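
The same check can be scripted, which is handy when you start testing failover and want to know which box currently holds the VIP.  This is a sketch that assumes the "inet addr:a.b.c.d" format shown in the ifconfig output above; the function name vip_active is made up.

```shell
#!/bin/sh
# Sketch only: succeed if the given IP appears in ifconfig-style output on stdin.
# Assumes the "inet addr:a.b.c.d" output format; function name is hypothetical.
vip_active() {
    # The trailing space stops 10.0.1.1 from matching 10.0.1.10.
    grep -F -q "inet addr:$1 "
}

# Possible use (10.0.1.10 is the VIP in the example output):
#   ifconfig -a | vip_active 10.0.1.10 && echo "this box holds the VIP"
```

Run it on both directors: it should succeed on the active one only.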

A ps aux on the active director shows :

root      1648  0.0  0.1  1444  868 ttyS0    SL   13:17   0:00 /usr/lib/heartbeat/heartbeat
root      1650  0.0  0.1  1332  748 ttyS0    SL   13:17   0:00 /usr/lib/heartbeat/heartbeat
root      1651  0.0  0.1  1332  736 ttyS0    SL   13:17   0:00 /usr/lib/heartbeat/heartbeat
root      1652  0.0  0.1  1328  736 ttyS0    S    13:17   0:00 /usr/lib/heartbeat/heartbeat
root      1653  0.0  0.1  1332  732 ttyS0    SL   13:17   0:00 /usr/lib/heartbeat/heartbeat
root      1654  0.0  0.1  1328  728 ttyS0    S    13:17   0:00 /usr/lib/heartbeat/heartbeat
root      1775  0.0  0.8  5352 4388 ttyS0    S    13:17   0:00 perl /etc/ha.d/resource.d/ldirectord ldir
root      1869  0.0  0.1  2344  724 pts/0    R    13:20   0:00 ps aux


6.) Test your fail-over features, understand HA.
At this point you should test your failover functionality and learn how your setup behaves.  You also need to customize your ha.cf file to the specifications of your site.

--
--

Good luck!  Please send all updates to this document to the appropriate people for corrections!

Cheers,

Peter Mueller