This document is a mini how-to on getting heartbeat working between two individually working LVS boxes. It is certainly not intended to be an all-encompassing document detailing everything imaginable. What it is intended to deliver is the 'essential steps' to getting LVS-HA functional. You definitely should have two individually functioning boxes before even attempting this. (Yes, go back and test your setup with each box to ensure it works!)

Another important note: I have only tested this setup with Ultramonkey RPMs. I don't know if your setup will work, and I wouldn't trust this document unless you use the same RPMs. (I would be interested in knowing if the HA features are the same for all 'heartbeat' setups.)

PS. - Apologies if this document is RedHat biased; I'm running on VALinux boxes that are RedHat configured.

-- --

1.) Fix the (possible) ethernet alias issue.

By now you've set up a dummy alias device on each LVS box (most likely eth0:0). This alias device is unnecessary and potentially problematic in the HA setup. The reason is that the heartbeat software (/etc/ha.d/resource.d/) actually creates a new eth0:0 device on the active box. If you have an eth0:0 (or whatever) alias configured for your VIP on the standby director box, you might get an error like this:

VSbox2 kernel: Uh Oh, MAC address 00:02:B3:03:9A:13 claims to have our IP address (vip.ip.goes.here) (duplicate IP conflict likely)

Not good... If I were you, I'd move your alias script out of your /etc/sysconfig/network-scripts/ directory and restart networking to clear out that alias.

2.) Configure the /etc/ha.d/ files.

a.) authkeys

From what I have read, authkeys MUST have its permissions set to 600 or 400. Be sure this is the case. authkeys should contain something like:

auth 2
#1 crc
2 sha1 passwordhere
#3 md5 Hello!

Since this file must be the same on both machines, set it up on one box and scp or ftp it over to the other.

b.) haresources

haresources is hard to understand until you have a working setup. The example config shows things like:

#just.linux-ha.org 135.9.216.110 http

when something like this would be clearer:

primary.director.box.goes.here shared.resources.address.here http
#vs1.foo.com vip.foo.com http    # <-- put actual IP down instead of vip.foo.com
vs1.foo.com IPaddr::10.10.10.10 ldirectord::ldirectord.cf    # <-- if you use ldirectord like this

It's important to note that the box listed in the first field is considered the 'primary' director box and usually takes control in the event of uncertainty. (Definitely look at nice_failback in ha.cf if you're interested in this behavior.)

c.) ha.cf

The high-availability configuration file. Yep, this is the meat of the subject! I'll just post my config, which assumes you use ttyS0 and eth0 for your links to the other director.

# File to write debug messages to
debugfile /var/log/ha-debug

# File to write other messages to
logfile /var/log/ha-log

# Facility to use for syslog()/logger
logfacility local0

# keepalive: how many seconds between heartbeats
keepalive 1

# deadtime: seconds-to-declare-host-dead
deadtime 20

# hopfudge maximum hop count minus number of nodes in config
#hopfudge 1

# serial serialportname ...
serial /dev/ttyS0

# Only for serial ports. It applies to both PPP/UDP and "raw" ports
# This means run PPP over ports ttyS1 and ttyS2
# Their respective IP addresses are as listed.
# Note that I enforce that these are local addresses. Other addresses
# are almost certainly a mistake.
#ppp-udp /dev/ttyS1 10.0.0.1 /dev/ttyS2 10.0.0.2

# Baud rate for both serial and ppp-udp ports...
baud 19200

# What UDP port to use for udp or ppp-udp communication?
udpport 1001

# What interfaces to heartbeat over?
udp eth0

# Watchdog is the watchdog timer. If our own heart doesn't beat for
# a minute, then our machine will reboot.
#watchdog /dev/watchdog

# Nice_failback sets the behavior when performing a failback:
#
# - if it's on, when the primary node starts or comes back from any
#   failure and the cluster is already active, i.e. the secondary
#   server performed a failover, the primary stays quiet, acting as a
#   secondary. This way some operations like syncing disks can be
#   easily done.
# - if it's off (default), the primary node will always be the primary,
#   whenever it's powered on.
nice_failback off    # <-- might want to turn this on after you get things working

# Tell what machines are in the cluster
# node nodename ... -- must match uname -n
node vs1.foo.com    # <-- must match uname -n !
node vs2.foo.com    # <-- must match uname -n !

4.) Stop ldirectord from starting, ensure heartbeat starts on reboot.

/etc/rc.d/init.d/ldirectord stop
/usr/sbin/chkconfig --level 2345 ldirectord off
/usr/sbin/chkconfig --level 345 heartbeat on    # <-- run on whatever init levels you want

5.) Now the critical part: starting heartbeat and verifying functionality!

At this point linux-director should NOT be running on either box. If you type ipvsadm -L on either box you should get an empty table:

[root@vs1 ha.d]# ipvsadm -L
IP Virtual Server version 0.9.11 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn

Now start up heartbeat. Tail /var/log/messages and /var/log/ha-log for important log information. My /var/log/messages looks like:

Apr 24 13:12:38 vs1 heartbeat[2070]: Configuration validated. Starting heartbeat.
Apr 24 13:12:39 vs1 heartbeat[2075]: Starting serial heartbeat on tty /dev/ttyS0
Apr 24 13:12:39 vs1 heartbeat[2075]: UDP heartbeat started on port 1001 interface eth0
Apr 24 13:12:39 vs1 heartbeat[2077]: node vs1.internal.smartbasket.com -- link eth0: status up
Apr 24 13:12:39 vs1 heartbeat[2077]: node stage-monitor -- link /dev/ttyS0: status up
Apr 24 13:12:39 vs1 heartbeat[2077]: node stage-monitor -- link eth0: status up

A quick check of ifconfig on the primary director shows that the alias interface (eth0:0) has appeared. Note that eth0:0 is *NOT* present when heartbeat isn't running.

[root@vs1 ha.d]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:02:B3:06:B6:45
          inet addr:10.0.1.5  Bcast:10.0.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:106550 errors:0 dropped:0 overruns:0 frame:0
          TX packets:75338 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          Interrupt:10 Base address:0xd000

eth0:0    Link encap:Ethernet  HWaddr 00:02:B3:06:B6:45
          inet addr:10.0.1.10  Bcast:10.0.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:10 Base address:0xd000

A ps aux on the active director shows:

root  1648  0.0  0.1  1444  868 ttyS0  SL  13:17  0:00 /usr/lib/heartbeat/heartbeat
root  1650  0.0  0.1  1332  748 ttyS0  SL  13:17  0:00 /usr/lib/heartbeat/heartbeat
root  1651  0.0  0.1  1332  736 ttyS0  SL  13:17  0:00 /usr/lib/heartbeat/heartbeat
root  1652  0.0  0.1  1328  736 ttyS0  S   13:17  0:00 /usr/lib/heartbeat/heartbeat
root  1653  0.0  0.1  1332  732 ttyS0  SL  13:17  0:00 /usr/lib/heartbeat/heartbeat
root  1654  0.0  0.1  1328  728 ttyS0  S   13:17  0:00 /usr/lib/heartbeat/heartbeat
root  1775  0.0  0.8  5352 4388 ttyS0  S   13:17  0:00 perl /etc/ha.d/resource.d/ldirectord ldir
root  1869  0.0  0.1  2344  724 pts/0  R   13:20  0:00 ps aux

6.) Test your fail-over features, understand HA.

At this point you should test your failover functionality and learn how your setup behaves. You also need to customize your ha.cf file to the specifications of your site.
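
Two of the requirements above are easy to get wrong and easy to check by script: authkeys must be mode 600 or 400, and each node line in ha.cf must match uname -n. What follows is only a rough sketch of such a check, not part of heartbeat or Ultramonkey. The function names (authkeys_mode_ok, ha_cf_nodes, preflight) are made up for illustration, and the ha.cf parsing assumes that anything after a '#' is a comment, as in the sample config above.

```python
import os
import stat


def authkeys_mode_ok(path):
    """True if the file's permission bits are exactly 0600 or 0400."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode in (0o600, 0o400)


def ha_cf_nodes(text):
    """Collect the names given on 'node' lines of ha.cf-style text.

    Assumption: anything after a '#' is a comment, as in the sample config.
    """
    nodes = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] == "node":
            nodes.extend(fields[1:])
    return nodes


def preflight(authkeys_path, ha_cf_path, local_name=None):
    """Return a list of problems found; an empty list means both checks passed."""
    if local_name is None:
        local_name = os.uname().nodename  # same value that `uname -n` prints
    problems = []
    if not authkeys_mode_ok(authkeys_path):
        problems.append("%s is not mode 600 or 400" % authkeys_path)
    with open(ha_cf_path) as handle:
        nodes = ha_cf_nodes(handle.read())
    if local_name not in nodes:
        problems.append("%s has no node line for %r" % (ha_cf_path, local_name))
    return problems
```

Assuming the files live in /etc/ha.d/ as described in step 2, you would run preflight("/etc/ha.d/authkeys", "/etc/ha.d/ha.cf") on each director and fix whatever it reports before starting heartbeat. It does not check that authkeys is identical on both boxes; you still need to copy that file over yourself.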
-- --

Good luck! Please send all updates to this document to the appropriate people for corrections!

Cheers,
Peter Mueller