When I pull the Ethernet plug on my primary node, stage-monitor, heartbeat
doesn't fail the resource over to the other director. If I manually stop the
service, failover seems to work fine. Is this expected? Can someone elaborate?
My ha.cf is listed below.
Also, I've brainstormed a bit and thought of a few plausible scenarios for
failure. Could someone go through them real quick?
Thanks!
Peter
---------------------
Possible failures of the LVS system to test (to find out what happens):
Director failure:
1.) network fails on NIC in some way - cable, switch, or card
test - disconnect network cable on ACTIVE and PASSIVE director while a load
is active. Measure the time it takes for failover to occur and what happens
to current resources (i.e. active connections)
result - VIP dies. secondary director doesn't take over!
2.) serial cable is disconnected or "goes bad"
test - disconnect serial cable. Does the failover message switch to ETH/UDP?
Is service interrupted?
3.) either director dies (nice recovery?)
test - shut off primary & failover directors and monitor the time it takes
for failover to occur and what happens to current resources.
4.) software stops running for some reason on either director
test - kill all relevant software (test for each program) and see what
happens.
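
To put a number on "the time it takes for failover to occur" in the director tests above, a rough sketch like the following could be run from a client machine while the cable is pulled. It is only an illustration: the VIP address and port are made-up placeholders, and the function names (probe, longest_outage, watch) are my own. It only measures whether new TCP connections to the VIP succeed, so it says nothing about what happens to connections that were already open, and its resolution is limited by the probe interval and timeout.

```python
import socket
import time


def probe(host, port, timeout=0.5):
    """Return True if a TCP connect to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples):
    """samples: (timestamp, is_up) pairs in time order.

    Returns (seconds, still_down): the longest gap between the first failed
    probe of an outage and the next successful one, and whether the last
    outage never recovered (i.e. no takeover was observed).
    """
    longest = 0.0
    down_since = None
    for ts, is_up in samples:
        if not is_up and down_since is None:
            down_since = ts          # outage starts
        elif is_up and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None        # service is back
    return longest, down_since is not None


def watch(host, port, duration=60.0, interval=0.2):
    """Probe host:port every `interval` seconds for `duration` seconds."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(host, port)))
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    # Hypothetical values; replace with the real VIP and service port.
    VIP, PORT = "192.0.2.10", 80
    seconds, still_down = longest_outage(watch(VIP, PORT))
    print("longest outage: %.1fs, still down at end: %s" % (seconds, still_down))
```

If the secondary director never takes over, as in the result noted for test 1, the second value comes back True and the outage length is not counted.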
Real server failure:
1.) network fails on NIC in some way - cable, switch, or card
test - disconnect the network and see what the reaction of LVS is. observe
how connections flow on LVS (mac address problem?).
2.) apache dies
test - stop or kill apache. see how LVS reacts (mac address problem?)
3.) Tomcat dies
test - stop or kill Tomcat. see how LVS reacts (mac address problem?)
<ha.cf>
[root@stage-monitor ha.d]# more ha.cf
# File to write debug messages to
debugfile /var/log/ha-debug
# File to write other messages to
logfile /var/log/ha-log
# Facility to use for syslog()/logger
logfacility local0
# keepalive: how many seconds between heartbeats
keepalive 1
# deadtime: seconds-to-declare-host-dead
deadtime 3
# initdead: added per mailing list archive
#initdead 40
# hopfudge maximum hop count minus number of nodes in config
#hopfudge 1
# serial serialportname ...
serial /dev/ttyS0
# Only for serial ports. It applies to both PPP/UDP and "raw" ports
# This means run PPP over ports ttyS1 and ttyS2
# Their respective IP addresses are as listed.
# Note that I enforce that these are local addresses. Other addresses
# are almost certainly a mistake.
#ppp-udp /dev/ttyS1 10.0.0.1 /dev/ttyS2 10.0.0.2
# Baud rate for both serial and ppp-udp ports...
#baud 19200
# What UDP port to use for udp or ppp-udp communication?
#udpport 1001
# What interfaces to heartbeat over?
udp eth0
# Watchdog is the watchdog timer. If our own heart doesn't beat for
# a minute, then our machine will reboot.
#watchdog /dev/watchdog
# Nice_failback sets the behavior when performing a failback:
#
# - if it's on, when the primary node starts or comes back from any
# failure and the cluster is already active, i.e. the secondary
# server performed a failover, the primary stays quiet, acting as a
# secondary. This way some operations like syncing disks can be
# easily done.
# - if it's off (default), the primary node will always be the primary,
# whenever it's powered on.
nice_failback off
# Tell what machines are in the cluster
# node nodename ... -- must match uname -n
node stage-monitor
node vs1.internal.smartbasket.com