Split Brain issue when certain director is in charge

To: "'LinuxVirtualServer.org users mailing list.'" <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Split Brain issue when certain director is in charge
From: "Dan Brown" <danb@xxxxxx>
Date: Tue, 6 Mar 2007 13:58:47 -0600
I have a pair of servers running in a streamlined high-availability
load-balancing setup using UltraMonkey 3.  I am finding, however, that when
the director on one particular server (nitehawk) is in charge, I get a split
brain between the two servers: the other server (seahawk) comes up and
tries to take over resources.  It will run OK for a while (twenty minutes
to an hour) but eventually, of course, things run amok.  When the director
on seahawk is in charge, the director on nitehawk waits patiently like it's
supposed to and does not attempt a takeover unless seahawk drops out.

They are connected via crossover ethernet and serial cable.  The ethernet
link is also used for csync2 file replication and mysql replication.  As you
can see in the first log, it appears a single lost packet (7th entry down)
causes the other server to attempt a takeover.

This is my /var/log/ha-log from the director which suffers the hostile
takeover (I can post the /var/log/ha-debug log too, although my debug level
is only 1):

heartbeat: 2007/03/05_12:12:04 info: Current arena value: 0
heartbeat: 2007/03/05_12:12:04 info: These are nothing to worry about.
heartbeat: 2007/03/05_12:20:45 info: Link seahawk.thezoo:eth2 up.
heartbeat: 2007/03/05_12:20:45 info: Status update for node seahawk.thezoo:
status up
heartbeat: 2007/03/05_12:20:45 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/03/05_12:20:45 info: Exiting status process 29770 returned
rc 0.
heartbeat: 2007/03/05_12:21:09 WARN: 1 lost packet(s) for [seahawk.thezoo]
[6:8]
heartbeat: 2007/03/05_12:21:09 info: Status update for node seahawk.thezoo:
status active
heartbeat: 2007/03/05_12:21:09 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat: 2007/03/05_12:21:09 info: No pkts missing from seahawk.thezoo!
heartbeat: 2007/03/05_12:21:09 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/03/05_12:21:09 info: Exiting status process 29958 returned
rc 0.
heartbeat: 2007/03/05_12:21:09 info: other_holds_resources: 2
heartbeat: 2007/03/05_12:21:09 ERROR: Both machines own our resources!
heartbeat: 2007/03/05_12:21:09 info: remote resource transition completed.
heartbeat: 2007/03/05_12:21:09 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat: 2007/03/05_12:21:09 ERROR: Both machines own our resources!
heartbeat: 2007/03/05_12:21:09 ERROR: Both machines own foreign resources!
heartbeat: 2007/03/05_12:21:09 info: other_holds_resources: 3
heartbeat: 2007/03/05_12:21:09 ERROR: Both machines own our resources!
heartbeat: 2007/03/05_12:21:09 ERROR: Both machines own foreign resources!
heartbeat: 2007/03/05_12:21:20 info: other_holds_resources: 3
heartbeat: 2007/03/05_12:21:20 ERROR: Both machines own our resources!
heartbeat: 2007/03/05_12:21:20 ERROR: Both machines own foreign resources!



And then on the server doing the hostile takeover:

heartbeat: 2007/03/05_12:20:42 info: Configuration validated. Starting heartbeat 1.2.3.cvs.20050927
heartbeat: 2007/03/05_12:20:42 info: heartbeat: version 1.2.3.cvs.20050927
heartbeat: 2007/03/05_12:20:46 info: Heartbeat generation: 247
heartbeat: 2007/03/05_12:20:46 info: Starting serial heartbeat on tty
/dev/ttyS0 (19200 baud)
heartbeat: 2007/03/05_12:20:46 info: ucast: write socket priority set to
IPTOS_LOWDELAY on eth2
heartbeat: 2007/03/05_12:20:46 info: ucast: bound send socket to device:
eth2
heartbeat: 2007/03/05_12:20:46 info: ucast: bound receive socket to device:
eth2
heartbeat: 2007/03/05_12:20:46 info: ucast: started on port 694 interface
eth2 to 10.0.0.1
heartbeat: 2007/03/05_12:20:46 info: pid 2555 locked in memory.
heartbeat: 2007/03/05_12:20:46 info: Local status now set to: 'up'
heartbeat: 2007/03/05_12:20:47 info: pid 2578 locked in memory.
heartbeat: 2007/03/05_12:20:47 info: pid 2576 locked in memory.
heartbeat: 2007/03/05_12:20:47 info: pid 2574 locked in memory.
heartbeat: 2007/03/05_12:20:47 info: pid 2575 locked in memory.
heartbeat: 2007/03/05_12:20:47 info: pid 2577 locked in memory.
heartbeat: 2007/03/05_12:21:11 WARN: node nitehawk.thezoo: is dead
heartbeat: 2007/03/05_12:21:11 info: Local status now set to: 'active'
heartbeat: 2007/03/05_12:21:11 info: Starting child client
"/usr/lib64/heartbeat/ipfail" (17,17)
heartbeat: 2007/03/05_12:21:11 WARN: No STONITH device configured.
heartbeat: 2007/03/05_12:21:11 WARN: Shared disks are not protected.
heartbeat: 2007/03/05_12:21:11 info: Resources being acquired from
nitehawk.thezoo.
heartbeat: 2007/03/05_12:21:11 info: Starting "/usr/lib64/heartbeat/ipfail" as uid 17  gid 17 (pid 3645)
heartbeat: 2007/03/05_12:21:11 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/03/05_12:21:12 info: /usr/lib64/heartbeat/mach_down: nice_failback: foreign resources acquired
heartbeat: 2007/03/05_12:21:12 info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0))
heartbeat: 2007/03/05_12:21:12 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
heartbeat: 2007/03/05_12:21:12 info: Initial resource acquisition complete
(T_RESOURCES(us))
heartbeat: 2007/03/05_12:21:12 info: mach_down takeover complete.
heartbeat: 2007/03/05_12:21:12 info: AnnounceTakeover(local 1, foreign 1,
reason 'mach_down' (1))
heartbeat: 2007/03/05_12:21:12 info: STATE 1 => 3
heartbeat: 2007/03/05_12:21:12 info: mach_down takeover complete for node
nitehawk.thezoo.
heartbeat: 2007/03/05_12:21:12 info: Exiting status process 3647 returned rc
0.
heartbeat: 2007/03/05_12:21:12 info: 1 local resources from
[/usr/lib64/heartbeat/ResourceManager 
listkeys seahawk.thezoo]
heartbeat: 2007/03/05_12:21:12 info: Local Resource acquisition completed.
heartbeat: 2007/03/05_12:21:12 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
heartbeat: 2007/03/05_12:21:12 info: Exiting req_our_resources process 3648 returned rc 0.
heartbeat: 2007/03/05_12:21:12 info: AnnounceTakeover(local 1, foreign 1, reason 'req_our_resources' (1))
heartbeat: 2007/03/05_12:21:12 info: Running /etc/ha.d/rc.d/ip-request-resp
ip-request-resp
heartbeat: 2007/03/05_12:21:12 received ip-request-resp ldirectord OK yes
heartbeat: 2007/03/05_12:21:12 info: Acquiring resource group:
seahawk.thezoo ldirectord LVSSyncDaemonSwap::master
IPaddr2::192.168.0.3/24/eth1/192.168.0.255 .........
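
For comparing the log against the timers in ha.cf, it may help to measure
how long seahawk actually waited before declaring nitehawk dead.  The
following is only a rough sketch: the two log lines are copied from the
seahawk log above, and the helper names (parse_ts, seconds_between) are
made up for illustration, not part of heartbeat.

```python
from datetime import datetime

# Two entries copied from the seahawk log above.
LOG = """\
heartbeat: 2007/03/05_12:20:46 info: Local status now set to: 'up'
heartbeat: 2007/03/05_12:21:11 WARN: node nitehawk.thezoo: is dead
"""

def parse_ts(line):
    """Return the timestamp of a heartbeat log line as a datetime."""
    stamp = line.split()[1]  # e.g. "2007/03/05_12:20:46"
    return datetime.strptime(stamp, "%Y/%m/%d_%H:%M:%S")

def seconds_between(lines, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the
    first line containing second_marker."""
    start = next(parse_ts(l) for l in lines if first_marker in l)
    end = next(parse_ts(l) for l in lines if second_marker in l)
    return (end - start).total_seconds()

gap = seconds_between(LOG.splitlines(),
                      "Local status now set to: 'up'",
                      "is dead")
print(gap)  # 25.0
```

So seahawk went from 'up' to declaring its peer dead in 25 seconds, which
can be set against the initdead and deadtime values in the ha.cf below.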



My ha.cf
------------------------------------------------------------------------
debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 5
deadtime 20
warntime 10
initdead 20
udpport 694
baud    19200
ucast eth2 10.0.0.1
auto_failback off
node    nitehawk.thezoo
node    seahawk.thezoo

respawn hacluster /usr/lib64/heartbeat/ipfail

deadping 30
debug 1
apiauth ipfail uid=hacluster gid=haclient
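
To put the timers above in terms of heartbeats, here is a small sketch that
reads the numeric directives and works out how many keepalive intervals fit
into warntime and deadtime.  It assumes the values are plain seconds, as
written above, and parse_timers is a hypothetical helper, not a heartbeat
tool.

```python
# Numeric timing directives copied from the ha.cf above.
HA_CF = """\
keepalive 5
deadtime 20
warntime 10
initdead 20
deadping 30
"""

def parse_timers(text):
    """Collect 'name value' lines whose value is a whole number."""
    timers = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            timers[parts[0]] = int(parts[1])
    return timers

timers = parse_timers(HA_CF)

# How many heartbeat intervals elapse before a warning / a dead node.
missed_before_warn = timers["warntime"] // timers["keepalive"]
missed_before_dead = timers["deadtime"] // timers["keepalive"]

# The comments in heartbeat's sample ha.cf suggest, as far as I recall,
# that initdead be at least twice deadtime; here the ratio is 1.0.
initdead_ratio = timers["initdead"] / timers["deadtime"]

print(missed_before_warn, missed_before_dead, initdead_ratio)  # 2 4 1.0
```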


My haresources.  The cluster config has seahawk.thezoo as the name of 
the haresources group on both servers.
------------------------------------------------------------------------
seahawk.thezoo \
        ldirectord \
        LVSSyncDaemonSwap::master \
        IPaddr2::192.168.0.3/24/eth1/192.168.0.255 \
        IPaddr2::216.94.150.11/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.15/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.16/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.17/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.18/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.19/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.20/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.21/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.22/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.23/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.24/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.26/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.28/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.30/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.32/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.33/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.34/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.35/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.36/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.37/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.38/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.40/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.42/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.44/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.45/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.47/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.48/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.49/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.50/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.52/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.53/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.57/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.58/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.59/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.61/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.64/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.65/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.66/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.67/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.68/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.69/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.70/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.71/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.73/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.75/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.77/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.80/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.81/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.82/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.84/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.85/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.86/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.88/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.89/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.90/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.91/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.92/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.93/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.95/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.96/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.97/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.98/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.100/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.101/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.104/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.107/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.112/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.113/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.115/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.116/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.117/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.118/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.119/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.120/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.121/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.123/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.124/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.125/24/eth0/216.94.150.127 \
        IPaddr2::216.94.150.126/24/eth0/216.94.150.127 
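
As a sanity check on the list above, this sketch splits an IPaddr2 entry
into address, prefix length, interface and broadcast, and compares the
listed broadcast with the one implied by the prefix.  The helper names are
invented for illustration.  It is unrelated to the heartbeat timing
question, and whether the .127 broadcast is intentional is not something
the config itself says.

```python
import ipaddress

def parse_ipaddr2(resource):
    """Split 'IPaddr2::addr/prefix/iface/broadcast' into its parts."""
    _, spec = resource.split("::", 1)
    addr, prefix, iface, bcast = spec.split("/")
    return addr, int(prefix), iface, bcast

def expected_broadcast(addr, prefix):
    """Broadcast address of the network that addr/prefix belongs to."""
    net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(net.broadcast_address)

# Two entries copied from the haresources above.
for res in ("IPaddr2::192.168.0.3/24/eth1/192.168.0.255",
            "IPaddr2::216.94.150.11/24/eth0/216.94.150.127"):
    addr, prefix, iface, bcast = parse_ipaddr2(res)
    print(addr, iface, bcast, expected_broadcast(addr, prefix))
```

For the eth1 entry the two broadcasts agree.  For the eth0 entries a /24
implies 216.94.150.255, while 216.94.150.127 is what a /25 would give.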

Seeing as the problem is with heartbeat, I won't bother posting my
ldirectord.cf file (which just lists port 80 services for all of the above).

Are my config settings too tight?  Why would it only cause problems
when one of the servers is director and not when the other is?

___________________________________________________
Dan Brown
danb@xxxxxx

