I have a pair of servers running in a streamlined high-availability
load-balancing setup using UltraMonkey 3. However, I am finding that when
one particular director (the one on server nitehawk) is in charge, I get a
split-brain condition: the other server (seahawk) comes up and tries to
take over the resources. It runs OK for a while (twenty minutes to an
hour), but eventually things run amok. When the director on seahawk is in
charge, the director on nitehawk waits patiently like it's supposed to and
does not attempt a takeover unless seahawk drops out.
The two servers are connected via crossover Ethernet and a serial cable.
The Ethernet link is also used for csync2 file replication and MySQL
replication. As you can see in the log below, it appears that a single
lost packet (7th line down) causes the other server to attempt a takeover.
This is my /var/log/ha-log from the director which suffers the hostile
takeover. I can post the /var/log/ha-debug log too (although my debug
level is only 1):
heartbeat: 2007/03/05_12:12:04 info: Current arena value: 0
heartbeat: 2007/03/05_12:12:04 info: These are nothing to worry about.
heartbeat: 2007/03/05_12:20:45 info: Link seahawk.thezoo:eth2 up.
heartbeat: 2007/03/05_12:20:45 info: Status update for node seahawk.thezoo:
status up
heartbeat: 2007/03/05_12:20:45 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/03/05_12:20:45 info: Exiting status process 29770 returned
rc 0.
heartbeat: 2007/03/05_12:21:09 WARN: 1 lost packet(s) for [seahawk.thezoo]
heartbeat: 2007/03/05_12:21:09 info: Status update for node seahawk.thezoo:
status active
heartbeat: 2007/03/05_12:21:09 info: AnnounceTakeover(local 1, foreign 1,
reason 'T_RESOURCES(us)
' (1))
heartbeat: 2007/03/05_12:21:09 info: No pkts missing from seahawk.thezoo!
heartbeat: 2007/03/05_12:21:09 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/03/05_12:21:09 info: Exiting status process 29958 returned
rc 0.
heartbeat: 2007/03/05_12:21:09 info: other_holds_resources: 2
heartbeat: 2007/03/05_12:21:09 ERROR: Both machines own our resources!
heartbeat: 2007/03/05_12:21:09 info: remote resource transition completed.
heartbeat: 2007/03/05_12:21:09 info: AnnounceTakeover(local 1, foreign 1,
reason 'T_RESOURCES(us)
' (1))
heartbeat: 2007/03/05_12:21:09 ERROR: Both machines own our resources!
heartbeat: 2007/03/05_12:21:09 ERROR: Both machines own foreign resources!
heartbeat: 2007/03/05_12:21:09 info: other_holds_resources: 3
heartbeat: 2007/03/05_12:21:09 ERROR: Both machines own our resources!
heartbeat: 2007/03/05_12:21:09 ERROR: Both machines own foreign resources!
heartbeat: 2007/03/05_12:21:20 info: other_holds_resources: 3
heartbeat: 2007/03/05_12:21:20 ERROR: Both machines own our resources!
heartbeat: 2007/03/05_12:21:20 ERROR: Both machines own foreign resources!
And then on the server doing the hostile takeover:
heartbeat: 2007/03/05_12:20:42 info: Configuration validated. Starting
heartbeat 1.2.3.cvs.2005092
heartbeat: 2007/03/05_12:20:42 info: heartbeat: version 1.2.3.cvs.20050927
heartbeat: 2007/03/05_12:20:46 info: Heartbeat generation: 247
heartbeat: 2007/03/05_12:20:46 info: Starting serial heartbeat on tty
/dev/ttyS0 (19200 baud)
heartbeat: 2007/03/05_12:20:46 info: ucast: write socket priority set to
heartbeat: 2007/03/05_12:20:46 info: ucast: bound send socket to device:
heartbeat: 2007/03/05_12:20:46 info: ucast: bound receive socket to device:
heartbeat: 2007/03/05_12:20:46 info: ucast: started on port 694 interface
eth2 to
heartbeat: 2007/03/05_12:20:46 info: pid 2555 locked in memory.
heartbeat: 2007/03/05_12:20:46 info: Local status now set to: 'up'
heartbeat: 2007/03/05_12:20:47 info: pid 2578 locked in memory.
heartbeat: 2007/03/05_12:20:47 info: pid 2576 locked in memory.
heartbeat: 2007/03/05_12:20:47 info: pid 2574 locked in memory.
heartbeat: 2007/03/05_12:20:47 info: pid 2575 locked in memory.
heartbeat: 2007/03/05_12:20:47 info: pid 2577 locked in memory.
heartbeat: 2007/03/05_12:21:11 WARN: node nitehawk.thezoo: is dead
heartbeat: 2007/03/05_12:21:11 info: Local status now set to: 'active'
heartbeat: 2007/03/05_12:21:11 info: Starting child client
"/usr/lib64/heartbeat/ipfail" (17,17)
heartbeat: 2007/03/05_12:21:11 WARN: No STONITH device configured.
heartbeat: 2007/03/05_12:21:11 WARN: Shared disks are not protected.
heartbeat: 2007/03/05_12:21:11 info: Resources being acquired from
heartbeat: 2007/03/05_12:21:11 info: Starting "/usr/lib64/heartbeat/ipfail"
as uid 17 gid 17 (pid
heartbeat: 2007/03/05_12:21:11 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/03/05_12:21:12 info: /usr/lib64/heartbeat/mach_down:
nice_failback: foreign resour
ces acquired
heartbeat: 2007/03/05_12:21:12 info: AnnounceTakeover(local 0, foreign 1,
reason 'T_RESOURCES' (0)
heartbeat: 2007/03/05_12:21:12 info: AnnounceTakeover(local 1, foreign 1,
reason 'T_RESOURCES(us)'
heartbeat: 2007/03/05_12:21:12 info: Initial resource acquisition complete
heartbeat: 2007/03/05_12:21:12 info: mach_down takeover complete.
heartbeat: 2007/03/05_12:21:12 info: AnnounceTakeover(local 1, foreign 1,
reason 'mach_down' (1))
heartbeat: 2007/03/05_12:21:12 info: STATE 1 => 3
heartbeat: 2007/03/05_12:21:12 info: mach_down takeover complete for node
heartbeat: 2007/03/05_12:21:12 info: Exiting status process 3647 returned rc
heartbeat: 2007/03/05_12:21:12 info: 1 local resources from
listkeys seahawk.thezoo]
heartbeat: 2007/03/05_12:21:12 info: Local Resource acquisition completed.
heartbeat: 2007/03/05_12:21:12 info: AnnounceTakeover(local 1, foreign 1,
reason 'T_RESOURCES(us)'
heartbeat: 2007/03/05_12:21:12 info: Exiting req_our_resources process 3648
returned rc 0.
heartbeat: 2007/03/05_12:21:12 info: AnnounceTakeover(local 1, foreign 1,
reason 'req_our_resource
s' (1))
heartbeat: 2007/03/05_12:21:12 info: Running /etc/ha.d/rc.d/ip-request-resp
heartbeat: 2007/03/05_12:21:12 received ip-request-resp ldirectord OK yes
heartbeat: 2007/03/05_12:21:12 info: Acquiring resource group:
seahawk.thezoo ldirectord LVSSyncDaemonSwap::master
IPaddr2:: .........
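To put a number on the timing in the second log, the gap between seahawk setting its local status to 'up' and declaring nitehawk dead can be read off the timestamps. Below is a minimal Python sketch of that arithmetic; the helper name parse_ts is purely illustrative and is not part of heartbeat, and the two log lines are copied from the log above.

```python
from datetime import datetime

def parse_ts(line):
    """Return the timestamp of a heartbeat log line.

    Assumes the second whitespace-separated field looks like
    2007/03/05_12:21:11, as in the log lines above.
    """
    stamp = line.split()[1]
    return datetime.strptime(stamp, "%Y/%m/%d_%H:%M:%S")

# Two lines taken from the log of the server doing the takeover.
up_line = "heartbeat: 2007/03/05_12:20:46 info: Local status now set to: 'up'"
dead_line = "heartbeat: 2007/03/05_12:21:11 WARN: node nitehawk.thezoo: is dead"

elapsed = (parse_ts(dead_line) - parse_ts(up_line)).total_seconds()
print("seconds between 'up' and 'is dead':", elapsed)
```

By these timestamps seahawk waited 25 seconds after coming up before it declared nitehawk dead.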
My ha.cf:
debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 5
deadtime 20
warntime 10
initdead 20
udpport 694
baud 19200
ucast eth2
auto_failback off
node nitehawk.thezoo
node seahawk.thezoo
respawn hacluster /usr/lib64/heartbeat/ipfail
deadping 30
debug 1
apiauth ipfail uid=hacluster gid=haclient
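As a rough sanity check on the timing directives above, here is a small Python sketch. It is not a heartbeat tool: the function names are made up for illustration, it assumes the values are plain integers in seconds, and the two rules it applies (warntime below deadtime, initdead at least twice deadtime) are rules of thumb as I recall them from the comments in the sample ha.cf, so treat them as assumptions rather than hard requirements.

```python
# Timing directives copied from the ha.cf above (plain seconds assumed).
HA_CF = """\
keepalive 5
deadtime 20
warntime 10
initdead 20
"""

TIMING_KEYS = ("keepalive", "deadtime", "warntime", "initdead")

def parse_timings(text):
    """Collect the integer timing directives from ha.cf-style text."""
    timings = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in TIMING_KEYS:
            timings[fields[0]] = int(fields[1])
    return timings

def check_timings(t):
    """Return a list of rule-of-thumb violations (assumed rules, see above)."""
    problems = []
    if t["warntime"] >= t["deadtime"]:
        problems.append("warntime is not below deadtime")
    if t["initdead"] < 2 * t["deadtime"]:
        problems.append("initdead is less than twice deadtime")
    return problems

timings = parse_timings(HA_CF)
# Number of keepalive intervals that fit into one deadtime.
missed = timings["deadtime"] // timings["keepalive"]
print("keepalive intervals per deadtime:", missed)
for problem in check_timings(timings):
    print("check:", problem)
```

With the values above, four keepalive intervals fit into one deadtime, and the only rule flagged is that initdead equals deadtime rather than being at least double it.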
My haresources file. It is the same on both servers, with seahawk.thezoo
as the node name heading the resource group:
seahawk.thezoo \
ldirectord \
LVSSyncDaemonSwap::master \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
IPaddr2:: \
Since the problem appears to be in heartbeat, I won't bother posting my
ldirectord.cf file (which just lists port 80 services for all of the
above).
Are my config settings too tight? Why would this cause problems only when
one of the servers is the director and not when the other is?
Dan Brown