
Re: Split Brain issue when certain director is in charge

To: Dan Brown <danb@xxxxxx>
Subject: Re: Split Brain issue when certain director is in charge
Cc: "'LinuxVirtualServer.org users mailing list.'" <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
From: Simon Horman <horms@xxxxxxxxxxxx>
Date: Tue, 24 Apr 2007 12:21:55 +0900
On Tue, Mar 06, 2007 at 01:58:47PM -0600, Dan Brown wrote:
> 
> I have a pair of servers running in a streamlined high-availability
> load-balancing setup using UltraMonkey 3.  I am finding however that when a
> certain director (on server nitehawk) is in charge, it causes a split brain
> issue between the two servers as the other server (seahawk) will come up and
> try to take over resources.  It will run ok for a while (like twenty minutes
> to an hour) but eventually of course things run amuck.  When the other
> director is in charge, the other director (on nitehawk) will wait patiently
> like it's supposed to and not attempt a takeover unless the other server
> (seahawk) drops out.

Hi Dan,

this is kind of curious. Off the top of my head, I wonder if one of
the following two problems is occurring:

1) the timeouts are too short

2) It's taking too long to bring up the large number of IPaddr resources
   you have listed - I recommend using a fwmark instead if possible.

To be honest, I doubt that either is the case, but it should be easy
enough to test these parameters and see if they resolve the problem.
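
As a rough sketch of how the timeout settings could be sanity-checked,
the following reads the timing directives (keepalive, warntime, deadtime,
initdead) out of ha.cf text and flags relationships that look suspicious.
The sample values are hypothetical and are not taken from Dan's
configuration; the rule of thumb that initdead be at least twice deadtime
is a common guideline, not a hard limit.

```python
import re

# Hypothetical ha.cf excerpt; values are illustrative only.
SAMPLE_HA_CF = """\
# timing directives
keepalive 2
warntime 10
deadtime 30
initdead 120
"""

TIMING_KEYS = ("keepalive", "warntime", "deadtime", "initdead")


def parse_timings(text):
    """Return the timing directives found in ha.cf text, in seconds."""
    timings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        m = re.match(r"(\w+)\s+(\d+(?:\.\d+)?)(ms)?$", line)
        if m and m.group(1) in TIMING_KEYS:
            value = float(m.group(2))
            # Assumes values are plain seconds or carry an "ms" suffix.
            timings[m.group(1)] = value / 1000 if m.group(3) else value
    return timings


def check_timings(t):
    """Return a list of warnings about suspicious timing relationships."""
    def have(*keys):
        return all(k in t for k in keys)

    problems = []
    if have("warntime", "deadtime") and t["warntime"] >= t["deadtime"]:
        problems.append("warntime should be shorter than deadtime")
    if have("initdead", "deadtime") and t["initdead"] < 2 * t["deadtime"]:
        problems.append("initdead is usually at least twice deadtime")
    if have("keepalive", "deadtime") and t["deadtime"] <= t["keepalive"]:
        problems.append("deadtime must be longer than keepalive")
    return problems


if __name__ == "__main__":
    for problem in check_timings(parse_timings(SAMPLE_HA_CF)):
        print("WARNING:", problem)
```

Running it against the real ha.cf from each director (read with
open("/etc/ha.d/ha.cf").read(), assuming the usual location) would show
quickly whether the timeouts are worth experimenting with.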

[snip]

> And then on the server doing the hostile takeover:
> 
> heartbeat: 2007/03/05_12:20:42 info: Configuration validated. Starting
> heartbeat 1.2.3.cvs.20050927
> heartbeat: 2007/03/05_12:20:42 info: heartbeat: version 1.2.3.cvs.20050927
> heartbeat: 2007/03/05_12:20:46 info: Heartbeat generation: 247
> heartbeat: 2007/03/05_12:20:46 info: Starting serial heartbeat on tty
> /dev/ttyS0 (19200 baud)
> heartbeat: 2007/03/05_12:20:46 info: ucast: write socket priority set to
> IPTOS_LOWDELAY on eth2
> heartbeat: 2007/03/05_12:20:46 info: ucast: bound send socket to device:
> eth2
> heartbeat: 2007/03/05_12:20:46 info: ucast: bound receive socket to device:
> eth2
> heartbeat: 2007/03/05_12:20:46 info: ucast: started on port 694 interface
> eth2 to 10.0.0.1
> heartbeat: 2007/03/05_12:20:46 info: pid 2555 locked in memory.
> heartbeat: 2007/03/05_12:20:46 info: Local status now set to: 'up'
> heartbeat: 2007/03/05_12:20:47 info: pid 2578 locked in memory.
> heartbeat: 2007/03/05_12:20:47 info: pid 2576 locked in memory.
> heartbeat: 2007/03/05_12:20:47 info: pid 2574 locked in memory.
> heartbeat: 2007/03/05_12:20:47 info: pid 2575 locked in memory.
> heartbeat: 2007/03/05_12:20:47 info: pid 2577 locked in memory.
> heartbeat: 2007/03/05_12:21:11 WARN: node nitehawk.thezoo: is dead

The above line indicates that this machine thinks the other is
down. Perhaps log messages of some priorities are missing, as it seems
odd that there are no warnings about links being down.
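
One thing that can be read straight from the log is how long heartbeat
waited before declaring the peer dead. A minimal sketch, using three of
the lines quoted above (with the quote markers removed) and assuming the
timestamp format shown there; the result can then be compared against
the initdead and deadtime values in ha.cf:

```python
import re
from datetime import datetime

# Three lines copied from the quoted log, without the "> " prefix.
LOG = """\
heartbeat: 2007/03/05_12:20:46 info: Local status now set to: 'up'
heartbeat: 2007/03/05_12:20:47 info: pid 2577 locked in memory.
heartbeat: 2007/03/05_12:21:11 WARN: node nitehawk.thezoo: is dead
"""

STAMP = re.compile(r"^heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) ")


def stamp(line):
    """Return the timestamp of a heartbeat log line, or None."""
    m = STAMP.match(line)
    return datetime.strptime(m.group(1), "%Y/%m/%d_%H:%M:%S") if m else None


def seconds_until_dead(log_text):
    """Seconds between local status 'up' and the first 'is dead' warning."""
    up_at = None
    for line in log_text.splitlines():
        when = stamp(line)
        if when is None:
            continue
        if "Local status now set to: 'up'" in line:
            up_at = when
        elif line.rstrip().endswith("is dead") and up_at is not None:
            return (when - up_at).total_seconds()
    return None


if __name__ == "__main__":
    print(seconds_until_dead(LOG))  # 25.0 for the excerpt above
```

Here the peer was declared dead 25 seconds after the local node came up.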

Alternatively, perhaps it is a configuration error.
The configurations are the same on both machines, right?
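
A quick way to answer that is to compare checksums of the heartbeat
configuration files from both directors. The sketch below assumes the
peer's files have already been copied into a local directory; the
directory names and the helper functions are hypothetical. Note that
ha.cf can legitimately differ between nodes (for example in a ucast
peer address), whereas haresources and authkeys should match.

```python
import hashlib
from pathlib import Path

# Files usually found under /etc/ha.d on a heartbeat 1.x director.
CONFIG_FILES = ("ha.cf", "haresources", "authkeys")


def digest_configs(directory, names=CONFIG_FILES):
    """Map each config file name to its SHA-256 hex digest (None if absent)."""
    digests = {}
    for name in names:
        path = Path(directory) / name
        if path.is_file():
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            digests[name] = None
    return digests


def differing(local, peer):
    """Return the sorted names whose digests differ between the two nodes."""
    return sorted(n for n in local if local[n] != peer.get(n))


if __name__ == "__main__":
    # Hypothetical paths: local config, plus a copy fetched from the peer.
    local = digest_configs("/etc/ha.d")
    peer = digest_configs("/tmp/peer-ha.d")
    for name in differing(local, peer):
        print("differs:", name)
```

Any file it reports can then be inspected with diff to see whether the
difference is intentional.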

-- 
Horms
  H: http://www.vergenet.net/~horms/
  W: http://www.valinux.co.jp/en/

