Re: [lvs-users] Ldirectord not working with heartbeat, works standalone

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [lvs-users] Ldirectord not working with heartbeat, works standalone
From: Bruce Richardson <itsbruce@xxxxxxxxxxx>
Date: Tue, 10 Feb 2009 11:49:21 +0000
On Tue, Feb 10, 2009 at 11:55:32AM +0100, Sebastian Vieira wrote:
> On Mon, Feb 9, 2009 at 4:06 PM, Bruce Richardson <itsbruce@xxxxxxxxxxx> wrote:
> >        2.  There is a time delay in starting ldirectord.  If you
> >        configure it as a heartbeat resource, then LVS remains in an
> >        out-of-date configuration (or completely unconfigured) on the
> >        new master until ldirectord has been shut down on the old master
> >        and started on the new one.
> Not entirely true. You can define a 'callback' option in the ldirectord
> configuration, which can be used to launch a script that synchronizes
> the configuration on all relevant nodes. This way, whenever a change is
> made, the same change will also be made on the 'backup' director.

I possibly was not clear: I'm not talking about the on-disk
configuration (which I would make damn sure is in sync via cfengine or
puppet or similar); I'm talking about the LVS configuration as managed
by ldirectord: that is, the actual state of the virtual services (which
real servers are up or down etc.).  If ldirectord is turned off on
inactive directors then the LVS configuration on those servers may not
reflect the current situation and on the restart of ldirectord there
will be a delay while this discrepancy is detected.  This is the main
risk with managing ldirectord as a heartbeat resource and I see it as a
significant enough danger that I avoid it entirely.

Consider that if heartbeat fails over from one director to another, this
may well be because of network problems.  If there are network problems,
some of the real servers on the system may have been disrupted.  I can
easily visualise a situation where:

  1.  Some of the real servers are temporarily disrupted.
  2.  The active director marks them as offline and removes them from
      the LVS table.
  3.  The active director fails over to its HA twin.
  4.  The HA twin starts ldirectord but its LVS configuration is out of
      date because the real servers were in a different state the last
      time ldirectord was running.  There is disruption to the virtual
      services until the new master catches up with the current status
      of the network.
  5.  The real servers recover from the temporary disruption and the new
      master director adds them back in.
  6.  At some arbitrary point in the future, the new master director
      fails over to the original master (maybe because a sysadmin
      manually triggered a failover for admin purposes).  BANG - the old
      master has a state that hasn't changed since the network issues
      that also caused the failover.  There is disruption to the virtual
      services until the LVS tables are up to date.

 The extra disruption at steps 4 and 6 is quite unnecessary and
 avoidable.  As this demonstrates, running ldirectord as a heartbeat
 resource has the potential to create timebombs waiting for ordinary
 maintenance or a simple temporary cable failure (which should
 ordinarily not be a problem in a HA environment) to trigger them.  I
 plan HA systems to minimize the impact of component failure; since the
 configuration described above would actually magnify the impact, I
 avoid it.
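
The sequence above can be sketched as a toy simulation.  This is only an
illustration under stated assumptions, not how ldirectord, heartbeat or
ipvsadm are implemented: the Director class, the server names and the
run_scenario function are all invented, and the model assumes that a
director which stops monitoring simply keeps whatever LVS table it last
had.

```python
# Toy model only: Director, run_scenario and the server names are
# hypothetical and are not part of ldirectord, heartbeat or ipvsadm.

class Director:
    def __init__(self, name, real_servers, monitor_when_standby):
        self.name = name
        self.monitor_when_standby = monitor_when_standby
        self.active = False
        # This director's view of which real servers are in its LVS table.
        self.lvs_table = set(real_servers)

    def monitoring(self):
        # Health checks run while active, or always if the checker is
        # not managed as a failover resource.
        return self.active or self.monitor_when_standby

    def health_check(self, servers_up):
        if self.monitoring():
            self.lvs_table = set(servers_up)


def run_scenario(monitor_when_standby):
    servers = {"rs1", "rs2", "rs3"}
    a = Director("a", servers, monitor_when_standby)
    b = Director("b", servers, monitor_when_standby)
    a.active = True
    stale = {}  # step number -> servers whose state is wrong at takeover

    def tick(up):
        for d in (a, b):
            d.health_check(up)

    def fail_over(old, new, up, step):
        old.active = False
        new.active = True
        # Difference between the new master's table and reality,
        # measured before its first health check completes.
        stale[step] = new.lvs_table ^ set(up)

    tick(servers)            # steady state, everything up
    up = {"rs1"}             # step 1: rs2 and rs3 are disrupted
    tick(up)                 # step 2: active director removes them
    fail_over(a, b, up, 4)   # steps 3-4: twin takes over
    tick(up)
    up = set(servers)        # step 5: real servers recover
    tick(up)
    fail_over(b, a, up, 6)   # step 6: fail back to the original master
    tick(up)
    return stale


print(run_scenario(monitor_when_standby=False))
print(run_scenario(monitor_when_standby=True))
```

In this model, with monitoring tied to the active role, the new master
has the wrong state for rs2 and rs3 at both step 4 (it still lists dead
servers) and step 6 (it is still missing recovered ones).  With
monitoring running on both directors all the time, both sets are empty.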

> You can also have the ipvsadm synchronisation
> daemon(s) running so that the LVS table itself is being replicated to
> the other node. It's my experience that almost all active connections
> will be picked up by the 'backup' director upon failover so that
> clients won't experience much more than a slight delay.

As I understand it, the sync daemon *only* synchronises connection
information, not real server state.  This would not fix the scenario
described above.

I repeat my main question: what is the gain of doing this?  There is no
need to fail over ldirectord; it is safer not to do it.  The only tiny
benefit I can see is that you reduce the number of uptime checks hitting
your real servers but a) you will have to exclude all your director IP
addresses from any log analysis (e.g. webstats) anyway, no matter how
many of them are running ldirectord at once and b) the load imposed by
the director's uptime checks should be tiny compared to the actual
workload, so this really is not a win.
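
A rough back-of-envelope sketch of point b), with every number invented
purely for illustration (10 real servers, 2 virtual services, one check
per service per real server every 10 seconds); checks_per_second is a
hypothetical helper, not anything ldirectord provides:

```python
def checks_per_second(directors, real_servers, services, check_interval_s):
    """Total health-check requests per second sent to all real servers."""
    return directors * real_servers * services / check_interval_s


# Hypothetical figures, not measurements.
one_director = checks_per_second(1, 10, 2, 10)
two_directors = checks_per_second(2, 10, 2, 10)

# Extra checks each real server sees when the second director also monitors.
extra_per_real_server = (two_directors - one_director) / 10

print(one_director, two_directors, extra_per_real_server)
```

Under these assumed figures the second director adds a fraction of a
request per second to each real server, which is the sense in which the
extra load is small next to real traffic.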


Remember you're a Womble.
