Re: [lvs-users] Ldirectord not working with heartbeat, works standalone

To: " users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [lvs-users] Ldirectord not working with heartbeat, works standalone
From: Simon Horman <horms@xxxxxxxxxxxx>
Date: Thu, 12 Feb 2009 14:22:23 +1100
On Tue, Feb 10, 2009 at 01:28:36PM +0100, Sebastian Vieira wrote:
> On Tue, Feb 10, 2009 at 12:49 PM, Bruce Richardson 
> <itsbruce@xxxxxxxxxxx> wrote:
> > If ldirectord is turned off on inactive directors then the LVS
> > configuration on those servers may not
> > reflect the current situation and on the restart of ldirectord there
> > will be a delay while this discrepancy is detected.  This is the main
> > risk with managing ldirectord as a heartbeat resource and I see it as a
> > significant enough danger to avoid any danger of it.
> Ah, i see your point. I never came across this issue before.
> But consider this: if a director does a failover it's because there's a
> problem with the network so any state change of a realserver gets 'lost'
> anyway during the timeout period you specified in heartbeat. If ldirectord
> starts i assume it immediately issues health-checks and thus sets the
> availability of the realservers accordingly. You can shorten the time
> between health-checks somewhat to minimize this period, but still it would
> be a period in which clients could be routed to unavailable realservers.
> Maybe it's an idea to have ldirectord parse its configuration file and
> instead of first setting up the realserver entry in the LVS table, have it
> issue a health-check for each realserver it comes across. Then set up the
> entry according to its availability.

Just to clarify: when ldirectord starts up it clears the LVS table and then
adds real-servers as they are checked. There is no time between checks,
only a pause between each iteration of all checks (checkinterval). The
checks are run serially, and each check finishes when it either gets a
result from the real-server or times out.

In short, it works a bit like this:

while 1
    foreach rs in real-server
         check_realserver rs || timeout
    sleep checkinterval

Timeouts typically take longer than getting a result, and can thus delay
the checks of subsequent real-servers. It isn't unreasonable to
expect some timeouts if the network is in a situation where fail-over has
occurred. So this could be a problem, as it may take a while for ldirectord
to check all real-servers.
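
To put rough numbers on this, here is a minimal sketch (Python, not
ldirectord code) that adds up how long one serial pass takes. The function
name, the response times and the 5 second timeout are all made up for
illustration; they are not ldirectord defaults.

```python
def serial_pass_seconds(response_times, check_timeout):
    """Duration of one serial pass over all real-servers.

    response_times holds the seconds each real-server takes to answer,
    or None if it never answers. All values are hypothetical.
    """
    total = 0.0
    for rt in response_times:
        if rt is None or rt >= check_timeout:
            total += check_timeout  # the check is abandoned at the timeout
        else:
            total += rt             # the check ends as soon as a result arrives
    return total


healthy = [0.05, 0.05, 0.05, 0.05]
degraded = [0.05, None, None, 0.05]  # two real-servers unreachable

print(serial_pass_seconds(healthy, check_timeout=5))
print(serial_pass_seconds(degraded, check_timeout=5))
```

With these made-up figures the healthy pass takes about 0.2 seconds and the
degraded pass about 10 seconds, because every unreachable real-server costs
a full timeout before the next check can begin.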

A way around this problem is to enable the recently added fork option to
ldirectord. When this is present it forks a process for each real-server to
be checked and the checks can run in parallel.

In "fork" mode, ldirectord behaves more like this:

while 1
    foreach rs in real-server
         if child-process[rs] isn't present
             child-process[rs] = fork_child(rs)
    sleep 1

And in each child process:

while 1
    check_realserver rs || timeout
    sleep checkinterval
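
The effect of running the checks in parallel can be sketched in Python as
below. This is only an illustration under assumptions: threads stand in for
the child processes that ldirectord forks, a sleep stands in for the actual
health check, only a single round of checks is run, and the names and delays
are hypothetical.

```python
import threading
import time


def check_realserver(name, delay, results):
    """Stand-in for a health check: wait `delay` seconds, then record a result."""
    time.sleep(delay)
    results[name] = "checked"


def first_results_parallel(delays):
    """Run one check per real-server concurrently; return (results, elapsed)."""
    results = {}
    start = time.monotonic()
    workers = [
        threading.Thread(target=check_realserver, args=(name, delay, results))
        for name, delay in delays.items()
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results, time.monotonic() - start


# Three slow checks of 0.2 seconds each (hypothetical values).
delays = {"rs1": 0.2, "rs2": 0.2, "rs3": 0.2}
results, elapsed = first_results_parallel(delays)
print(sorted(results), round(elapsed, 2))
```

Run serially, these three checks would need about 0.6 seconds in total.
Run concurrently, the round finishes in roughly the time of the slowest
single check, so one slow or timed-out real-server no longer holds up the
others.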

> But yes, i agree, if you want to eliminate this 'lost' period the best way
> would be to have ldirectord running on both nodes at all times. An argument
> that i thought of "heartbeat makes sure ldirectord is running" is moot if
> you have puppet handle the service state.

For the record, my position on this is that it's better to have
ldirectord running on both hosts all the time.

Simon Horman
  VA Linux Systems Japan K.K., Sydney, Australia Satellite Office

Please read the documentation before posting - it's available at: mailing list - lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Send requests to lvs-users-request@xxxxxxxxxxxxxxxxxxxxxx