LVS
lvs-users
Google
 
Web LinuxVirtualServer.org

Re: [lvs-users] ldirectord feature patch - failurecount option

To: Simon Horman <horms@xxxxxxxxxxxx>
Subject: Re: [lvs-users] ldirectord feature patch - failurecount option
Cc: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
From: Sean Millichamp <sean@xxxxxxxxxxx>
Date: Fri, 15 May 2009 11:33:03 -0400
On Fri, 2009-05-15 at 09:25 +1000, Simon Horman wrote:
> On Thu, May 14, 2009 at 05:35:27PM -0400, Sean Millichamp wrote:
> > This implements a global or per-virtual service option indicating how
> > many successive check failures must occur before ldirectord will remove
> > (or quiesce) the real server.  A check success will reset the counter of
> > failures.

> I wonder if it would be better for it to replace the existing
> checkcount implementation. Which seems to be the same but only implemented
> for ping checks. failurecount does seem to be a better name, perhaps
> it could be an alias for checkcount or vice versa?

I agree that failurecount could probably replace checkcount with one
caveat:  If administrators have a checkcount set globally that they
expect to only impact pings and we alias it to impact everything then an
update will change the behavior they had before.

Now, you could argue that an administrator of a Linux load balancer
running ldirectord should know better than to upgrade a key piece of
software without reading the documentation and changelogs.  I would
agree, but surely someone will upgrade without doing that level of
investigation.

I feel like the right approach is to leave both included, for now, and
maybe mark checkcount as deprecated with a configuration warning that it
will be removed in an future release.  Thoughts?

> > +The number of consecutive times a failure will have to be reported by a
> > +check before the realserver is removed from the kernel's LVS table.  A 
> > +value of 1 will have the realserver removed on the first failure.  A
> > +successful check will reset the failure counter to 0.
> 
> I think it would be better to say something like "before the realserver
> is considred to have failed" as if quiescent=yes the server won't
> actually be removed from the kernel's LVS table.

Agreed.

[crazy patch snipped]
> This logic looks a bit hairy, though not unexpectedly so.
> How well tested is it?

Indeed!  It took me a while to unravel exactly what was going on and how
it needed to behave.  We have been running this patch for about 2 weeks
now without issues.  Our heavily loaded servers will still fail checks
periodically (as evidenced by output in the monitorfile) but they don't
get removed unless the checks continue to fail.  Also, prior to
deploying it I tried to simulate a number of failure scenarios that we
might see in our environment (quiescent=yes, fork=yes, connect and
negotiate check types) and all worked as desired.

I am attaching a patch on the previous patch that adds deprecation
warnings and cleans up the failurecount documentation (I don't see a way
in hg to export a combined patch).

Sean

Attachment: ldirectord-failurecount-doccleanup.patch
Description: Text Data

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Send requests to lvs-users-request@xxxxxxxxxxxxxxxxxxxxxx
or go to http://lists.graemef.net/mailman/listinfo/lvs-users
<Prev in Thread] Current Thread [Next in Thread>