Re: [lvs-users] ldirectord feature patch - failurecount option

To: Sean Millichamp <sean@xxxxxxxxxxx>
Subject: Re: [lvs-users] ldirectord feature patch - failurecount option
Cc: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
From: Simon Horman <horms@xxxxxxxxxxxx>
Date: Sat, 16 May 2009 02:23:01 +1000
On Fri, May 15, 2009 at 11:33:03AM -0400, Sean Millichamp wrote:
> On Fri, 2009-05-15 at 09:25 +1000, Simon Horman wrote:
> > On Thu, May 14, 2009 at 05:35:27PM -0400, Sean Millichamp wrote:
> > > This implements a global or per-virtual service option indicating how
> > > many successive check failures must occur before ldirectord will remove
> > > (or quiesce) the real server.  A check success will reset the counter of
> > > failures.
> 
> > I wonder if it would be better for it to replace the existing
> > checkcount implementation, which seems to do the same thing but is only
> > implemented for ping checks. failurecount does seem to be a better name;
> > perhaps it could be an alias for checkcount, or vice versa?
> 
> I agree that failurecount could probably replace checkcount, with one
> caveat: if administrators have checkcount set globally and expect it to
> affect only ping checks, then aliasing it to affect every check type means
> an upgrade will change the behavior they had before.
> 
> Now, you could argue that an administrator of a Linux load balancer
> running ldirectord should know better than to upgrade a key piece of
> software without reading the documentation and changelogs.  I would
> agree, but surely someone will upgrade without doing that level of
> investigation.
> 
> I feel like the right approach is to leave both included, for now, and
> maybe mark checkcount as deprecated with a configuration warning that it
> will be removed in a future release.  Thoughts?

That is a very good point, I agree.
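
For illustration, something like the following ldirectord.cf sketch is what
I'd expect the transition period to look like. The addresses and values are
made up, and the per-virtual placement of failurecount assumes the syntax
from your patch:

    # Global defaults
    checktimeout=10
    checkinterval=5
    quiescent=yes
    failurecount=3      # new option: consecutive failures before a real
                        # server is considered failed
    # checkcount=3      # old ping-only option, kept for now but flagged
                        # as deprecated by a configuration warning

    virtual=192.168.0.10:80
            real=192.168.0.11:80 masq
            real=192.168.0.12:80 masq
            service=http
            checktype=negotiate
            request="index.html"
            receive="OK"
            failurecount=5  # per-virtual override of the global value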

> > > +The number of consecutive times a failure will have to be reported by a
> > > +check before the realserver is removed from the kernel's LVS table.  A 
> > > +value of 1 will have the realserver removed on the first failure.  A
> > > +successful check will reset the failure counter to 0.
> > 
> > I think it would be better to say something like "before the realserver
> > is considered to have failed", as with quiescent=yes the server won't
> > actually be removed from the kernel's LVS table.
> 
> Agreed.
> 
> [crazy patch snipped]
> > This logic looks a bit hairy, though not unexpectedly so.
> > How well tested is it?
> 
> Indeed!  It took me a while to unravel exactly what was going on and how
> it needed to behave.  We have been running this patch for about 2 weeks
> now without issues.  Our heavily loaded servers will still fail checks
> periodically (as evidenced by output in the monitorfile) but they don't
> get removed unless the checks continue to fail.  Also, prior to
> deploying it I tried to simulate a number of failure scenarios that we
> might see in our environment (quiescent=yes, fork=yes, connect and
> negotiate check types) and all worked as desired.
> 
> I am attaching a patch on top of the previous patch that adds deprecation
> warnings and cleans up the failurecount documentation (I don't see a way
> in hg to export a combined patch).

Thanks. This looks good. I'll review it a bit more closely once
I've had some sleep.
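
For anyone following the thread, the counter behaviour described above boils
down to something like the sketch below. This is not the Perl from the patch,
just an illustrative model in Python with made-up names: each failed check
increments a per-realserver counter, a single successful check resets it, and
only once the counter reaches failurecount is the server quiesced or removed.

    # Illustrative model of the failurecount logic; not ldirectord code.
    class RealServerState:
        def __init__(self, failurecount=1, quiescent=True):
            self.failurecount = failurecount  # consecutive failures tolerated
            self.quiescent = quiescent        # quiesce (weight 0) vs. remove
            self.failures = 0                 # consecutive failures seen so far
            self.available = True

        def check_result(self, ok):
            if ok:
                # A successful check resets the counter and restores the server.
                self.failures = 0
                if not self.available:
                    self.available = True
                    print("restore real server to full weight")
                return
            self.failures += 1
            if self.available and self.failures >= self.failurecount:
                self.available = False
                if self.quiescent:
                    print("quiesce real server (weight set to 0)")
                else:
                    print("remove real server from the kernel's LVS table")

    # With failurecount=3, two isolated failures do nothing; only a third
    # consecutive failure quiesces the server.
    s = RealServerState(failurecount=3)
    for result in (False, True, False, False, False):
        s.check_result(result)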


_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Send requests to lvs-users-request@xxxxxxxxxxxxxxxxxxxxxx
or go to http://lists.graemef.net/mailman/listinfo/lvs-users
