On Fri, May 15, 2009 at 11:33:03AM -0400, Sean Millichamp wrote:
> On Fri, 2009-05-15 at 09:25 +1000, Simon Horman wrote:
> > On Thu, May 14, 2009 at 05:35:27PM -0400, Sean Millichamp wrote:
> > > This implements a global or per-virtual service option indicating how
> > > many successive check failures must occur before ldirectord will remove
> > > (or quiesce) the real server. A check success will reset the counter of
> > > failures.
>
> > I wonder if it would be better for it to replace the existing
> > checkcount implementation, which seems to do the same thing but is
> > only implemented for ping checks. failurecount does seem to be the
> > better name; perhaps it could be an alias for checkcount, or vice versa?
>
> I agree that failurecount could probably replace checkcount, with one
> caveat: if administrators have checkcount set globally and expect it to
> affect only ping checks, then aliasing it to affect every check type
> means an upgrade will silently change the behavior they relied on.
>
> Now, you could argue that an administrator of a Linux load balancer
> running ldirectord should know better than to upgrade a key piece of
> software without reading the documentation and changelogs. I would
> agree, but surely someone will upgrade without doing that level of
> investigation.
>
> I feel the right approach is to keep both for now and mark checkcount
> as deprecated, with a configuration warning that it will be removed in
> a future release. Thoughts?
That is a very good point, I agree.
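
For the deprecation warning, something along these lines in the
configuration parser would probably be enough. An untested sketch only;
the actual variable names ldirectord uses for the parsed directives
will differ:

    # Untested sketch: $checkcount stands in for however ldirectord
    # stores the parsed directive. Both options keep working; we just
    # nag anyone still using the old one.
    if (defined $checkcount) {
        warn "checkcount is deprecated and will be removed in a " .
             "future release, please use failurecount instead\n";
    }
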
> > > +The number of consecutive times a failure will have to be reported by a
> > > +check before the realserver is removed from the kernel's LVS table. A
> > > +value of 1 will have the realserver removed on the first failure. A
> > > +successful check will reset the failure counter to 0.
> >
> > I think it would be better to say something like "before the realserver
> > is considered to have failed", since with quiescent=yes the server won't
> > actually be removed from the kernel's LVS table.
>
> Agreed.
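
For the archives, my understanding of the agreed semantics boils down
to something like the following. This is an illustrative sketch only,
not the code from the patch, and mark_server_failed() is a made-up
stand-in for the remove-or-quiesce logic:

    my %fail_count;        # consecutive failures per real server
    my $failurecount = 3;  # threshold taken from the configuration

    sub handle_check_result {
        my ($real, $ok) = @_;
        if ($ok) {
            $fail_count{$real} = 0;  # a successful check resets the counter
            return;
        }
        if (++$fail_count{$real} >= $failurecount) {
            # with failurecount=1 this fires on the very first failure
            mark_server_failed($real);
        }
    }
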
>
> [crazy patch snipped]
> > This logic looks a bit hairy, though not unexpectedly so.
> > How well tested is it?
>
> Indeed! It took me a while to unravel exactly what was going on and how
> it needed to behave. We have been running this patch for about 2 weeks
> now without issues. Our heavily loaded servers will still fail checks
> periodically (as evidenced by output in the monitorfile) but they don't
> get removed unless the checks continue to fail. Also, prior to
> deploying it I tried to simulate a number of failure scenarios that we
> might see in our environment (quiescent=yes, fork=yes, connect and
> negotiate check types) and all worked as desired.
>
> I am attaching a patch on top of the previous patch that adds
> deprecation warnings and cleans up the failurecount documentation (I
> don't see a way in hg to export a combined patch).
Thanks. This looks good. I'll review it a bit more closely once
I've had some sleep.