Hi!
On Fri, 06 Jul 2007, Gerry Reno wrote:
> Tobias Klausmann wrote:
> > First, it seems it's no longer triggered by config reloads but
> > "just happens". Also, it happens very infrequently, maybe once a
> > month, probably even less often - that is, over the five[0]
> > productive and one test LBs, so statistically, it probably
> > happens once or twice a year on a single LB.
> >   
> Infrequent, spurious problems are tough.
Indeed. And running keepalived in gdb or with strace for months
on end is not really an option.
> > As such, it's pretty much impossible to reproduce. The symptoms
> > are slightly different, too: keepalived *looks* okay, but it just
> > doesn't see when a server disappears. Also, it eventually starts
> > ignoring HUP completely. It's not completely frozen though: it
> > keeps doing checks.
> >   
> How do you detect the condition? Are you monitoring keepalived somehow?
> What actions are necessary to recover?
Usually, we find out when someone loses a server and requests are
directed to that server although it shouldn't happen. We're doing
some monitoring, but this condition is usually detected by
monitoring the service itself (and hoping it's "lucky" enough to
get sent to the b0rken server). 
Restarting keepalived (and thus nuking *all* services on the LB)
fixes the problem. Naturally, restarting every N hours, days,
weeks is not an option as all sessions would be ripped apart. And
it's not really a clean solution, either.
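
One way to catch this condition without relying on luck might be a
small watchdog that runs independently of keepalived: read the IPVS
table, probe every real server that is still in service, and report
the ones that do not answer. What follows is only a minimal sketch,
assuming TCP services, a root shell (ipvsadm needs it) and that a
plain TCP connect is a good enough health check. The function names
(parse_real_servers, tcp_alive, stale_servers) are made up for this
example and are not part of keepalived or ipvsadm.

```python
import socket
import subprocess


def parse_real_servers(ipvsadm_output):
    """Return [(host, port, weight)] for each real-server line of `ipvsadm -L -n`."""
    servers = []
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        # Real-server lines look like: "-> 10.0.0.1:80 Route 1 0 0".
        # The column header also starts with "->", but its weight field
        # is the word "Weight", so the isdigit() test skips it.
        if len(fields) < 4 or fields[0] != "->" or not fields[3].isdigit():
            continue
        host, _, port = fields[1].rpartition(":")
        servers.append((host.strip("[]"), int(port), int(fields[3])))
    return servers


def tcp_alive(host, port, timeout=3.0):
    """Assumed health check: the server counts as alive if a TCP connect succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stale_servers(ipvsadm_output, probe=tcp_alive):
    """Real servers that are still in service (weight > 0) but fail the probe."""
    return [
        (host, port)
        for host, port, weight in parse_real_servers(ipvsadm_output)
        if weight > 0 and not probe(host, port)
    ]


if __name__ == "__main__":
    table = subprocess.run(
        ["ipvsadm", "-L", "-n"], capture_output=True, text=True, check=True
    ).stdout
    for host, port in stale_servers(table):
        print(f"STALE: {host}:{port} is in service but does not answer")
```

Run from cron every minute or so, this would at least shorten the
time until someone notices. It does not explain why keepalived stops
reacting, and it only reports; it does not touch the IPVS table.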
> > Another odd thing I've witnessed: if you tell keepalived to bind
> > to an IP (for the checks) that isn't configured, it will complain
> > a bit but still continue trying - and leave everything in
> > service. I think it should either complain more loudly or take
> > everything out of service as not being able to check is about the
> > same as everything being down.
>   
> Have you discussed this with keepalived team?
Not yet. I planned to do that just after I've seen how ldirectord
works. Maybe setting it up and testing it would yield insight into
how/why/when keepalived fails. I usually suspect errors in my
methodology first. :)
Another problem with reporting is that the info I can give is
quite vague (as you've seen). I always try to be as helpful as
possible when reporting a bug. The absence of reproducibility (is
that even a word?) annoys me most.
Regards,
Tobias
-- 
In the future, everyone will be anonymous for 15 minutes.