Re: [ANNOUNCE] Keepalived 1.1.9

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [ANNOUNCE] Keepalived 1.1.9
From: Tobias Klausmann <klausman@xxxxxxxxxxxxxxx>
Date: Mon, 7 Feb 2005 16:31:24 +0100
Hi! 

On Mon, 07 Feb 2005, Alexandre Cassen wrote:
> 2005-02-07  Alexandre Cassen  <acassen@xxxxxxxxxxxx>
>         * Removed the watchdog framework. Since the scheduling framework
>           supports children, we register a child thread for both the VRRP
>           & Healthcheck processes. When a child dies or stops prematurely,
>           this launches the scheduling callback previously registered.
>           Watchdog is now handled by signaling.
>           (credit goes to Kevin Lindsay, <kevinl@xxxxxxxxxxxxx> for the
>           nice idea).

I had hoped that this would fix the continued (and elusive)
trouble I've had with keepalived. For reference, here's what I'm
plagued with:

I run a load balancer with 24 TCP farms, distributed over ~80
realservers. Most of them are on port 80, a few on other ports. I
use wlc on all of them, and all realservers have the same weight
(8).
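
For illustration, a stripped-down farm entry in my keepalived.conf
looks roughly like the sketch below (the addresses, timeouts and
lb_kind are placeholders, not my real values):

    virtual_server 192.0.2.10 80 {
        delay_loop 10        # seconds between healthcheck rounds
        lb_algo wlc          # weighted least-connection scheduling
        lb_kind DR           # forwarding method; varies per setup
        protocol TCP

        real_server 10.0.0.1 80 {
            weight 8         # all realservers use the same weight
            TCP_CHECK {
                connect_timeout 3
            }
        }
        # ... more real_server blocks; ~80 realservers over 24 farms ...
    }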

The configuration framework I've written allows individual farm
admins to activate and deactivate realservers without intervention
from me. To this end, the cfg backend regenerates keepalived's
config, adding or removing the requested realservers, and then
sends a HUP signal to keepalived (the parent process). Most of the
time this works well, but sometimes keepalived becomes catatonic,
no longer noticing bad realservers and the like.
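
Roughly, the backend's reload step boils down to the following
(script name and paths are placeholders; the real thing does more
sanity checking before installing the new config):

    # regenerate the config with the requested realservers added/removed,
    # then tell the parent keepalived process to re-read it
    generate-keepalived-conf > /etc/keepalived/keepalived.conf
    kill -HUP "$(cat /var/run/keepalived.pid)"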

I *think* this might be the race condition Kevin Lindsay(?)
spotted, but I'm not sure. I had hoped it would be gone with
1.1.9, but it seems it isn't (it just failed again a few minutes
ago).

With 1.1.7, this was visible in that the socket in /tmp was gone
and keepalived complained about being unable to connect to the
wdog socket. With 1.1.9 the problem is the same, but there is no
socket at all anymore, so I can't even see that keepalived has
gone catatonic.

How can I help to resolve this as quickly as possible? The
machine is in production use and our SLA margins are starting to
slip, so I'm very keen on getting this fixed.

Regards,
Tobias

-- 
export DISPLAY=vt100
