On Fri, Apr 03, 2009 at 03:18:52PM +0200, Timo Schoeler wrote:
> Hello list,
>
> I have some weird phenomena running ldirectord within heartbeat (v2).
>
> Our load balancer provides some VIPs, that in turn point to some real
> IPs of real servers. Ports used are non-standard, as we deployed some
> proprietary stuff, but the only area where this should be taken into
> account is 'how to test the real servers vitality'. However, at the
> moment we check the servers vitality using
>
> checktype = connect
>
> with the following values
>
> # Global Directives
>
> checktimeout=2
> checkinterval=60
>
> # checkcount only works for ping checks!
> checkcount=2
>
> So, AFAICS ldirectord tests the (real) server on port 6789 (e.g.) and,
> if the port is open, it's 'okay' for the load balancer; if it cannot
> connect, the real server is taken out of service (-> quiescent = yes).
>
> Furthermore, the load balancer should execute the connect check once
> every minute... but unfortunately, this doesn't seem to be true.
> I ran tcpdump and checked for TCP connects between the load balancer and
> one of the real servers and saw that the tests did not occur in the
> interval configured in ldirectord's config.

There is a common misconception about how checkinterval works. It does
not ensure that checks are run every checkinterval seconds. Rather, it
tells ldirectord to sleep for checkinterval seconds after each iteration
of checking every real server.

If it takes very little time to check all the real servers, then the way
ldirectord functions converges with the way many people expect it to
behave, and this is often the case. However, the longer the checks take,
the more things diverge. In particular, if checks are timing out or there
are a lot of real servers, it can take ldirectord a long time to test all
services.
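
To put rough numbers on the above, here is a minimal sketch in Python. It
is a toy model of the timing only, not ldirectord code: the function name,
the count of 10 real servers and the per-check durations are all made up
for illustration, and it assumes the checks run one after another.

```python
def cycle_seconds(check_durations, checkinterval):
    """Time from the start of one pass over the real servers to the next.

    Toy model: checks run back to back, then the process sleeps for
    checkinterval seconds. All numbers used below are hypothetical.
    """
    return sum(check_durations) + checkinterval


# 10 real servers that each answer a connect check in about 10 ms
healthy = cycle_seconds([0.01] * 10, checkinterval=2)

# the same 10 real servers, but every check runs into checktimeout=2
timing_out = cycle_seconds([2.0] * 10, checkinterval=2)

print(f"healthy: {healthy:.1f}s between checks of a given real server")
print(f"timing out: {timing_out:.1f}s between checks of a given real server")
```

With fast checks the gap stays close to checkinterval. With every check
timing out, the same checkinterval=2 gives a gap of about 22 seconds in
this model.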

Why does it work like this? Because of the way the code is structured,
it's rather tricky to do anything else. The problem can be mitigated to
some extent using the recently added fork directive, which will fork a
separate ldirectord process for each virtual service.

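As a rough illustration of why forking helps, here is a small Python
sketch. Again this is a toy model and not ldirectord code: the service
names and durations are hypothetical, and it assumes that each forked
process checks only its own virtual service and then sleeps for
checkinterval seconds.

```python
def cycle_by_service(services, checkinterval, fork):
    """Return the modelled gap between passes for each virtual service.

    services maps a (hypothetical) virtual service name to the list of
    durations, in seconds, of the checks of its real servers.
    """
    if fork:
        # one process per virtual service: only its own checks count
        return {name: sum(durations) + checkinterval
                for name, durations in services.items()}
    # single process: every service waits for all checks of all services
    total = sum(sum(durations) for durations in services.values())
    return {name: total + checkinterval for name in services}


services = {
    "vip-a": [0.01] * 4,  # four real servers answering quickly
    "vip-b": [2.0] * 4,   # four real servers running into the timeout
}

without_fork = cycle_by_service(services, checkinterval=2, fork=False)
with_fork = cycle_by_service(services, checkinterval=2, fork=True)

for name in services:
    print(f"{name}: {without_fork[name]:.2f}s without fork, "
          f"{with_fork[name]:.2f}s with fork")
```

In this model the slow service still has a long cycle, but it no longer
delays the checks of the fast one.
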
> We usually have ldirectord configured to do the connect check every two
> seconds (which it also doesn't do). However, we raised the value after
> we had ldirectord.log flooded with entries that show servers taken out
> of service and taken back into service the next check. With a value of 60
> secs this became less of a problem, but still exists.
>
> I'd really appreciate any hint that could
>
> i) make me understand why the (connect) check doesn't happen as expected
> (difference config file <-> real world)
>
> ii) fix the problem of servers taken out of and back into service
> without being 'dead'

Could you try running ldirectord with the -d option, which puts
it into debug mode? This is fairly verbose and ldirectord should
tell you what it thinks it is doing with regard to executing checks
and interpreting the results.

--
Simon Horman
VA Linux Systems Japan K.K., Sydney, Australia Satellite Office
H: www.vergenet.net/~horms/ W: www.valinux.co.jp/en
_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/
LinuxVirtualServer.org mailing list - lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Send requests to lvs-users-request@xxxxxxxxxxxxxxxxxxxxxx
or go to http://lists.graemef.net/mailman/listinfo/lvs-users