
Re: [lvs-users] Problems with ldirectord: Doesn't check like advised in

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [lvs-users] Problems with ldirectord: Doesn't check like advised in config, real servers not dead but taken out of service
Cc: Simon Horman <horms@xxxxxxxxxxxx>
From: Timo Schoeler <timo.schoeler@xxxxxxxxxxxxx>
Date: Wed, 15 Apr 2009 09:05:25 +0200
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

| On Fri, Apr 03, 2009 at 03:18:52PM +0200, Timo Schoeler wrote:
|> Hello list,
|>
|> I'm seeing some weird phenomena running ldirectord within heartbeat (v2).
|>
|> Our load balancer provides some VIPs, that in turn point to some real
|> IPs of real servers. Ports used are non-standard, as we deployed some
|> proprietary stuff, but the only area where this should be taken into
|> account is 'how to test the real servers vitality'. However, at the
|> moment we check the servers vitality using
|>
|> checktype = connect
|>
|> with the following values
|>
|> # Global Directives
|>
|> checktimeout=2
|> checkinterval=60
|>
|> # checkcount only works for ping checks!
|> checkcount=2
|>
|> So, AFAICS ldirectord tests the (real) server on port 6789 (e.g.) and,
|> if the port is open, it's 'okay' for the load balancer; if it cannot
|> connect, the real server is taken out of service (-> quiescent = yes).
|>
|> Furthermore, the load balancer should execute the connect check once
|> every minute... but unfortunately, this doesn't seem to be true.
|
|> I ran tcpdump and checked for TCP connects between the load balancer and
|> one of the real servers and saw that the tests did not occur in the
|> interval configured in ldirectord's config.
|
| There is a common misconception surrounding how checkinterval works.
| It does not ensure that checks are run every checkinterval seconds.
| Rather, it tells ldirectord to sleep for checkinterval seconds
| after each iteration of checking every real-server.

Okay, thank you very much for this information -- it was the point I was
missing. Maybe the man page is not too clear on this...

| If it takes a very short amount of time to check all the real-servers, then
| the way ldirectord functions converges with the way many people expect it
| to behave - this is often the case. However, the longer the checks take,
| the more things diverge. In particular, if things are timing-out or there
| are a lot of real-servers, it can take ldirectord a long time to test all
| services.

That was what I saw here, yes. Our setup is not too huge, but it takes
some time to test the machines.
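
To make the timing concrete, here is a minimal sketch of the behaviour
described above: one loop that runs every check in turn and only then
sleeps for checkinterval seconds. The function name and the per-check
durations are made up for illustration; this is a model of the
description, not ldirectord's actual code.

```python
def cycle_period(check_durations, checkinterval):
    """Seconds between two successive checks of the same real server.

    Models one loop iteration: run every check one after another,
    then sleep for checkinterval seconds.
    """
    return sum(check_durations) + checkinterval


# Hypothetical setup: 10 real servers, checkinterval=2, checktimeout=2.
all_fast = [0.01] * 10                    # every connect succeeds quickly
some_slow = [0.01] * 7 + [2.0] * 3        # three connects run into the timeout

print(f"all fast:  {cycle_period(all_fast, 2):.2f}s between checks")
print(f"some slow: {cycle_period(some_slow, 2):.2f}s between checks")
```

With quick checks the period stays close to checkinterval; as soon as a
few checks run into checktimeout, the gap between checks of any one
server grows by the sum of those timeouts.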

| Why does it work like this? Because the way the code is structured it's
| rather tricky to do anything else. Though the problem can be mitigated to
| some extent using the recently added fork directive, which will fork a
| separate ldirectord process for each virtual service.

Well, this works perfectly for me. I jumped in and enabled it
straight away on our production system; it spawned lots of children
and now the checks happen when expected. So this is a huge leap forward,
thanks again.
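
As a rough illustration of why forking helps (as far as I know the
directive is written as "fork = yes" in the global section of the
config), the sketch below compares the check period per virtual service
with and without one process per service. The service names and
durations are hypothetical, and the model only assumes what was
described above: each process checks its own real servers in turn and
then sleeps for checkinterval seconds.

```python
def period_without_fork(services, checkinterval):
    """One process checks every real server of every service, then sleeps."""
    total = sum(sum(durations) for durations in services.values())
    return {name: total + checkinterval for name in services}


def period_with_fork(services, checkinterval):
    """One process per virtual service; each only waits for its own checks."""
    return {name: sum(durations) + checkinterval
            for name, durations in services.items()}


# Hypothetical virtual services with per-real-server check durations (seconds).
services = {
    "vip-a:6789": [0.01, 0.01],   # healthy, fast connects
    "vip-b:6790": [2.0, 2.0],     # both checks run into checktimeout=2
}

for label, model in (("no fork", period_without_fork),
                     ("fork", period_with_fork)):
    for name, period in model(services, 2).items():
        print(f"{label:8s} {name}: {period:.2f}s between checks")
```

In this model a slow service no longer delays the checks of a healthy
one, which matches what I observed after enabling the directive.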

|> We usually have ldirectord configured to do the connect check every two
|> seconds (which it also doesn't do). However, we raised the value after
|> we had ldirectord.log flooded with entries that show servers taken out
|> of service and taken back into service at the next check. With a value of 60
|> secs this became less of a problem, but it still exists.
|>
|> I'd really appreciate any hint that could
|>
|> i) make me understand why the (connect) check doesn't happen as expected
|> (difference config file <-> real world)
|>
|> ii) fix the problem of servers taken out of and back into service
|> without being 'dead'
|
| Could you try running ldirectord with the -d option, which puts
| it into debug mode? This is fairly verbose and ldirectord should
| tell you what it thinks it is doing with regards to executing checks
| and interpreting the results.

The problem is that I have a lab system for testing and a production
system, the latter one being still 'more complex' as I'm not able to
(simply) emulate all the real servers (and the network devices) in the lab.

One problem -- that of 'unexpected'/'delayed' tests -- is fixed; another
one still remains: Servers are taken out of service ('Quiescent real
server') and put back again (restored) even when they were reachable...
They are usually put back into service when the next check occurs. I
run a script (using nc(1)) in parallel to see whether the port(s)
checked (by ldirectord) were really offline or not; in my case, the
ports are reachable and should NOT trigger ldirectord to quiesce a
(real) server...

I can't turn on debugging on the production system; I'm almost sure it
would render the machine unusable (given the sheer mass of output
it'd generate).

However, is this phenomenon a known one? Is there something like a
're-test' before turning a server quiescent (like 'checkcount' for ping
checks)?
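
To illustrate the kind of 're-test' I mean, here is a minimal sketch of
a connect check that only reports failure after several consecutive
failed attempts. This is not an existing ldirectord feature as far as I
know; the function names, the retry count, the delay and the example
address are all assumptions made for illustration.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_with_retries(probe, attempts=2, delay=0.5, sleep=time.sleep):
    """Call probe() up to `attempts` times; succeed on the first True.

    Only when every attempt fails is the server reported as down, so a
    single dropped connect does not take it out of service.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # short pause before re-testing
    return False


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address; replace with a real server.
    alive = check_with_retries(lambda: tcp_probe("192.0.2.10", 6789))
    print("alive" if alive else "down")
```

The probe is passed in as a callable so the retry logic can be tried
out without touching the network.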

I apologize for the delay of my answer, I had lots of things to manage
due to Easter Holidays...

Thanks in advance & best,

Timo

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with CentOS - http://enigmail.mozdev.org

iD8DBQFJ5Yc1fg746kcGBOwRApg8AKCKUEzfy7uQ0vGXxqoKmGZBxzvh8gCfYZ5a
V67Q2mZ8Uokh8PD4rUkm3+Q=
=8BP+
-----END PGP SIGNATURE-----

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Send requests to lvs-users-request@xxxxxxxxxxxxxxxxxxxxxx
or go to http://lists.graemef.net/mailman/listinfo/lvs-users
