 
I've been using a Foundry Networks ServerIron XL until now, with DSR
("Direct Server Response" aka "Direct Response", "DR") to load-balance
one virtual server to six real web servers.
All 6 real servers are identical and have the same weight. When one of
the real web servers gets overloaded, the URL checked on it starts
sending a 500 status code instead of the normal 200. In this case, I
would expect the real server to be taken out of the load-balancing and
all traffic to be sent to the other 5 remaining real servers. But no.
In the above scenario, what I saw in the ServerIron logs was that the
real server was properly detected as "down", but all web traffic got
sent to this server instead of it being taken offline(!!), with nothing
sent to the others. The result: timeouts and some 500 errors for all
clients, and a real server in bad shape, needing to be rebooted in many
cases.
 
Either a massive bug in the ServerIron firmware or a configuration
glitch on your side. Care to post the relevant part of the configuration?
 
I thought this was a bug with the ServerIron. So I looked at LVS.
I implemented a parallel identical setup using LVS and keepalived. The
setup is similar, with LVS-DR and all 6 real web servers. Only the
virtual server IP address changes, obviously, to keep both setups
in parallel.
My limited testing worked fine, but when I started sending real
traffic, the exact same issue as with the ServerIron happened!
Symptoms:
- The virtual IP address no longer responds on port 80; it times out
- The real server having problems gets REALLY overloaded
- All other 5 real servers no longer receive any traffic
- ipvsadm shows no new active or inactive connections (counters stay the
same), only the persistent connections counters decrease slowly (as
expected since there are no new connections...)
Even after restarting the problematic real server, keepalived re-adds
it properly, but nothing works anymore. I need to restart keepalived
(which flushes the ipvsadm configuration by default) for things to
start working again.
 
How exactly do you get your RS to dynamically switch from HTTP response 
code 200 to 500? Have you checked the HTTP response header using a CLI 
tool like curl, lynx or wget? 
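If a CLI tool is not at hand, the same check can be scripted. Below is a
minimal sketch in Python, using only the standard library, that fetches
the status code of the health-check URL. The function name fetch_status
and the host, port and path in the usage comment are placeholders of
mine, not values from your setup.

```python
import http.client


def fetch_status(host, port=80, path="/", timeout=5.0):
    """Send a GET request to host:port/path and return the HTTP status code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        # Only the status line matters here; the body is not read.
        return conn.getresponse().status
    finally:
        conn.close()


# Hypothetical usage: poll one real server directly, bypassing the VIP.
# print(fetch_status("192.0.2.11", 80, "/check"))
```

Running this against each RS directly (not the VIP) while it is under
load should show whether it really alternates between 200 and 500.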
 
I am really confused. I've tried stopping the web daemon on one of the
real servers under production load, and it gets taken out as expected,
and all keeps working fine. It seems that only when the web server still
responds with 500 status and gets detected as down, then up, then down
again etc. does the problem appear. Note that the setup can work fine
for hours and hours, the issue only appears when a real server has a
problem.
 
This however sounds more like a "flapping" or threshold ping-pong issue.
 
I would like to have tried some kind of "keep the real server disabled
for n seconds when it's detected as down" in order to keep the check
from flip-flopping like this, but there is no such setting in
keepalived AFAICS.
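
To make the idea concrete, here is a rough sketch in Python of such a
hold-down: once a check fails, the server stays out of rotation for n
seconds, and every further failure restarts that window. This only
illustrates the logic; it is not something keepalived offers, and the
class name HoldDownCheck and the 60-second value are made up.

```python
class HoldDownCheck:
    """Keep a real server disabled for `hold_down` seconds after a failed check."""

    def __init__(self, hold_down):
        self.hold_down = hold_down
        self.down_until = None  # time until which the server stays disabled

    def update(self, check_ok, now):
        """Record one check result at time `now` (seconds).

        Returns True if the server should be in rotation, False otherwise.
        """
        if not check_ok:
            # Every failure (re)starts the hold-down window.
            self.down_until = now + self.hold_down
            return False
        if self.down_until is not None and now < self.down_until:
            # The check passed, but the hold-down has not expired yet.
            return False
        self.down_until = None
        return True


# Hypothetical check results as (time in seconds, check passed?).
check = HoldDownCheck(hold_down=60)
for now, ok in [(0, True), (5, False), (10, True), (15, False), (20, True), (80, True)]:
    print(now, check.update(ok, now))
```

With these inputs the server is in rotation at t=0, is taken out at t=5,
stays out through the up/down/up results at t=10, 15 and 20, and only
comes back at t=80, after the window restarted at t=15 has expired.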
 
Would it be possible and good enough for you to use the threshold
limitation feature by setting an upper and lower threshold for the
number of active + inactive connections?
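
To illustrate what I mean, here is a small Python sketch of the
hysteresis such a pair of thresholds gives you: a server stops getting
new connections once its count exceeds the upper threshold and is only
served again once the count drops below the lower one. This is a
simulation of the idea, not LVS code; the class name ThresholdGate, the
exact comparisons and the 100/50 values are assumptions for the example.
In ipvsadm the per-real-server options should be --u-threshold and
--l-threshold, but please check the man page of your version.

```python
class ThresholdGate:
    """Hysteresis on a connection count using an upper and a lower threshold."""

    def __init__(self, upper, lower):
        if not 0 <= lower < upper:
            raise ValueError("need 0 <= lower < upper")
        self.upper = upper
        self.lower = lower
        self.overloaded = False

    def accepts_new(self, connections):
        """Return True if the server may receive new connections.

        `connections` is the current number of active + inactive connections.
        """
        if self.overloaded:
            # Stay closed until the count has dropped below the lower threshold.
            if connections < self.lower:
                self.overloaded = False
        elif connections > self.upper:
            # Close the gate as soon as the upper threshold is exceeded.
            self.overloaded = True
        return not self.overloaded


# Hypothetical connection counts observed over time.
gate = ThresholdGate(upper=100, lower=50)
for conns in [80, 101, 80, 50, 49, 100]:
    print(conns, gate.accepts_new(conns))
```

The gap between the two thresholds is what prevents the ping-pong: at 80
connections the server is served before the overload but not right
after it, because it first has to drain below 50.
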
Regards,
Roberto Nibali, ratz
--
echo 
'[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc 