Hi,
I've been using a Foundry Networks ServerIron XL until now, with DSR
("Direct Server Return", aka "DR") to load-balance one virtual server
to six real web servers.
All 6 real servers are identical and have the same weight. When one of
the real web servers gets overloaded, the URL being health-checked on
it starts returning a 500 status code instead of the normal 200. In
that case, I would expect the real server to be taken out of the
load-balancing and all traffic to be sent to the 5 remaining real
servers. But no.
In the above scenario, what I saw in the ServerIron logs was that the
real server was properly detected as "down", but instead of being
taken offline(!!), it kept receiving all the web traffic while nothing
was sent to the others. The result: timeouts and 500 errors for all
clients, and a real server in such bad shape that it often needed a
reboot.
I thought this was a bug with the ServerIron. So I looked at LVS.
I implemented a parallel, identical setup using LVS and keepalived:
the same topology, with LVS-DR and all 6 real web servers. Only the
virtual server IP address changes, obviously, so that both setups can
run in parallel.
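For reference, the relevant part of my keepalived.conf looks roughly
like this (the IPs, check URL and timing values below are made up, but
the structure is the same):

  virtual_server 192.168.0.100 80 {
      delay_loop 10
      lb_algo rr
      lb_kind DR
      persistence_timeout 300
      protocol TCP

      real_server 192.168.0.11 80 {
          weight 1
          HTTP_GET {
              url {
                  path /check.php
                  status_code 200
              }
              connect_timeout 3
              nb_get_retry 2
              delay_before_retry 2
          }
      }
      # ... plus five more identical real_server blocks ...
  }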
My limited testing worked fine, but when I started sending real
traffic, the exact same issue as with the ServerIron happened!
Symptoms:
- The virtual IP address no longer responds on port 80, it times out
- The real server having problems gets REALLY overloaded
- The other 5 real servers no longer receive any traffic
- ipvsadm shows no new active or inactive connections (the counters
  stay the same), only the persistent connection counters decrease
  slowly (as expected, since there are no new connections...)
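For clarity, the counters I mean are the per-real-server ActiveConn
and InActConn columns, which I keep an eye on with something like:

  watch -n 1 'ipvsadm -L -n'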
Even after restarting the problematic real server, keepalived re-adds
it properly, but nothing works anymore. I need to restart keepalived
(which flushes the ipvsadm configuration by default) for things to
start working again.
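So the only recovery that works for me is roughly:

  service keepalived stop
  ipvsadm -C    # flush any leftover IPVS state, just in case
  service keepalived start

The explicit "ipvsadm -C" shouldn't even be needed, since keepalived
removes its virtual servers on stop, but I run it anyway.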
I am really confused. I've tried stopping the web daemon on one of the
real servers under production load, and it gets taken out as expected
and everything keeps working fine. The problem seems to appear only
when the web server keeps answering, but with a 500 status, so it gets
detected as down, then up, then down again, etc. Note that the setup
can work fine for hours and hours; the issue only appears when a real
server has a problem.
I would have liked to try some kind of "keep the real server disabled
for n seconds once it's detected as down" setting, in order to keep
the check from flip-flopping like this, but there is no such setting
in keepalived AFAICS.
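The closest workaround I can think of would be to replace the HTTP_GET
check with a MISC_CHECK calling a small script that implements the
hold-down itself. A completely untested sketch (the check URL, paths
and hold-down value are made up):

  #!/bin/sh
  # check_holddown.sh <real-server-ip>
  # MISC_CHECK convention: exit 0 = server up, non-zero = down.
  # Once the URL check fails, keep reporting "down" for HOLDDOWN
  # seconds after the last failure, even if the URL answers 200 again.
  HOLDDOWN=120
  STAMP="/var/run/holddown.$1"

  if curl -s -f -o /dev/null "http://$1/check.php"; then
      # URL is fine; only report "up" once the hold-down has expired
      if [ -f "$STAMP" ]; then
          now=$(date +%s)
          last=$(cat "$STAMP")
          [ $((now - last)) -lt $HOLDDOWN ] && exit 1
          rm -f "$STAMP"
      fi
      exit 0
  else
      # curl -f turns any status >= 400 into a failure: remember
      # when it happened and report "down"
      date +%s > "$STAMP"
      exit 1
  fi

with, in each real_server block of keepalived.conf:

  MISC_CHECK {
      misc_path "/usr/local/bin/check_holddown.sh 192.168.0.11"
  }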
Has anyone already seen a similar problem? I've read many posts in the
archives regarding LVS-DR issues, but haven't seen anyone complaining
about the same thing, and it seems like LVS-DR setups work fine for
many people!
Details:
- RHEL4 i386 fully updated (2.6.9-42.0.3.ELsmp) on all servers
- lighttpd and PHP web servers
- ipvsadm 1.24-6 (from RHN)
- keepalived 1.1.13
- virtual server IP address configured as lo:0 on all real servers
- sysctl changes made on the real servers (exact commands below):
    net.ipv4.conf.eth0.arp_ignore = 2 (public LAN)
    net.ipv4.conf.eth1.arp_ignore = 2 (private LAN)
- each real server outputs approx. 3 Mbps
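Concretely, each real server gets roughly this applied at boot (the
VIP here is made up):

  ifconfig lo:0 192.168.0.100 netmask 255.255.255.255 up
  sysctl -w net.ipv4.conf.eth0.arp_ignore=2
  sysctl -w net.ipv4.conf.eth1.arp_ignore=2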
I've tried the sh, lc and rr schedulers, but the same thing happens
with all of them.
...help!? :-)
Matthias
--
Clean custom Red Hat Linux rpm packages : http://freshrpms.net/
Fedora Core release 6 (Zod) - Linux kernel 2.6.19-1.2895.fc6
Load : 0.08 0.08 0.08