Setup:
a.. Single Linux director with VIP of 192.168.0.240 and DIP of 192.168.0.16
b.. Two realservers with RIP of 192.168.0.14 and RIP of 192.168.0.15, called
realserver1 and realserver2 respectively
c.. Total of 3 computers, using LVS-DR
d.. Realservers running tomcat with ssl, each realserver has a copy of the
ssl certificate, and the director does not have a certificate. Sessions are
managed with a tomcat cluster.
e.. The director has ldirectord running, and no heartbeat/director failover
for purposes of this problem
f.. ldirectord uses a negotiate page from tomcat to monitor realserver health
g.. Debian Sarge current version, 3.1, with 2.4.27 kernel. uname -r output:
2.4.27-2-386. All 3 machines using same o/s
h.. If I connect 3 client computers to my server farm the system works well,
and the load is balanced.
i.. The load balancer can change the client from one server to another
mid-session and this works fine too.
The problem: If I disconnect realserver1 by pulling out its ethernet cable,
clients connected to realserver2 are ok, but clients connected to realserver1
are not. When I say ok, I mean that if a client is logged into my site,
(connected to a tomcat server) the client can click another link (which
requires the client to remain logged into my site and the session valid) and
another page loads up fine. When I say not ok, I mean that if a client
connected to realserver1 clicks the same link the new page does not load. BUT:
if the client clicks that same link again the page does load up fine.
Important: A second click loads the page. If I remove realserver1 and then wait
one minute, the failover is perfect, and the page loads from the new server on
the first click. It's only a problem in the first 45 seconds or so. I am running
ldirectord in debug mode, and I can see on the screen that it detects the
missing server within 4 seconds. It then issues the ipvsadm commands to remove
it from the pool. I can run ipvsadm -L -n 10 seconds after removing realserver1
and realserver1 is gone from the server pool. So ipvs has been told that the
server is down, but it still routes packets to it for another 30 seconds or so.
I have used tcpdump on realserver2, and the first click does not arrive at it.
The second click does. I think ipvs is routing the packet incorrectly, and it
is taking some 30 seconds to implement the ipvsadm command to take realserver1
out of the pool.
Is this normal? Is there any kind of setting I can change to make ipvs take
notice of the ipvsadm commands more quickly?
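One way to tell whether IPVS is slow to apply the ipvsadm command, or is instead still using connection entries created before realserver1 was removed, is to list the connection table right after pulling the cable. The sketch below assumes the second explanation is worth checking; it is not a confirmed diagnosis. `count_conns_to` is a hypothetical helper, and it assumes the usual six-column layout of `ipvsadm -L -n -c` (pro, expire, state, source, virtual, destination).

```shell
#!/bin/sh
# Hypothetical helper: count IPVS connection entries whose destination
# is the given realserver IP. Reads `ipvsadm -L -n -c` output on stdin.
count_conns_to() {
    awk -v rip="$1" '
        # Data rows start with the protocol name; header lines do not.
        ($1 == "TCP" || $1 == "UDP") && index($6, rip ":") == 1 { n++ }
        END { print n + 0 }
    '
}

# Usage on the director (needs root), after disconnecting realserver1:
#   ipvsadm -L -n -c | count_conns_to 192.168.0.14
# A non-zero count while `ipvsadm -L -n` no longer lists 192.168.0.14
# would suggest that old entries are still being used for those clients.
#
# On kernels that provide it, this sysctl controls whether such entries
# are expired as soon as their destination is gone (0 = keep, 1 = expire):
#   cat /proc/sys/net/ipv4/vs/expire_nodest_conn
```

If the count drops to zero at about the same time as the second click starts working, the delay is the lifetime of the old entries rather than the removal command itself.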
I tried reducing the tcp connection timeout to 5 seconds, and this tended to
make round robin swap the client from one server to another much more often
(which is fine), but then when I disconnect realserver1 clients connected to
realserver2 take 2 clicks to work, as if they were scheduled to change to
realserver1 just as realserver1 was removed. ipvs still takes 45 seconds to
register that realserver1 is gone and stop routing packets to it.
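For reference, this is roughly how the timeout change can be made with ipvsadm. It is a sketch, assuming the value being changed is the IPVS TCP session timeout; `set_tcp_timeout` is a hypothetical wrapper, and a value of 0 passed to `--set` leaves that timeout unchanged.

```shell
#!/bin/sh
# Hypothetical wrapper: set only the IPVS TCP session timeout (seconds).
# The three arguments to --set are: tcp, tcpfin, udp; 0 keeps the current value.
set_tcp_timeout() {
    ipvsadm --set "$1" 0 0
}

# Usage on the director (needs root):
#   ipvsadm -L --timeout      # show the current tcp / tcpfin / udp timeouts
#   set_tcp_timeout 5         # the 5 second value tried above
```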
If I stop tomcat using /etc/init.d/tomcat stop rather than pulling out the
network cable the problem is not there. I wonder if it could have to do with my
tomcat sessions, however this does not explain why the packet does not arrive
at the working realserver.
ldirectord.cf:
checktimeout=3
negotiatetimeout=3
checkinterval=3
autoreload=no
logfile="/var/log/ldirectord.log"
quiescent=no
#Our custom service on 10004 also running, and taken offline when tomcat is
#taken offline
virtual=192.168.0.240:10004
    real=192.168.0.14:10004 gate 1 "HeartbeatTestPages/Test", "Test Page"
    real=192.168.0.15:10004 gate 1 "HeartbeatTestPages/Test", "Test Page"
    service=http
    scheduler=rr
    protocol=tcp
    checktype=negotiate
    checkport=8080
virtual=192.168.0.240:443
    real=192.168.0.14:443 gate 1 "HeartbeatTestPages/Test", "Test Page"
    real=192.168.0.15:443 gate 1 "HeartbeatTestPages/Test", "Test Page"
    service=http
    scheduler=rr
    protocol=tcp
    checktype=negotiate
    checkport=8080
Note that I do the negotiate check on port 8080, as otherwise ldirectord pauses
(no logs on screen) for a minute or so when realserver1 is disconnected and
does not issue the ipvsadm commands to take realserver1 out of the pool until
after 1 minute.
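The negotiate check can be reproduced by hand from the director, which helps separate ldirectord's detection time from whatever IPVS does afterwards. This is a minimal sketch, assuming the check amounts to fetching the request page over plain HTTP on port 8080 and looking for the receive string; `check_page` is a hypothetical helper, and the 3 second limit mirrors negotiatetimeout above.

```shell
#!/bin/sh
# Hypothetical helper: fetch a page and succeed only if the body contains
# the expected string, giving up after a timeout in seconds (default 3).
check_page() {
    url=$1
    expect=$2
    timeout=${3:-3}
    curl -s --max-time "$timeout" "$url" | grep -q -F -- "$expect"
}

# Usage from the director, mirroring the real= lines above:
#   check_page http://192.168.0.14:8080/HeartbeatTestPages/Test "Test Page" 3 \
#       && echo up || echo down
```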
output of #iptables --list:
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination