Realserver failover problem using ssl and tomcat

To: <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Realserver failover problem using ssl and tomcat
From: "Jason Downing" <jasondowning@xxxxxxxxxxxxxxxx>
Date: Wed, 28 Jun 2006 11:30:33 +1000
Setup:
  - Single linux director with VIP of 192.168.0.240 and RIP of 192.168.0.16
  - Two realservers with RIPs of 192.168.0.14 and 192.168.0.15, called
    realserver1 and realserver2 respectively
  - Total of 3 computers, using LVS-DR (see the ipvsadm sketch after this list)
  - Realservers running tomcat with ssl; each realserver has a copy of the ssl
    certificate, and the director does not have a certificate. Sessions are
    managed with a tomcat cluster.
  - The director has ldirectord running, and no heartbeat/director failover for
    the purposes of this problem
  - ldirectord uses a negotiate page from tomcat to monitor realserver health
  - Debian Sarge current version, 3.1, with a 2.4.27 kernel. uname -r output:
    2.4.27-2-386. All 3 machines use the same o/s.
  - If I connect 3 client computers to my server farm the system works well,
    and the load is balanced.
  - The load balancer can change the client from one server to another
    mid-session and this works fine too.
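
ldirectord drives ipvsadm itself, so the following is only a reconstruction of
the director-side rules it should end up installing (taken from the
ldirectord.cf further down), not something typed in by hand:

# Virtual services on the VIP, round robin, direct routing (gate), weight 1
ipvsadm -A -t 192.168.0.240:443 -s rr
ipvsadm -a -t 192.168.0.240:443 -r 192.168.0.14:443 -g -w 1
ipvsadm -a -t 192.168.0.240:443 -r 192.168.0.15:443 -g -w 1
ipvsadm -A -t 192.168.0.240:10004 -s rr
ipvsadm -a -t 192.168.0.240:10004 -r 192.168.0.14:10004 -g -w 1
ipvsadm -a -t 192.168.0.240:10004 -r 192.168.0.15:10004 -g -w 1
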
The problem: If I disconnect realserver1 by pulling out its ethernet cable,
clients connected to realserver2 are ok, but clients connected to realserver1
are not. By "ok" I mean that a client who is logged into my site (connected to
a tomcat server) can click another link (which requires the client to remain
logged in with a valid session) and the next page loads fine. By "not ok" I
mean that if a client connected to realserver1 clicks the same link, the new
page does not load. BUT: if the client clicks that same link again, the page
does load fine.

Important: A second click loads the page. If I remove realserver1 and then wait 
one minute, the failover is perfect, and the page loads from the new server on 
the first click. It's only a problem in the first 45 seconds or so. I am running 
ldirectord in debug mode, and I can see on the screen that it detects the 
missing server within 4 seconds. It then issues the ipvsadm commands to remove 
it from the pool. I can run ipvsadm -L -n 10 seconds after removing realserver1 
and realserver1 is gone from the server pool. So ipvs has been told that the 
server is down, but it still routes packets to it for another 30 seconds or so.
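
To see what ipvs is still tracking for realserver1 after it has been removed,
the connection table can be listed as well as the service table (a sketch of
the commands, not actual output from my system):

# connection entries, including any still pointing at the removed realserver
ipvsadm -L -c -n
# service/realserver table (the listing I already check above)
ipvsadm -L -n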

I have used tcpdump on realserver2, and the first click does not arrive at it. 
The second click does. I think ipvs is routing the packet incorrectly, and it 
is taking some 30 seconds to implement the ipvsadm command to take realserver1 
out of the pool.
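
For reference, a capture along these lines on realserver2 shows whether the
click arrives (the interface name eth0 is an assumption):

# watch for the client's https packets arriving at realserver2
tcpdump -n -i eth0 port 443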

Is this normal? Is there any kind of setting I can change to make ipvs take 
notice of the ipvsadm commands more quickly?
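
One proc setting that looks relevant (an assumption on my part, and it needs
checking whether the 2.4.27 ipvs supports it) is expire_nodest_conn, which
expires an established connection entry as soon as a packet arrives for a
realserver that has been removed from the table:

# check whether the setting exists and what it is currently set to
cat /proc/sys/net/ipv4/vs/expire_nodest_conn
# expire connection entries to removed realservers on the next packet
echo 1 > /proc/sys/net/ipv4/vs/expire_nodest_conn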

I tried reducing the tcp connection timeout to 5 seconds, and this tended to 
make round robin swap the client from one server to another much more often 
(which is fine), but then when I disconnect realserver1, clients connected to 
realserver2 take 2 clicks to work, as if they were scheduled to change to 
realserver1 just as realserver1 was removed. ipvs still takes 45 seconds to 
register that realserver1 is gone and to stop routing packets to it.
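
Timeouts of this kind are set with ipvsadm's --set option; this is only a
sketch (the tcpfin and udp values below are placeholders, and the current
values can be listed first):

# show current tcp / tcpfin / udp timeouts
ipvsadm -L --timeout
# set the tcp connection timeout to 5 seconds (tcpfin and udp are placeholders)
ipvsadm --set 5 120 300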

If I stop tomcat using /etc/init.d/tomcat stop rather than pulling out the 
network cable, the problem is not there. I wonder if it could have something to 
do with my tomcat sessions; however, this does not explain why the packet does 
not arrive at the working realserver.

ldirectord.cf:

checktimeout=3
negotiatetimeout=3
checkinterval=3
autoreload=no
logfile="/var/log/ldirectord.log"
quiescent=no
#Our custom service on 10004 also running, and taken offline when tomcat is
#taken offline
virtual=192.168.0.240:10004
        real=192.168.0.14:10004 gate 1 "HeartbeatTestPages/Test", "Test Page"
        real=192.168.0.15:10004 gate 1 "HeartbeatTestPages/Test", "Test Page"
        service=http
        scheduler=rr
        protocol=tcp
        checktype=negotiate
        checkport=8080
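# https service on 443; ssl is handled by tomcat on the realservers (the
# director has no certificate), and the health check goes to port 8080 (see
# the note after this block)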
virtual=192.168.0.240:443
        real=192.168.0.14:443 gate 1 "HeartbeatTestPages/Test", "Test Page"
        real=192.168.0.15:443 gate 1 "HeartbeatTestPages/Test", "Test Page"
        service=http
        scheduler=rr
        protocol=tcp
        checktype=negotiate
        checkport=8080

Note that I do the negotiate check on port 8080; otherwise, when realserver1 is 
disconnected, ldirectord pauses (no logs on screen) for a minute or so and does 
not issue the ipvsadm commands to take realserver1 out of the pool until after 
that minute.
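
The check that ldirectord performs can be reproduced by hand against a
realserver; this is only a sketch, using the request path and receive string
from ldirectord.cf above:

# fetch the test page directly from realserver1's port 8080 and look for the
# expected string
wget -q -O - http://192.168.0.14:8080/HeartbeatTestPages/Test | grep "Test Page"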

Output of iptables --list:
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

