Hi,
I am using LVS-DR to load balance a TCP-based application, and it appears that
the LVS is losing connection information (perhaps entries are timing out and
being removed from the connection hash table?). I've included some tcpdump/snoop
output at the end.
Is there any way to dump the connection table to see if that's really the case?
We have a client app that makes a connection to the server app, running on port
12000.
Everything works great except that over time, when I do a netstat on the server,
I see a lot of connections in the ESTABLISHED state, but when I do a netstat on
the client, it only shows the connections that are really there (only 1
per client).
I think that the server app could use TCP_KEEPALIVE to remedy this situation,
but my developers say that it isn't supported in the version of Java they are
using (1.2.2, I believe).
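For reference, here is a minimal sketch of what the keepalive option might look like on the server side. It assumes a newer JDK than ours: java.net.Socket.setKeepAlive() was only added in Java 1.3, which matches what the developers said about 1.2.2. The class name KeepAliveServer and the helper acceptWithKeepAlive are made up for illustration and are not from our app. As far as I know, the probe interval is set by the OS rather than by Java, and the usual default is about two hours.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical sketch: enable SO_KEEPALIVE on every accepted connection.
public class KeepAliveServer {

    // Accept one connection and turn on TCP keepalive probes for it.
    static Socket acceptWithKeepAlive(ServerSocket listener) throws IOException {
        Socket s = listener.accept();
        s.setKeepAlive(true); // requires Java 1.3 or later
        return s;
    }

    public static void main(String[] args) throws IOException {
        // Port 12000 is the port our server app listens on.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 12000;
        try (ServerSocket listener = new ServerSocket(port)) {
            while (true) {
                Socket s = acceptWithKeepAlive(listener);
                System.out.println("accepted " + s.getRemoteSocketAddress()
                        + " keepalive=" + s.getKeepAlive());
                // A real server would hand the socket to its handler here;
                // this sketch just closes it.
                s.close();
            }
        }
    }
}
```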
Our client app is designed to hold a connection open to the server indefinitely,
even though it may not transmit data over that connection for a very long time.
A few possible resolutions:
1. If it's an LVS timeout issue, crank up the relevant timeout value to more
than 1 day (I'm not sure which timeout that is).
2. Modify the client to send a "hello there" to the server every so often
(the developers are not happy about this one).
3. Upgrade Java and use TCP_KEEPALIVE on the server app.
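To make option 2 concrete, here is a rough sketch of a client-side "hello there" timer. It is only an illustration under assumptions: the class name Heartbeat, the HELLO message, and the start() helper are hypothetical, the server would have to recognize and ignore the message, and it uses java.util.concurrent, which needs a much newer JDK than 1.2.2.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: periodically write a no-op message on an idle connection.
public class Heartbeat {

    // Made-up message; the real protocol would have to define its own no-op.
    static final byte[] HELLO = "HELLO\n".getBytes(StandardCharsets.US_ASCII);

    // Start writing HELLO to 'out' every periodMillis milliseconds.
    // Returns the timer so the caller can shut it down with the connection.
    static ScheduledExecutorService start(final OutputStream out, long periodMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            try {
                // Other writers should lock on 'out' too, so that messages
                // do not interleave with application data.
                synchronized (out) {
                    out.write(HELLO);
                    out.flush();
                }
            } catch (IOException e) {
                // The connection is gone; stop sending.
                timer.shutdown();
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return timer;
    }
}
```

The client would call something like Heartbeat.start(socket.getOutputStream(), 10 * 60 * 1000L) after connecting, with a period shorter than whatever idle timeout turns out to apply on the LVS.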
FYI: the client is Linux, the LVS is Red Hat (2.2.16-3 kernel), and the
realserver is Solaris.
Any suggestions? comments?
Thanks,
-Ray
Other Details:
First, some ipvsadm info. (It's a test env, which is why there's only 1 server.)
[root@lvs1 rayp]# ipvsadm
IP Virtual Server version 0.9.12 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  dc84.digitalcyclone.com:12000 wlc
  -> dc29.digitalcyclone.com:12000  Route   1      0          0
For a test, I killed one of the clients off:
**STEP 1: determine which port (1939) to monitor and which PID (1899) to kill.
[dc@sdcs ~]$ netstat -ap | grep 12000
tcp 0 0 sdcs:1939 gridsvr:12000 ESTABLISHED 1899/dcs
[dc@sdcs ~]$ kill 1899
**NOW IT'S DEAD (or dying anyway)
[dc@sdcs ~]$ netstat -ap | grep 12000
tcp 0 62 sdcs:1939 gridsvr:12000 FIN_WAIT1 -
**STEP 2: netstat on gridsvr (notice that it's still there in ESTABLISHED, will
be there forever)
gridsvr# netstat -a | grep 1939
gridsvr.12000 sdcs.1939 32120 0 10136 0 ESTABLISHED
**STEP 3: Here are the TCPDUMPS running on the 3 machines involved (while I
issue the kill of the client)
** ON THE CLIENT. packets go out to gridsvr
[root@sdcs /root]# tcpdump -i eth0 tcp port 1939
Kernel filter, protocol ALL, datagram packet socket
tcpdump: listening on eth0
16:42:02.130022 > sdcs.1939 > gridsvr.12000: P 483508496:483508557(61) ack
1618010192 win 32120 <no)
16:44:02.130022 > sdcs.1939 > gridsvr.12000: P 0:61(61) ack 1 win 32120
<nop,nop,timestamp 71571933 249897)
16:46:02.130022 > sdcs.1939 > gridsvr.12000: P 0:61(61) ack 1 win 32120
<nop,nop,timestamp 71583933 249897)
** ON THE LVS. packets come into LVS.
[root@lvs1 rayp]# tcpdump -i eth0 tcp port 1939 and host 216.245.140.89
Kernel filter, protocol ALL, datagram packet socket
tcpdump: listening on eth0
05:43:36.735041 < sdcs.1939 > gridsvr.com.12000: P 483508496:483508557(61) ack
1618010192 win 32120 <nop,nop,timestamp 71559933 249897356> (DF)
05:45:36.741938 < sdcs.1939 > gridsvr.12000: P 0:61(61) ack 1 win 32120
<nop,nop,timestamp 71571933 249897356> (DF)
05:47:36.748837 < sdcs.1939 > gridsvr.12000: P 0:61(61) ack 1 win 32120
<nop,nop,timestamp 71583933 249897356> (DF)
** ON THE SERVER. (anyone home...?)
gridsvr# snoop tcp port 1939
Using device /dev/hme (promiscuous mode)
(nothing)