Hi Julian,
Here are my vs settings:
# grep . /proc/sys/net/ipv4/vs/*
/proc/sys/net/ipv4/vs/am_droprate:10
/proc/sys/net/ipv4/vs/amemthresh:2048
/proc/sys/net/ipv4/vs/cache_bypass:0
/proc/sys/net/ipv4/vs/debug_level:0
/proc/sys/net/ipv4/vs/drop_entry:0
/proc/sys/net/ipv4/vs/drop_packet:0
/proc/sys/net/ipv4/vs/expire_nodest_conn:0
/proc/sys/net/ipv4/vs/nat_icmp_send:0
/proc/sys/net/ipv4/vs/secure_tcp:0
/proc/sys/net/ipv4/vs/sync_threshold:3
/proc/sys/net/ipv4/vs/timeout_close:10
/proc/sys/net/ipv4/vs/timeout_closewait:60
/proc/sys/net/ipv4/vs/timeout_established:480
/proc/sys/net/ipv4/vs/timeout_finwait:60
/proc/sys/net/ipv4/vs/timeout_icmp:60
/proc/sys/net/ipv4/vs/timeout_lastack:30
/proc/sys/net/ipv4/vs/timeout_listen:120
/proc/sys/net/ipv4/vs/timeout_synack:100
/proc/sys/net/ipv4/vs/timeout_synrecv:10
/proc/sys/net/ipv4/vs/timeout_synsent:60
/proc/sys/net/ipv4/vs/timeout_timewait:60
/proc/sys/net/ipv4/vs/timeout_udp:180
No firewall rules, fwmarking, NAT, or bridging. No extra patches to IPVS.
CONFIG_IP_NF_IPTABLES, CONFIG_IP_NF_NAT, and CONFIG_BRIDGE are built as kernel
modules, and none of them are loaded.
# lsmod
Module Size Used by Tainted: P
ip_vs_ftp 5956 0
ip_vs_wlc 1604 4 (autoclean)
ip_vs 73812 7 (autoclean) [ip_vs_ftp ip_vs_wlc]
sg 36460 0 (autoclean)
dcdesm 36124 1
dcdbas 40184 1
autofs 13460 0 (autoclean) (unused)
bcm5700 106952 0 (unused)
e100 57028 2
> From your explanation ip_vs_ftp leads to problems where SYN
> creates web connection, it is hashed in table, DNAT-ed to RS, then RS
> replies SYN+ACK which can not match the connection in table, it looks
> like this connection is not present (may be removed, do you see something
> in debug logs from the SYN to the SYN+ACK) or hash table is damaged.
The above sounds correct. Once again, here is the debug log. The lookup for the
incoming packet hits, but the lookup for the outgoing packet does not. See my
first email for the tcpdump output.
Aug 13 03:20:43 kernel: IPVS: lookup/in TCP 216.220.XX.XXX:9345->10.99.23.64:80 hit
Aug 13 03:20:43 kernel: IPVS: Incoming TCP 216.220.XX.XXX:9345->10.99.23.64:80
Aug 13 03:20:43 kernel: Enter: ip_vs_nat_xmit, ip_vs_conn.c line 680
Aug 13 03:20:43 kernel: IPVS: NAT to 10.99.22.53:80
Aug 13 03:20:43 kernel: Leave: ip_vs_nat_xmit, ip_vs_conn.c line 820
Aug 13 03:20:43 kernel: Enter: ip_vs_out, ip_vs_core.c line 646
Aug 13 03:20:43 kernel: IPVS: lookup/out TCP 10.99.22.53:80->216.220.XX.XXX:9345 not hit
Aug 13 03:20:43 kernel: IPVS: packet for TCP 216.220.XX.XXX:9345 continue traversal as normal.
Aug 13 03:20:43 kernel: Enter: ip_vs_out, ip_vs_core.c line 646
Aug 13 03:20:43 kernel: IPVS: lookup/out TCP 216.220.XX.XXX:9345->10.99.22.53:80 not hit
Aug 13 03:20:43 kernel: IPVS: packet for TCP 10.99.22.53:80 continue traversal as normal.
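In case it helps with sifting through a longer capture, here is a rough sketch of how such a log could be scanned for this pattern: an inbound lookup that hits and is NAT-ed to a real server, followed by a reply from that real server whose lookup/out misses. It assumes the debug lines look exactly like the ones above. The function name and the pairing of each "NAT to" line with the most recent inbound hit are my own simplifications, not anything IPVS provides, and the pairing can be wrong when traffic is interleaved.

```python
import re

# Matches the two kinds of debug lines shown above. Whitespace between
# fields may include a newline, because mail clients wrap long log lines.
EVENT_RE = re.compile(
    r"IPVS: lookup/(?P<dir>in|out)\s+\S+\s+"
    r"(?P<src>\S+?)->(?P<dst>\S+)\s+(?P<res>not hit|hit)"
    r"|IPVS: NAT to\s+(?P<rs>\S+)"
)


def find_untranslated_replies(log_text):
    """Return (client, real_server) pairs whose reply missed lookup/out.

    Heuristic (an assumption): a "NAT to" line belongs to the most
    recent lookup/in hit that precedes it in the log.
    """
    nat_map = {}        # client addr:port -> real server addr:port
    last_client = None  # client of the latest inbound hit, if any
    missed = []
    for m in EVENT_RE.finditer(log_text):
        if m.group("rs"):
            if last_client is not None:
                nat_map[last_client] = m.group("rs")
                last_client = None
        elif m.group("dir") == "in":
            hit = m.group("res") == "hit"
            last_client = m.group("src") if hit else None
        elif m.group("res") == "not hit":
            src, dst = m.group("src"), m.group("dst")
            # A reply travels real server -> client, so the client is dst.
            if nat_map.get(dst) == src:
                missed.append((dst, src))
    return missed
```

Fed the excerpt above, this should report the single pair of client 216.220.XX.XXX:9345 and real server 10.99.22.53:80. The second lookup/out miss, which runs in the client-to-server direction, is ignored.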
> Do you still think it is caused by ip_vs_ftp? About your tests, is the
> client IP on lan? Do you think this client IP has many connections to
> the director?
The client IP is not on the LAN. The problem occurs from any source IP trying
to visit a load-balanced VIP. Whenever we add the FTP service to ipvsadm and
begin load balancing to it, the problem begins to occur on all services.
However, it is not consistent. Some outgoing SYN+ACK packets will get
translated correctly for a certain period of time, then after a while some
packets will not be translated.
I do not think it is load related. We have other load balancers built from the
same image handling many more connections.
# ipvsadm -l -n
IP Virtual Server version 1.0.11 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.99.23.64:80 wlc persistent 300
-> 10.99.22.53:80 Masq 1 13 14
-> 10.99.22.58:80 Masq 1 15 3
-> 10.99.22.215:80 Masq 1 14 1
TCP 10.99.23.54:80 wlc persistent 300
TCP 10.99.23.51:80 wlc persistent 300
-> 10.99.22.199:80 Masq 1 30 6
-> 10.99.22.197:80 Masq 1 32 4
TCP 10.99.23.98:5061 wlc
-> 10.99.22.252:5061 Masq 1 0 0
-> 10.99.22.251:5061 Masq 1 0 0
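For cross-checking a table like this against a packet capture, here is a minimal sketch that turns `ipvsadm -l -n` output into a map from each virtual service to its real servers. It assumes the column layout shown above and only handles TCP and UDP entries (not fwmark services). The function name is made up for illustration.

```python
def parse_ipvsadm(output):
    """Map "PROTO vip:port" to the list of real server "addr:port" strings."""
    services = {}
    current = None
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] in ("TCP", "UDP"):
            # Start of a virtual service entry.
            current = f"{fields[0]} {fields[1]}"
            services[current] = []
        elif fields[0] == "->" and current is not None:
            # Real server row; the digit check skips the column-header row.
            if fields[1][0].isdigit():
                services[current].append(fields[1])
    return services
```

Any address from the resulting real-server lists that shows up as a source address on the client side of the director would be a reply that left without being translated back to the VIP.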
Here is the entry for the FTP service, which is not currently in the ipvsadm
table because of the problems it causes.
TCP 10.99.23.57:21 wlc
-> 10.99.22.208:21 Masq 1 0 0
-> 10.99.22.207:21 Masq 1 0 0
Because this is a production environment, I cannot make very many changes or
further test the FTP service. At the moment, we are not load balancing FTP
because of the problems it creates. I have tried to reproduce this in the lab
using an image of the production load balancer. Unfortunately, I've had no luck
getting the problem to occur in the lab. I do not have access to the web and
FTP servers, and that is preventing me from fully reproducing the production
environment. That may have an effect on the validity of the tests.
Any more ideas? Thanks!
Jari