Erroneous RSTs and temporary intermittent loss of service

To:	lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject:	Erroneous RSTs and temporary intermittent loss of service
Cc:	Arthur Bergman <abergman@xxxxxxxxxxx>
From:	James Bromberger <jbromberger@xxxxxxxxxxx>
Date:	Thu, 20 May 2004 18:08:46 +0100

Hello all,

We've been running LVS (with Piranha) for several years now, and in the last six months or so we've started to see problems. Our attempts to diagnose them are pointing more and more at LVS.

Our configuration is reasonably simple: we have a pair of hosts (web servers) on a private network, and a pair of gateway boxes (HA, with pulse between them) running LVS and Piranha. Service on port 80 uses WLC scheduling with dynamic tuning by Piranha. When one load balancer is active, it has virtual interfaces up on the external physical interface (one per VIP), and one virtual interface on the internal interface (its outbound router IP for the private network). If one load balancer is dead, pulse on the second LB host brings up the virtual interfaces required to operate.

Our symptoms are that, every so often, all connections through the host running as the load balancer hang, no new connections get through, and eventually connections time out. After about 1 - 2 minutes, service returns. We aren't seeing failover between the LB hosts (pulse)...

We've done some packet analysis on the internal and external interfaces of the load balancer host, and can see a spurious set of SYNs, ACKs, and RSTs being thrown about. Before looking at the event in order, I'll outline the network:



xxx.xxx.72.115 is a client browser
xxx.xxx.72.2 is a load balancer's real IP that it keeps
xxx.xxx.72.3 is the HA standby load balancer's real IP that it keeps

xxx.xxx.72.70 is a VIP for our web service, floating between the above 2 hosts


xxx.xxx.0.70 is a web server on a private network
xxx.xxx.0.92 is another web server on the private network

So, in normal practice, 72.115 (client) talks to 72.70 (which is really host 72.2), which NATs to either 0.70 or 0.92. During these problems, we see 72.115 talk to 72.70 (SYN), and then 72.2 reply to 72.115 (ACK), which is odd as far as 72.115 is concerned because it's not trying to talk to 72.2; it wanted 72.70. So it sends a RST to 72.2, and tries again with 72.70. However, the LB also sends the RST addressed to 72.2 on to the real server.
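To make the sequence above easier to spot in a capture, here is a minimal Python sketch that scans a list of already-decoded packets and flags replies that come from a load balancer's real IP to a client that had sent its SYN to the VIP. The `Packet` type, the function name, and the sample trace are hypothetical and only mirror the addresses used above (abbreviated to their last two octets); it does not parse tcpdump output itself.

```python
from typing import List, NamedTuple


class Packet(NamedTuple):
    src: str
    dst: str
    flags: str  # TCP flags as letters, e.g. "S", "A", "SA", "R"


# Abbreviated addresses from the description above (assumed labels).
VIP = "72.70"
LB_REAL_IPS = {"72.2", "72.3"}


def find_misaddressed_replies(packets: List[Packet]) -> List[Packet]:
    """Return packets sent from a LB real IP to a client that SYNed the VIP."""
    pending = set()  # clients that have sent a bare SYN to the VIP
    suspicious = []
    for p in packets:
        if p.dst == VIP and "S" in p.flags and "A" not in p.flags:
            pending.add(p.src)
        elif p.src in LB_REAL_IPS and p.dst in pending:
            # The client expected an answer from the VIP, not the real IP.
            suspicious.append(p)
    return suspicious


# Hypothetical trace reproducing the failure sequence described above.
trace = [
    Packet("72.115", "72.70", "S"),  # client SYN to the VIP
    Packet("72.2", "72.115", "A"),   # reply from the LB's real IP
    Packet("72.115", "72.2", "R"),   # client resets the unexpected peer
    Packet("72.115", "72.70", "S"),  # client retries the VIP
]

for p in find_misaddressed_replies(trace):
    print("unexpected source:", p.src, "->", p.dst, p.flags)
```

In a healthy trace, where replies to the client carry the VIP as their source address, the function returns an empty list.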

When operating correctly, using ipvsadm we see about 10 active and several hundred inactive connections; however, when this problem happens, we get zero active and very few inactive connections.
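One way to catch the moment the counters collapse would be to total the ActiveConn and InActConn columns from saved `ipvsadm -L -n` output. The sketch below assumes the usual six-column real-server rows that begin with `->`; the sample text, its addresses and version string, and the `min_inactive` threshold are made up for illustration.

```python
from typing import Tuple


def total_connections(ipvsadm_output: str) -> Tuple[int, int]:
    """Sum ActiveConn and InActConn over all real-server rows."""
    active = inactive = 0
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        # Real-server rows: -> addr:port Forward Weight ActiveConn InActConn
        # The column-header row also starts with "->" but is not numeric.
        if (len(fields) == 6 and fields[0] == "->"
                and fields[4].isdigit() and fields[5].isdigit()):
            active += int(fields[4])
            inactive += int(fields[5])
    return active, inactive


def looks_stalled(active: int, inactive: int, min_inactive: int = 50) -> bool:
    """Hypothetical heuristic: no active and very few inactive connections."""
    return active == 0 and inactive < min_inactive


# Illustrative sample only; addresses and counts are placeholders.
SAMPLE = """\
IP Virtual Server version 1.0.8 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.0.2.70:80 wlc
  -> 10.0.0.70:80                 Masq    1      6          210
  -> 10.0.0.92:80                 Masq    1      4          190
"""

active, inactive = total_connections(SAMPLE)
print(active, inactive, looks_stalled(active, inactive))
```

Run periodically against fresh output with timestamps logged, this could help line up the stalls with the packet captures.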

We haven't enabled drop_rate, drop_packet, or secure_tcp, and I am wondering if I should enable one (or more) of these.

How can I debug further what is going on, or can anyone recognise/guess what is happening?


More information or tcpdumps on request.

Regards,
 James Bromberger
