Erroneous RSTs and temporary intermittent loss of service

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Erroneous RSTs and temporary intermittent loss of service
Cc: Arthur Bergman <abergman@xxxxxxxxxxx>
From: James Bromberger <jbromberger@xxxxxxxxxxx>
Date: Thu, 20 May 2004 18:08:46 +0100
Hello all,

We've been running LVS (with Piranha) for several years now, and in the last six months or so we've started to see problems. Our attempts to diagnose them are pointing more and more at LVS.


Our configuration is reasonably simple: we have a pair of hosts (web servers) on a private network, and a pair of gateway boxes (HA, with pulse between them) running LVS and Piranha. Service on port 80 uses WLC (weighted least-connection) scheduling, with dynamic weight tuning by Piranha. When one load balancer is active, it has virtual interfaces up on the external physical interface (one per VIP), and one virtual interface on the internal interface (its outbound router IP for the private network). If one load balancer is dead, pulse on the second LB host brings up the virtual interfaces required to operate.

Our symptoms are that, every so often, all connections through the host running as the load balancer hang, no new connections get through, and eventually connections time out. After about 1 - 2 minutes, service returns. We aren't seeing failover between the LB hosts (pulse)...

We've done some packet analysis on the internal and external interfaces of the load balancer host, and can see a spurious set of SYNs, ACKs, and RSTs being thrown about. Before walking through the events in order, I'll outline the network:


xxx.xxx.72.115 is a client browser
xxx.xxx.72.2 is one load balancer's real IP, which it always keeps
xxx.xxx.72.3 is the HA standby load balancer's real IP, which it always keeps

xxx.xxx.72.70 is a VIP for our web service, floating between the above 2 hosts

xxx.xxx.0.70 is a web server on a private network
xxx.xxx.0.92 is another web server on the private network


So, in normal practice, 72.115 (client) talks to 72.70 (which is really host 72.2), which NATs to either 0.70 or 0.92. During these problems, we see 72.115 talk to 72.70 (SYN), and then 72.2 reply to 72.115 (ACK). That is odd as far as 72.115 is concerned, because it's not trying to talk to 72.2; it wanted 72.70. So it sends a RST to 72.2 and tries again with 72.70. However, the LB also forwards the RST addressed to 72.2 on to the realserver.
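
For anyone wanting to look for the same pattern in their own captures, here is a minimal sketch of the check described above: flag any packet sent to the client from an address the client never contacted. The `Packet` tuple, the function name, and the 192.0.2.x addresses are illustrative assumptions (the real addresses are masked in this post), and the trace is hand-written, not taken from our tcpdumps.

```python
from typing import List, NamedTuple


class Packet(NamedTuple):
    """Hypothetical simplified view of one captured TCP packet."""
    src: str
    dst: str
    flags: str  # e.g. "SYN", "ACK", "RST"


def find_unexpected_replies(packets: List[Packet], client: str) -> List[Packet]:
    """Return packets sent to `client` from an address it never contacted."""
    contacted = set()
    unexpected = []
    for pkt in packets:
        if pkt.src == client:
            # Remember every address the client has tried to talk to.
            contacted.add(pkt.dst)
        elif pkt.dst == client and pkt.src not in contacted:
            # A reply from an address the client never addressed.
            unexpected.append(pkt)
    return unexpected


# Placeholder addresses standing in for the client, the VIP and the
# load balancer's real IP (assumed values, not the masked originals).
CLIENT, VIP, LB_REAL = "192.0.2.115", "192.0.2.70", "192.0.2.2"

# Hand-written trace mirroring the faulty sequence described above.
trace = [
    Packet(CLIENT, VIP, "SYN"),      # client opens a connection to the VIP
    Packet(LB_REAL, CLIENT, "ACK"),  # reply arrives from the LB's real IP
    Packet(CLIENT, LB_REAL, "RST"),  # client resets the unknown peer
    Packet(CLIENT, VIP, "SYN"),      # client retries against the VIP
]

bad = find_unexpected_replies(trace, CLIENT)
for pkt in bad:
    print(f"unexpected {pkt.flags} from {pkt.src} to {pkt.dst}")
```

In a healthy NAT setup the reply would carry the VIP as its source address, so the list would come back empty.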

When operating correctly, ipvsadm shows about 10 active and several hundred inactive connections. When this problem happens, however, we get zero active and very few inactive connections.
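
One way to catch the moment this happens would be to total the connection counters periodically. The sketch below is only an illustration: it assumes the usual `ipvsadm -L -n` listing in which each realserver line starts with `->` and ends with the ActiveConn and InActConn columns, and the sample text, addresses and counts are made up rather than copied from our systems.

```python
from typing import Tuple


def total_connections(ipvsadm_output: str) -> Tuple[int, int]:
    """Sum the ActiveConn and InActConn columns over all realserver lines."""
    active = inactive = 0
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        # Realserver lines: -> addr:port Forward Weight ActiveConn InActConn
        # The column-header line also starts with "->", but its last two
        # fields are not numeric, so it is skipped here.
        if (len(fields) == 6 and fields[0] == "->"
                and fields[4].isdigit() and fields[5].isdigit()):
            active += int(fields[4])
            inactive += int(fields[5])
    return active, inactive


# Made-up sample in the assumed layout (not real output from our hosts).
SAMPLE = """\
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.0.2.70:80 wlc
  -> 10.0.0.70:80                 Masq    1      6          250
  -> 10.0.0.92:80                 Masq    1      4          150
"""

active, inactive = total_connections(SAMPLE)
if active == 0:
    print("no active connections: possible stall")
else:
    print(f"{active} active, {inactive} inactive")
```

Run from cron or a loop against live output, a check like this could timestamp each stall so it can be lined up with the packet captures.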

We haven't enabled drop_rate, drop_packet, or secure_tcp, and I am wondering if I should enable one (or more) of these.

How can I debug further what is going on, or can anyone recognise/guess what is happening?

More information or tcpdumps on request.

Regards,
 James Bromberger
