Hello all,
We've been running LVS (with Piranha) for several years now, and in the
last six months or so we've started to see problems. Our attempts to
diagnose them are pointing more and more at LVS.
Our configuration is reasonably simple: we have a pair of hosts (web
servers) on a private network, and a pair of gateway boxes (HA, pulse
between them) running LVS and Piranha. The service on port 80 uses WLC
scheduling with dynamic tuning by Piranha. When one load balancer is active, it has
virtual interfaces up on the external physical interface (one per VIP),
and one virtual interface on the internal (its outbound router IP for
the private network). If one load balancer is dead, pulse on the second
LB host brings up the virtual interfaces required to operate.
Our symptoms are that, every so often, all connections through the host
running as the load balancer hang, no new connections get through, and
eventually connections time out. After about 1-2 minutes, service
returns. We aren't seeing failover between the LB hosts (pulse).
We've done some packet analysis on the internal and external interfaces
of the load balancer host, and can see a spurious set of SYNs, ACKs,
and RSTs being thrown about. Before going through the event in order,
I'll outline the network:
xxx.xxx.72.115 is a client browser
xxx.xxx.72.2 is the active load balancer's own real IP, which it always keeps
xxx.xxx.72.3 is the HA standby load balancer's own real IP, which it always keeps
xxx.xxx.72.70 is a VIP for our web service, floating between the above
two hosts
xxx.xxx.0.70 is a web server on a private network
xxx.xxx.0.92 is another web server on the private network
So, in normal practice, 72.115 (client) talks to 72.70 (which is really
host 72.2), which NATs to either 0.70 or 0.92. During these problems,
we see 72.115 talk to 72.70 (SYN), and then 72.2 reply to 72.115 (ACK),
which is odd as far as 72.115 is concerned because it's not trying to
talk to 72.2; it wanted 72.70. So it sends a RST to 72.2 and tries
again with 72.70. The LB, however, also sends the RST addressed to 72.2
to the realserver.
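
To make the anomaly easier to spot in a long capture, here is a minimal sketch of a check for it. It assumes the packets have already been parsed into (src, dst, flags) tuples, which is not a tcpdump format; the function name and the sample trace are hypothetical, and the addresses use the shorthand from the description above.

```python
# Short address labels follow the shorthand used above (e.g. "72.70").
VIP = "72.70"          # virtual IP the client connects to
DIRECTOR_RIP = "72.2"  # real IP of the active load balancer


def find_misaddressed_replies(packets, vip=VIP, director_rip=DIRECTOR_RIP):
    """Return replies sent from the director's real IP to a client
    that had only sent a SYN to the VIP.

    `packets` is an ordered list of (src, dst, flags) tuples; this
    pre-parsed shape is an assumption made for the sketch.
    """
    pending = set()   # clients with an outstanding SYN to the VIP
    suspicious = []
    for index, (src, dst, flags) in enumerate(packets):
        if dst == vip and flags == "SYN":
            pending.add(src)
        elif src == vip and dst in pending:
            # Reply came from the VIP, as the client expects.
            pending.discard(dst)
        elif src == director_rip and dst in pending:
            # Reply came from the director's own address instead of the VIP.
            suspicious.append((index, src, dst, flags))
    return suspicious


# Hypothetical trace mirroring the sequence described above.
trace = [
    ("72.115", "72.70", "SYN"),
    ("72.2", "72.115", "ACK"),
    ("72.115", "72.2", "RST"),
    ("72.115", "72.70", "SYN"),
]
print(find_misaddressed_replies(trace))
```

Running something like this over captures from both the external and internal interfaces would show whether the wrongly sourced replies appear only during the hangs or also at other times.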
When operating correctly, using ipvsadm we see about 10 active and
several hundred inactive connections. When this problem happens,
however, we get zero active and very few inactive connections.
We haven't enabled drop_rate, drop_packet, or secure_tcp, and I am
wondering if I should enable one (or more) of these.
How can I debug further what is going on, or can anyone recognise/guess
what is happening?
More information or tcpdumps on request.
Regards,
James Bromberger