
To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: [lvs-users] Problem: Occasionally all traffic gets sent to a single backend node
From: "kotobuki intl" <kbintl2@xxxxxxxxx>
Date: Fri, 18 Jan 2008 18:35:03 +0900
We have a number of LVS clusters set up and have been running them
successfully for a few years.  Occasionally, however, we notice that
the load balancing plays up and the traffic becomes unbalanced: all
requests start getting sent to a single backend server and the other
backend servers get no traffic.

It doesn't matter which scheduler is used (i.e. rr, wrr, lc, wlc); we
still occasionally see this pattern.  It's also not the monitoring
script taking the backend nodes out of the IPVS table (a quick check
for that is sketched below).  Our LVS directors run CentOS 4.6 and
OpenSUSE 10.1 (unrelated clusters), but we see it happen on both.
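
The check is only a sketch (the awk field number assumes the standard
"ipvsadm -ln" layout shown further down, and the expected count of 20
is our 5 VIPs x 4 real servers, so adjust to taste):

#!/bin/sh
# Count real servers still present in the IPVS table with a nonzero
# weight ($1 is "->" and $4 is the Weight column in ipvsadm -ln).
EXPECTED=20
ACTUAL=$(ipvsadm -ln | awk '$1 == "->" && $4 > 0' | wc -l)
[ "$ACTUAL" -eq "$EXPECTED" ] || \
    logger -t lvs-check "only $ACTUAL of $EXPECTED real servers active"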

The following URL [
http://img208.imageshack.us/my.php?image=lvsunbalancedsf2.gif ] has a
graph of the backend servers' monitoring; you can see that at about
2:00 traffic from three of the backend servers drops off to nothing
and "server 4" starts handling all the traffic.  Whilst this was
occurring, ipvsadm gave the following output:

$ ipvsadm -ln
IP Virtual Server version 1.2.0 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  xxx.xxx.xxx.102:80 rr
  -> xxx.xxx.xxx.9:80              Route   100    0          0
  -> xxx.xxx.xxx.10:80             Route   100    0          0
  -> xxx.xxx.xxx.11:80             Route   100    0          0
  -> xxx.xxx.xxx.12:80             Route   100    0          0
TCP  xxx.xxx.xxx.103:80 rr
  -> xxx.xxx.xxx.12:80             Route   100    0          0
  -> xxx.xxx.xxx.11:80             Route   100    0          0
  -> xxx.xxx.xxx.10:80             Route   100    0          0
  -> xxx.xxx.xxx.9:80              Route   100    0          0
TCP  xxx.xxx.xxx.100:80 rr
  -> xxx.xxx.xxx.9:80              Route   100    0          0
  -> xxx.xxx.xxx.10:80             Route   100    0          0
  -> xxx.xxx.xxx.11:80             Route   100    0          0
  -> xxx.xxx.xxx.12:80             Route   100    0          0
TCP  xxx.xxx.xxx.101:80 rr
  -> xxx.xxx.xxx.12:80             Route   100    1          1
  -> xxx.xxx.xxx.11:80             Route   100    6          0
  -> xxx.xxx.xxx.10:80             Route   100    1          3
  -> xxx.xxx.xxx.9:80              Route   100    1          0
TCP  xxx.xxx.xxx.104:80 rr
  -> xxx.xxx.xxx.9:80              Route   100    0          0
  -> xxx.xxx.xxx.10:80             Route   100    0          0
  -> xxx.xxx.xxx.11:80             Route   100    0          0
  -> xxx.xxx.xxx.12:80             Route   100    0          0

The connection columns (ActiveConn/InActConn) have pretty much dropped
to zero, even though traffic is still being served on the site.  Note
that server 4 in the graph corresponds to xxx.xxx.xxx.12.
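
(For what it's worth, the live connection entries can be dumped too,
to see where they actually point.  Something along the lines of:

$ ipvsadm -lcn | awk 'NR > 2 {print $6}' | sort | uniq -c

counts entries per destination; the same data is also available in
/proc/net/ip_vs_conn.)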

To get around this problem we run a script which monitors the VIP: if
HTTP requests take too long (due to the single backend node being
overloaded), we tell heartbeat to go into standby and swap the VIPs
over to the other LVS director, which then causes everything to be
rebalanced.  The script is sketched below.
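
Roughly (the VIP, the 10-second threshold and the hb_standby path are
placeholders here, not our production values):

#!/bin/sh
# If an HTTP fetch through the VIP takes longer than MAX_SECS, tell
# heartbeat to stand by so the peer director takes over the VIPs.
VIP=xxx.xxx.xxx.100            # one of the VIPs above
MAX_SECS=10                    # placeholder threshold
if ! curl -s -o /dev/null --max-time "$MAX_SECS" "http://$VIP/"; then
    logger -t vip-watchdog "VIP $VIP slow or unresponsive; failing over"
    /usr/lib/heartbeat/hb_standby   # path varies by distro/version
fi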

Anyhow, does anyone else have this problem, and ideally a fix?


Thanks,
Paul

