Hello,
I'm actually in a similar situation to yours - using direct routing and wlc:
after a while connections stop being load balanced and go to a single server.
Which kernel was this? What's the timeframe for this to happen, roughly?
A few months ago I upgraded our LVS infrastructure, which is made up of
2 LVS servers and 10 web servers - I had no issues before except CPU/memory
resources (the setup was 3 years old).
The load balancers now run Debian sarge in a way similar to yours, with
the following:
- Ldirectord 1.2.3-9sarge4
- Heartbeat 1.2.3-9sarge4, checking via a serial cable plus broadcast
on the internal LAN
- Ipvsadm 1.24+1.21-1 (for ipvs_syncmaster and ipvs_syncbackup cluster
synchronization)
- Kernel 2.6.14 (non-Debian)
Nice setup.
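Side note: for the ipvs_syncmaster/ipvs_syncbackup part, I assume the sync
daemons are started roughly like this (the interface name is a guess on my
part, adjust to your internal LAN):

ipvsadm --start-daemon master --mcast-interface eth1   # on the active director
ipvsadm --start-daemon backup --mcast-interface eth1   # on the standby director
ipvsadm -L --daemon                                    # check the daemons are up

If that matches what you do, fine; if not, please tell us how you start them.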
The setup is done the following way (heartbeat calling ldirectord); here is
the conf:
checktimeout=6
checkinterval=3
autoreload=yes
logfile="local3"
quiescent=yes
# HTTP Virtual Service
virtual=213.x.y.z:80
    real=172.16.x.41:80 gate 10
    real=172.16.x.42:80 gate 10
    real=172.16.x.43:80 gate 10
    real=172.16.x.44:80 gate 10
    real=172.16.x.45:80 gate 10
    real=172.16.x.46:80 gate 10
    real=172.16.x.47:80 gate 18
    real=172.16.x.48:80 gate 18
    real=172.16.x.49:80 gate 25
    real=172.16.x.50:80 gate 25
    service=http
    virtualhost="domain.com"
    request="/.testpage"
    receive="Test Page"
    scheduler=wlc
    #persistent=600
    protocol=tcp
    checktype=negotiate
Ok, so this is plain LVS_DR without persistency. Hmm, you said you were
having problems with CPU and memory. How did this manifest itself?
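For my own reading: that conf should boil down to more or less the following
ipvsadm rules (just a sketch, only the first two and the last real server
shown, weights as in your conf):

ipvsadm -A -t 213.x.y.z:80 -s wlc                       # virtual service, wlc scheduler
ipvsadm -a -t 213.x.y.z:80 -r 172.16.x.41:80 -g -w 10   # -g = direct routing (gate)
ipvsadm -a -t 213.x.y.z:80 -r 172.16.x.42:80 -g -w 10
ipvsadm -a -t 213.x.y.z:80 -r 172.16.x.50:80 -g -w 25
# ... and so on for the remaining real servers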
When the issue is happening, ipvsadm -L -n outputs 0 ActiveConn and 0
InActConn
Gulp. For all RS? Setting the values to zero happens only when:
a) a new RS is added (maybe previously administratively removed)
b) an RS is quiesced; after a certain amount of time the counters are zero
Could you please give us more output when it happens again?
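For reference, with quiescent=yes ldirectord does not remove a failed RS, it
only sets its weight to 0, which is more or less equivalent to this (sketch,
addresses taken from your conf):

ipvsadm -e -t 213.x.y.z:80 -r 172.16.x.41:80 -g -w 0    # quiesce: no new connections

so a quiesced RS stays visible in ipvsadm -L -n with weight 0 while its
connection counters slowly drop to zero - that is case b) above.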
When it's not happening, each server has a lot of connections; 0 is not
possible. For example, right now (which is low traffic):
-> 172.16.x.50:www Route 25 619 3462
I noticed all the traffic was going to the same box, as the logs were
filling quickly - and as stopping httpd on that box made the whole site
go down.
:) Not a nice way to wake up. But we need some more in-situ information.
So next time, please collect the output of:
ipvsadm -L -n
dmesg
tcpdump
logfiles related to your setup
and if possible, enable vs_debug and dump the kernel log output somewhere.
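Something along these lines should do (the debug_level knob only exists if
IP_VS was compiled with CONFIG_IP_VS_DEBUG, so treat that part as optional):

ipvsadm -L -n > ipvs-state.txt
dmesg > dmesg.txt
tcpdump -n -i eth0 -s 0 -w lvs-issue.pcap port 80 &     # capture while it happens
echo 9 > /proc/sys/net/ipv4/vs/debug_level              # and 0 to switch it off again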
Considering I was in an urgent situation, I didn't have much time to
investigate further - what I did to get back up was a stop/start of
heartbeat; in the meantime the second load balancer took over and then
handed control back.
After that everything seemed normal.
Strange. Looks like some kind of soft deadlock.
A quick investigation of the logs didn't reveal anything strange (I
copied everything I could for further investigation), apart from the
following (one line only):
Redirect from 213.255.89.122 on eth0 about 213.255.89.128 ignored.
Advised path = 213.x.y.k (load2) -> 213.255.89.128, tos 00
Ahh, so you have NOTRACK enabled? And someone is doing funky routing
tricks on your collision domain. What are your icmp related proc-fs
settings?
grep . /proc/sys/net/ipv4/icmp*
grep . /proc/sys/net/ipv4/conf/{all,eth0}/*
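And if those redirects turn out to be noise from another router on the same
segment, you can simply stop accepting them per interface - a sketch,
assuming eth0 is the external side:

echo 0 > /proc/sys/net/ipv4/conf/all/accept_redirects
echo 0 > /proc/sys/net/ipv4/conf/eth0/accept_redirects
echo 0 > /proc/sys/net/ipv4/conf/all/secure_redirects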
ttyS0: 1 input overrun(s) (more of those)
Is this your heartbeat?
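Those overruns would worry me a little. For comparison, a serial plus
broadcast heartbeat in ha.cf usually looks something like this (device,
timings and node names - apart from load2 - are assumptions on my part):

serial    /dev/ttyS0
baud      19200
bcast     eth1
keepalive 2
deadtime  10
node      load1 load2

Please post your ha.cf so we can compare.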
As Jan said, any help is appreciated, and thanks for reading this
boring mail :D
(Which will hopefully be less boring if we find the cause of the problem)
This is of course not boring. Please share some more of your logs if you
still have them, especially heartbeat log entries.
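Since ldirectord logs to the local3 facility in your conf, something like
this should pull out the interesting lines (the log file paths are a guess,
adjust to your syslog.conf):

grep -Ei 'ldirectord|heartbeat|ttyS0' /var/log/syslog /var/log/ha-log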
Best regards,
Roberto Nibali, ratz
--
echo
'[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc