Hello,
I'm actually in a similar situation to yours - using direct routing and wlc:
after a while connections stop being load balanced and go to a single server.
Which kernel was this? What's the timeframe for this to happen, roughly?
A few months ago I upgraded our LVS infrastructure, which is made up of
2 LVS servers and 10 web servers - I had no issues before except CPU/memory
resources (the setup was 3 years old).
The load balancers now run Debian sarge in a way similar to yours, with
the following:
- Ldirectord 1.2.3-9sarge4
- Heartbeat 1.2.3-9sarge4, checking via a serial cable plus broadcast
on the internal LAN
- Ipvsadm 1.24+1.21-1 (for ipvs_syncmaster and ipvs_syncbackup cluster
synchronization)
- Kernel 2.6.14 (non-Debian)
Nice setup.
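Side note: for the ipvs_syncmaster/ipvs_syncbackup part, I assume the sync
daemons are started roughly like this (the interface name is a guess on my
part, adjust to your internal LAN):

ipvsadm --start-daemon master --mcast-interface eth1   # on the active director
ipvsadm --start-daemon backup --mcast-interface eth1   # on the standby director
ipvsadm -L --daemon                                    # check the daemons are up

If that matches what you do, fine; if not, please tell us how you start them.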
The setup is done the following way (heartbeat calling ldirectord); here is
the conf:
checktimeout=6
checkinterval=3
autoreload=yes
logfile="local3"
quiescent=yes
# HTTP Virtual Service
virtual=213.x.y.z:80
    real=172.16.x.41:80 gate 10
    real=172.16.x.42:80 gate 10
    real=172.16.x.43:80 gate 10
    real=172.16.x.44:80 gate 10
    real=172.16.x.45:80 gate 10
    real=172.16.x.46:80 gate 10
    real=172.16.x.47:80 gate 18
    real=172.16.x.48:80 gate 18
    real=172.16.x.49:80 gate 25
    real=172.16.x.50:80 gate 25
    service=http
    virtualhost="domain.com"
    request="/.testpage"
    receive="Test Page"
    scheduler=wlc
    #persistent=600
    protocol=tcp
    checktype=negotiate
Ok, so this is plain LVS_DR without persistency. Hmm, you said you were
having problems with CPU and memory. How did this manifest itself?
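For my own reading: that conf should boil down to more or less the following
ipvsadm rules (just a sketch, only the first two and the last real server
shown, weights as in your conf):

ipvsadm -A -t 213.x.y.z:80 -s wlc                       # virtual service, wlc scheduler
ipvsadm -a -t 213.x.y.z:80 -r 172.16.x.41:80 -g -w 10   # -g = direct routing (gate)
ipvsadm -a -t 213.x.y.z:80 -r 172.16.x.42:80 -g -w 10
ipvsadm -a -t 213.x.y.z:80 -r 172.16.x.50:80 -g -w 25
# ... and so on for the remaining real servers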
When the issue is happening, ipvsadm -L -n outputs 0 ActiveConn and 0
InActConn
Gulp. For all RS? Setting the values to zero happens only when:
a) a new RS is added (maybe previously administratively removed)
b) an RS is quiesced; after a certain amount of time the counters are zero
Could you please give us more output when it happens again?
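For reference, with quiescent=yes ldirectord does not remove a failed RS, it
only sets its weight to 0, which is more or less equivalent to this (sketch,
addresses taken from your conf):

ipvsadm -e -t 213.x.y.z:80 -r 172.16.x.41:80 -g -w 0    # quiesce: no new connections

so a quiesced RS stays visible in ipvsadm -L -n with weight 0 while its
connection counters slowly drop to zero - that is case b) above.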
When it's not happening, each server has a lot of connections; 0 is not
possible. For example, right now (which is low traffic):
-> 172.16.x.50:www Route 25 619 3462
I noticed all the traffic was going to the same box, as the logs were
filling quickly - and as stopping httpd on that box made the whole site
go down.
:) Not a nice way to wake up. But we need some more in-situ information.
So next time, please collect the output of:
ipvsadm -L -n
dmesg
tcpdump
logfiles related to your setup
and if possible, enable vs_debug and dump the kernel log output somewhere.
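Something along these lines should do (the debug_level knob only exists if
IP_VS was compiled with CONFIG_IP_VS_DEBUG, so treat that part as optional):

ipvsadm -L -n > ipvs-state.txt
dmesg > dmesg.txt
tcpdump -n -i eth0 -s 0 -w lvs-issue.pcap port 80 &     # capture while it happens
echo 9 > /proc/sys/net/ipv4/vs/debug_level              # and 0 to switch it off again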
Considering I was in an urgent situation, I didn't have much time to
investigate further - what I did to get back up was a stop/start of
heartbeat; in the meantime the second load balancer took over and then
handed control back.
After that everything seemed normal.
Strange. Looks like some kind of soft deadlock.
A quick investigation of the logs didn't reveal anything strange (I
copied everything I could for further investigation), apart from the
following (one line only):
Redirect from 213.255.89.122 on eth0 about 213.255.89.128 ignored.
Advised path = 213.x.y.k (load2) -> 213.255.89.128, tos 00
Ahh, so you have NOTRACK enabled? And someone is doing funky routing
tricks on your collision domain. What are your icmp related proc-fs
settings?
grep . /proc/sys/net/ipv4/icmp*
grep . /proc/sys/net/ipv4/conf/{all,eth0}/*
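And if those redirects turn out to be noise from another router on the same
segment, you can simply stop accepting them per interface - a sketch,
assuming eth0 is the external side:

echo 0 > /proc/sys/net/ipv4/conf/all/accept_redirects
echo 0 > /proc/sys/net/ipv4/conf/eth0/accept_redirects
echo 0 > /proc/sys/net/ipv4/conf/all/secure_redirects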
ttyS0: 1 input overrun(s) (more of those)
Is this your heartbeat?
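Those overruns would worry me a little. For comparison, a serial plus
broadcast heartbeat in ha.cf usually looks something like this (device,
timings and node names - apart from load2 - are assumptions on my part):

serial    /dev/ttyS0
baud      19200
bcast     eth1
keepalive 2
deadtime  10
node      load1 load2

Please post your ha.cf so we can compare.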
As Jan said, any help is appreciated, and thanks for reading this
boring mail :D
(Which will hopefully be less boring if we find the cause of the problem)
This is of course not boring. Please share some more of your logs if you
still have them, especially heartbeat log entries.
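Since ldirectord logs to the local3 facility in your conf, something like
this should pull out the interesting lines (the log file paths are a guess,
adjust to your syslog.conf):

grep -Ei 'ldirectord|heartbeat|ttyS0' /var/log/syslog /var/log/ha-log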
Best regards,
Roberto Nibali, ratz
--
echo
'[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc