Hi,
I'm actually running into an issue similar to yours, using direct routing and wlc:
after a while connections stop being load balanced and go to a single server.
A few months ago I upgraded our LVS infrastructure, which consists of
2 LVS servers and 10 web servers. I had no issues before, apart from CPU/memory
resources (the setup was 3 years old).
The load balancers now run Debian sarge, set up much like yours, with
the following:
- Ldirectord 1.2.3-9sarge4
- Heartbeat 1.2.3-9sarge4 , checking via a serial cable plus broadcast
on the internal lan
- Ipvsadm 1.24+1.21-1 (for ipvs_syncmaster and ipvs_syncbackup cluster
synchronization)
- Kernel 2.6.14 (non-Debian)
The setup works as follows (heartbeat calls ldirectord). Here is
the configuration:
checktimeout=6
checkinterval=3
autoreload=yes
logfile="local3"
quiescent=yes
# HTTP Virtual Service
virtual=213.x.y.z:80
    real=172.16.x.41:80 gate 10
    real=172.16.x.42:80 gate 10
    real=172.16.x.43:80 gate 10
    real=172.16.x.44:80 gate 10
    real=172.16.x.45:80 gate 10
    real=172.16.x.46:80 gate 10
    real=172.16.x.47:80 gate 18
    real=172.16.x.48:80 gate 18
    real=172.16.x.49:80 gate 25
    real=172.16.x.50:80 gate 25
    service=http
    virtualhost="domain.com"
    request="/.testpage"
    receive="Test Page"
    scheduler=wlc
    #persistent=600
    protocol=tcp
    checktype=negotiate
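For reference, here is a rough Python sketch of how I understand wlc to choose a real server given the weights above. It assumes the kernel scheduler's rule of counting one active connection as 256 inactive ones and picking the lowest load-to-weight ratio; the function names and the tuple layout are made up for illustration. Servers with weight 0 are skipped, which matters here because with quiescent=yes ldirectord sets a failed real server's weight to 0 instead of removing it.

```python
def wlc_overhead(active, inactive):
    # Assumption: wlc treats one active connection as 256 inactive ones.
    return active * 256 + inactive


def wlc_pick(servers):
    """Pick a server name from (name, weight, active, inactive) tuples.

    Hypothetical helper: returns the server with the lowest
    overhead/weight ratio, or None if no server has a positive weight.
    """
    best = None  # (name, weight, overhead) of the current candidate
    for name, weight, active, inactive in servers:
        if weight <= 0:
            continue  # quiesced server: receives no new connections
        load = wlc_overhead(active, inactive)
        # Compare load/weight ratios by cross-multiplying (no division).
        if best is None or load * best[1] < best[2] * weight:
            best = (name, weight, load)
    return best[0] if best else None
```

Under this rule a server showing 0 ActiveConn and 0 InActConn with a non-zero weight should be the next one chosen, so a server that stays at 0/0 is probably either at weight 0 or not being scheduled at all.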
While the issue is happening, ipvsadm -L -n shows 0 ActiveConn and 0
InActConn.
When it's not happening, each server has many connections, so 0 is not
plausible. For example, right now (at low traffic):
-> 172.16.x.50:www Route 25 619 3462
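To catch this earlier next time, something like the following Python sketch could flag real servers that show 0 ActiveConn and 0 InActConn. It only parses text in the layout of ipvsadm -L -n (address, forwarding method, weight, ActiveConn, InActConn); the function names are hypothetical and the connection counts in SAMPLE are invented for illustration, apart from the one line quoted above.

```python
import re

# Matches real-server lines such as:
#   -> 172.16.x.50:80   Route   25   619   3462
# The column-header line does not match because "Weight" is not numeric.
REAL_SERVER = re.compile(
    r"^\s*->\s+(?P<addr>\S+)\s+(?P<fwd>\S+)\s+(?P<weight>\d+)"
    r"\s+(?P<active>\d+)\s+(?P<inactive>\d+)\s*$"
)


def parse_real_servers(output):
    """Return one dict per real-server line found in ipvsadm-style text."""
    servers = []
    for line in output.splitlines():
        m = REAL_SERVER.match(line)
        if m:
            servers.append({
                "addr": m.group("addr"),
                "weight": int(m.group("weight")),
                "active": int(m.group("active")),
                "inactive": int(m.group("inactive")),
            })
    return servers


def idle_servers(servers):
    """Addresses of servers with no active and no inactive connections."""
    return [s["addr"] for s in servers
            if s["active"] == 0 and s["inactive"] == 0]


# Invented sample in the ipvsadm -L -n layout (not real output).
SAMPLE = """\
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  213.x.y.z:80 wlc
  -> 172.16.x.49:80               Route   25     0          0
  -> 172.16.x.50:80               Route   25     619        3462
"""
```

Calling idle_servers(parse_real_servers(SAMPLE)) lists the servers to look at; in production the text would come from running ipvsadm -L -n on the active director.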
I noticed that all the traffic was going to the same box because its logs were
filling quickly, and because stopping httpd on that box took the whole site
down.
Since the situation was urgent, I didn't have much time to
investigate further. To recover, I stopped and started
heartbeat; in the meantime the second load balancer would have taken
over and then handed the service back.
After that everything seemed normal.
A quick look through the logs didn't reveal anything strange (I
copied everything I could for further investigation), apart from the
following (one line only):
Redirect from 213.255.89.122 on eth0 about 213.255.89.128 ignored.
Advised path = 213.x.y.k (load2) -> 213.255.89.128, tos 00
ttyS0: 1 input overrun(s) (more of those)
As Jan said, any help is appreciated, and thanks for reading this
boring mail :D
(Which will hopefully be less boring if we find the cause of the problem)
--
Mathieu Massebœuf