Re(2): LVS stops balancing after a while

To: <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re(2): LVS stops balancing after a while
From: Mathieu Massebœuf <mathieu.masseboeuf@xxxxxxxxxxxx>
Date: Tue, 7 Feb 2006 13:50:32 +0100

Hi,

>Which kernel was this? What's the timeframe for this to happen, roughly?
The kernel is 2.6.14 patched with the ck5 server performance patches (from
<http://ck.kolivas.org/patches/2.6/2.6.14/2.6.14-ck5/>)

>> Setup is done the following way (heartbeat calling ldirectord), here is
>> the conf :
>Ok, so this is plain LVS_DR without persistency. Hmm, you said you were 
>having problems with CPU and memory. How did this manifest itself?
The setup was running on old Celeron boxes with 256 MB of RAM.
Under heavy usage the boxes started to swap (low memory) and the load was
climbing sky-high.
The spare load balancer ended up taking over the shared IP and did not die
itself because it was on a smaller link (shared 100 Mbit instead of
dedicated 100 Mbit).

The two load balancers are now running on Xeon 2.8 GHz boxes with 512 MB of
RAM (I may add more if I see swap being used).
The highest input I have seen so far is 230 Mbit/s (the link is gigabit, and
the Ethernet cards are Intel PRO/1000 - Intel Corp. 82541GI/PI Gigabit
Ethernet Controller (rev 05) to be precise).

>> When the issue is happening, ipvsadm -L -n outputs 0 ActiveConn and 0
>> InActConn
>Gulp. For all RS? Setting the values to zero happens only, when:
>a) A new RS is added (maybe previously administratively removed)
>b) A RS is quiesced, after a certain amount of time the counter are zero
>Could you please give us more output when it happens again?
As you said, Gulp :)

The LVS had been running for about 60 days.
The issue happened last weekend, on Saturday around 8:30 PM (GMT+1).
The web server getting all the traffic had its log partition fill up
quickly, and unfortunately only level-1 support was around and they didn't
understand what was going on - all they did was compress / back up the
logs and wonder why everything was slow.

I "fixed" the issue monday morning when I became aware of it (longest
issue ever for us ...).
Once logged on the LVS Active and Inactive connextions were set to 0,
even for the web server getting the traffic (web3).
I checked web3 (213.x.y.43 / 172.16.x.43) - which was behaving normally.
Stopping httpd on it made the whole site go down. Restarting it up and
it was up again.

After thinking about it, this looks like an ARP issue, with the web server
having taken over the virtual IP address.
That would explain the zeros on the load balancer.
The strange thing is that this server is one of those that has been running
for 4 years (we only added 2 new ones) - and it has never caused issues.
I'm trying to figure out where that bad ARP entry came from.
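
For the record, next time this happens I will check from another box on the
same segment which MAC is answering for the VIP (something along these
lines; eth0 and <VIP> are placeholders for the real interface and virtual
address):

  # ask on the wire who holds the VIP; the replying MAC should belong to
  # the active load balancer, not to one of the real servers
  arping -I eth0 -c 3 <VIP>
  # and compare with what the neighbours have cached
  arp -n | grep <VIP>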

>> When it's not happening, each server have a lot of connections, 0 is not
>> possible, for example right now (which is low traffic) :
>>   -> 172.16.x.50:www  Route   25   619   3462
>> I noticed all the traffic was going to the same box as the logs were
>> filling quickly - and as stopping httpd on that box made the whole site
>> to go down.
>
>:) Not a nice way to wake up. But we need some more in-situ information. 
>So next time, please collect the output of:
>
>ipvsadm -L -n
As I said, nothing particular regarding the weights, but the Active and
Inactive connection counts were all at 0.
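
Next time I will also grab the packet counters, which should show whether
the director is still forwarding anything even though the connection
counters read zero (just a sketch, options as I remember them from the
ipvsadm man page):

  ipvsadm -L -n            # connection counts per real server
  ipvsadm -L -n --stats    # cumulative packet / byte counters
  ipvsadm -L -n --rate     # current packets/s and bytes/s
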
>dmesg
I have this - but I haven't seen anything particular in it so far (except
the Redirects and the ttyS0 messages).
>tcpdump
I don't have that; I should have run it to at least know whether traffic
was coming in (ARP issue or not).
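
For next time, a capture along these lines should at least tell whether
requests still reach the director and who is answering ARP for the VIP
(eth0 and <VIP> are placeholders again):

  # watch ARP traffic on the external interface, including MAC addresses
  tcpdump -n -e -i eth0 arp
  # check whether client traffic to the VIP still hits the director
  tcpdump -n -i eth0 host <VIP> and port 80
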
>logfiles related to your setup
I backed those up as well and have been grepping through them.
* There is only one thing suspicious (and older than the issue) in
ldirectord.log:
Feb  1 13:20:10 localhost ldirectord[17133]: Exiting with exit_status
-1: Could not run /sbin/ipvsadm -L -n
* The following is also unusual in messages - the martian source
corresponds to the IP of the web server which caught all the traffic:
Feb  4 23:00:08 localhost heartbeat[16762]: WARN: string2msg_ll: node
[load2] failed authentication
Feb  4 23:00:08 localhost heartbeat[16762]: WARN: string2msg_ll: node
[load2] failed authentication
Feb  4 23:15:04 localhost kernel: martian source 213.x.y.33 from 213.x.y.40, on dev eth0
Feb  4 23:15:04 localhost kernel: ll header: ff:ff:ff:ff:ff:ff:00:06:5b:8c:f8:c5:08:06

>and if possible, enable vs_debug and dump the kernel log output somewhere.
You mean changing /proc/sys/net/ipv4/vs/debug_level to a higher level?
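
Something like this, I suppose (assuming the kernel was built with
CONFIG_IP_VS_DEBUG, otherwise the entry doesn't exist):

  # raise IPVS debugging verbosity (0 = off, 12 = most verbose)
  echo 12 > /proc/sys/net/ipv4/vs/debug_level
  # the output goes to the kernel log (kern.* in syslog)
  tail -f /var/log/kern.log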

>> Considering I was in an urgent situation, I couldn't have much time to
>> investigate more - what I did to go back up was a stop / start of
>> heartbeat, in the meantime the second load balancer would have taken
>> over the situation and then given it back.
>> After that everything seemed normal.
>Strange. Looks like some kind of soft deadlock.
Or an ARP issue - the stop/start makes the other load balancer take over
the IP and then hand it back.
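
If it really is a stale ARP entry upstream, I guess I could also flush it
out without bouncing heartbeat, something like the following (heartbeat
normally does the same thing with send_arp when it acquires the address;
eth0 and <VIP> are placeholders):

  # send a few gratuitous ARPs for the VIP from the active director
  arping -U -c 3 -I eth0 <VIP>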

>> A quick investigation of the logs didn't revel anything strange (I
>> copied everything I could for further investigation), appart from the
>> following (one line only) :
>> Redirect from 213.255.89.122 on eth0 about 213.255.89.128 ignored.
>>   Advised path = 213.x.y.k (load2) -> 213.255.89.128, tos 00
>Ahh, so you have NOTRACK enabled? And someone is doing funky routing 
>tricks on your collision domain. What are your icmp related proc-fs 
>settings?
Yes, someone is playing around; I have a lot of those.
But it shouldn't be an issue.
Something to consider as well: I have the hidden flag set for ARP on my web
servers, so they don't answer ARP queries on interfaces which don't have an
IP matching the query (lo:0 holds my virtual IP on the web servers).
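
For reference, this is roughly what I mean on the real servers - the hidden
flag on kernels carrying Julian's hidden patch, or the arp_ignore /
arp_announce equivalent on plain 2.6 (paths from memory, worth
double-checking against the actual boxes):

  # "hidden" patch style: don't answer ARP for addresses held on lo
  echo 1 > /proc/sys/net/ipv4/conf/all/hidden
  echo 1 > /proc/sys/net/ipv4/conf/lo/hidden
  # plain 2.6 equivalent: ignore ARP queries for the VIP configured on lo
  echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
  echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce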

>grep . /proc/sys/net/ipv4/icmp*
Posted here: <http://paste.lisp.org/display/16545>
>grep . /proc/sys/net/ipv4/conf/{all,eth0,eth1}/*
Posted here (it's long): <http://paste.lisp.org/display/16546>
eth0 is external, eth1 is internal

Maybe the server info could be informative; here is the load balancer
sysctl.conf:
<http://paste.lisp.org/display/16547>
The "old servers" sysctl (those get their traffic via their external
interface for historical reasons, I have to change that - eth0 internal,
eth1 external):
<http://paste.lisp.org/display/16549>
The "new servers" sysctl (those get their traffic via an internal LAN and
answer externally - eth0 external, eth1 internal):
<http://paste.lisp.org/display/16548>


>> ttyS0: 1 input overrun(s) (more of those)
>Is this your heartbeat?
Yes, via serial (plus broadcast). Not sure why I get that message (I will
check it out once the more problematic stuff is fixed).
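
For what it's worth, the serial link is set up along these lines in ha.cf
(values here are placeholders, not the production file); I will try a lower
baud rate to see whether the overruns go away:

  # /etc/ha.d/ha.cf - heartbeat over serial plus broadcast
  serial    /dev/ttyS0
  baud      19200
  bcast     eth1
  keepalive 2
  deadtime  10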

>> As Jan said, any help is appreciated, and thanks for reading this
>> boring mail :D
>> (Which will hopefully be less boring if we find the cause of the problem)
>This is of course not boring. Please share some more of your logs if you 
>still have them, especially heartbeat log entries.
Thanks a lot :)
Here are the heartbeat logs (light, as debug was not enabled) - they don't
look unusual:
<http://paste.lisp.org/display/16550>

Thanks,
--
Mathieu Masseboeuf

