Re: Major issue with LVS-DR when a server gets overloaded

To:	"LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject:	Re: Major issue with LVS-DR when a server gets overloaded
From:	Roberto Nibali <ratz@xxxxxxxxxxxx>
Date:	Fri, 16 Feb 2007 11:38:10 +0100

Hello,

Either a massive bug in the ServerIron firmware or a configuration glitch on your side. Care to post the relevant part of the configuration?
In the ServerIron, each of the 6 real servers looks like this :

server real server01.domain x.x.x.41
 port default disable
 weight 10 0
 port http
 port http keepalive
 port http url "GET /alarm/"

And this automatically gets /alarm/index.php, as per the configuration on your lighttpd server?

 port http status_code  200 299
!
And the virtual server :

server virtual virtual.domain x.x.x.225
 port default dsr
 port http sticky
 port http dsr
 bind http server01.domain http server02.domain http server03.domain http server04.domain http
 bind http server05.domain http server06.domain http
!

I don't exactly remember the FSM on the ServerIron hardware, and unfortunately these days one does not get access to their documentation anymore without a KP id :(. However, your configuration looks pretty straightforward and should definitely work. I'm just not sure if the ServerIron OS distinguishes between "no HTTP response" and "unexpected HTTP response"?

What happens if you modify your PHP health check status script to actually set code 500 for all HTTP requests? Do any of the RS get set up, either with the ServerIron or the LVS?

The similar configuration with LVS (using keepalived) :

I'm not too familiar with the inner workings of keepalived, so maybe Alexandre should cast an eye over this as well.

virtual_server x.x.x.229 80 {
    delay_loop 6


This seems pretty short, considering you've 6 RS to check.

    lb_algo rr
    lb_kind DR
    persistence_timeout 30
    protocol TCP

    real_server x.x.x.41 80 {
        weight 10
        HTTP_GET {
            url {
                path /alarm/
                status_code 200
            }
            connect_timeout 5
            nb_get_retry 2
            delay_before_retry 5
        }
    }

! etc. for all other 5

}

How exactly do you get your RS to dynamically switch from HTTP response code 200 to 500? Have you checked the HTTP response header using a CLI tool like curl, lynx or wget?


Various ways. I'm using lighttpd with PHP as FastCGI, so by checking
a /alarm/index.php script :
- I get a 500 from lighttpd if the PHP backend is overloaded or dead
And right now I've extended this PHP script to keep sending 500s in
more situations, in order to avoid "flip-flopping" :
- I get a 500 from the script if the main db connection is down
- I get a 500 from the script if the server's 1min avg load is > 20
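
The decision logic listed above could be sketched roughly as follows. This is only an illustration in Python, not the PHP script from the thread: the function name `health_status` and its parameters are hypothetical, and in the real setup the "backend dead" 500 comes from lighttpd itself rather than from the script.

```python
import os

# 1-minute load average limit mentioned in the thread.
LOAD_LIMIT = 20.0


def health_status(backend_alive, db_connected, load_1min, load_limit=LOAD_LIMIT):
    """Return the HTTP status code the health-check URL should answer with.

    backend_alive: stands in for "the PHP backend is responding" (in the
                   thread, lighttpd itself answers 500 when it is not).
    db_connected:  whether the main database connection is up.
    load_1min:     the server's 1-minute load average.
    """
    if not backend_alive:
        return 500
    if not db_connected:
        return 500
    if load_1min > load_limit:
        return 500
    return 200


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    current_load = os.getloadavg()[0]
    # The two True values are placeholders for real backend and database probes.
    print(health_status(True, True, current_load))
```

Note that a load of exactly 20 still yields 200 here, since the condition in the list is "> 20".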

So what happens if you shut down all your DBs and restart keepalived? What does the ipvsadm -Ln output look like?

I've checked with "curl -I" and get the status I expect in every case.

Ok.

I would like to have tried some kind of "keep the real server disabled
for n seconds when it's detected as down" in order to keep the check
from flip-flopping like this, but there is no such setting in
keepalived AFAICS.
Would it be possible and good enough for you to use the threshold limitation feature by setting an upper and lower threshold for the number of active + inactive connections?
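
For illustration, the "keep the real server disabled for n seconds" idea wished for above might look like the following hold-down sketch. It is plain Python and is not a keepalived or IPVS feature; the class name `HoldDown` and its method are hypothetical, and the caller is assumed to supply the current time.

```python
class HoldDown:
    """Keep a server out of the pool for hold_seconds after a failed check."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.down_until = None  # time before which the server stays disabled

    def update(self, check_ok, now):
        """Feed one health-check result; return True if the server may be in the pool."""
        if not check_ok:
            # Every failure (re)starts the hold-down window.
            self.down_until = now + self.hold_seconds
            return False
        if self.down_until is not None and now < self.down_until:
            # The check passed, but the hold-down window has not expired yet.
            return False
        self.down_until = None
        return True
```

In real use `now` would typically come from `time.monotonic()`, so that a single passing check right after a failure does not immediately re-enable the server.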


I've got a bit more information after running LVS for the past weeks
(without sending any real traffic to the virtual server IP address,
though, I use the ServerIron's virtual IP address currently). I keep
getting read timeouts from keepalived, so at a higher level it seems
that there already is an issue. The ServerIron reports no similar
timeouts against the same servers, which are running fine.


Health check read timeouts?

Anyhow, this is something I definitely need to fix before digging any
more about the LVS issue I reported initially.


Fair enough. Good luck,
Roberto Nibali, ratz
--

echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
