Re: Major issue with LVS-DR when a server gets overloaded

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: Major issue with LVS-DR when a server gets overloaded
From: Roberto Nibali <ratz@xxxxxxxxxxxx>
Date: Fri, 16 Feb 2007 11:38:10 +0100

Hello,

Either a massive bug in the ServerIron Firmware or a configuration glitch on your side. Care to post the relevant part of the configuration?

In the ServerIron, each of the 6 real servers looks like this :

server real server01.domain x.x.x.41
 port default disable
 weight 10 0
 port http
 port http keepalive
 port http url "GET /alarm/"

And this automatically gets /alarm/index.php as per configuration on your lighttpd server?

 port http status_code  200 299
!
And the virtual server :

server virtual virtual.domain x.x.x.225
 port default dsr
 port http sticky
 port http dsr
 bind http server01.domain http server02.domain http server03.domain http server04.domain http
 bind http server05.domain http server06.domain http
!

I don't exactly remember the FSM on the ServerIron hardware, and unfortunately these days one does not get access to their documentation anymore without a KP id :(. However, your configuration looks pretty straightforward and should definitely work. I'm just not sure whether the ServerIron OS distinguishes between getting no HTTP response at all and getting an HTTP response with an unexpected status code.
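
To make that distinction concrete, here is a minimal sketch in Python of a probe that reports the two failure modes separately. It is not how the ServerIron or keepalived implement their checks; the function name `probe`, the result labels and the default 200-299 range are assumptions chosen for illustration (the range mirrors the `status_code 200 299` line above).

```python
import http.client


def probe(host, port, path, timeout=5, ok_range=(200, 299)):
    """Classify one HTTP health check.

    Returns ("up", code) for a status inside ok_range,
    ("bad-status", code) when the server answered with another status,
    and ("no-response", None) when no HTTP answer arrived at all
    (connection refused, connect/read timeout, broken response).
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        code = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Socket-level failures and malformed answers both end up here.
        return ("no-response", None)
    finally:
        conn.close()

    lowest, highest = ok_range
    if lowest <= code <= highest:
        return ("up", code)
    return ("bad-status", code)
```

Calling `probe("x.x.x.41", 80, "/alarm/")` with a real address in place of the masked one would show whether a failing RS answers with a 500 or does not answer in time, which is the difference that matters for the read timeouts discussed further down.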

What happens if you modify your PHP health check status script to actually set code 500 for all HTTP requests? Do any of the RS get set up, either with the ServerIron or the LVS?

The similar configuration with LVS (using keepalived) :

I'm not too familiar with the inner workings of keepalived, so maybe Alexandre should take a look at this as well.

virtual_server x.x.x.229 80 {
    delay_loop 6

This seems pretty short, considering you have 6 RS to check.

    lb_algo rr
    lb_kind DR
    persistence_timeout 30
    protocol TCP

    real_server x.x.x.41 80 {
        weight 10
        HTTP_GET {
            url {
                path /alarm/
                status_code 200
            }
            connect_timeout 5
            nb_get_retry 2
            delay_before_retry 5
        }
    }

! etc. for all other 5

}

How exactly do you get your RS to dynamically switch from HTTP response code 200 to 500? Have you checked the HTTP response header using a CLI tool like curl, lynx or wget?

Various ways. I'm using lighttpd with PHP as FastCGI, so by checking
a /alarm/index.php script :
- I get a 500 from lighttpd if the PHP backend is overloaded or dead
And right now I've extended this PHP script to keep sending 500s in
more situations, in order to avoid "flip-flopping" :
- I get a 500 from the script if the main db connection is down
- I get a 500 from the script if the server's 1min avg load is > 20
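
The script-level rules in the list above boil down to a small decision function. The following is only a sketch, written in Python rather than PHP, and not the actual /alarm/index.php; the function name `alarm_status` and the `db_ok` flag are hypothetical, and only the load limit of 20 comes from the description above. The first case in the list (lighttpd itself returning 500 when the PHP backend is dead) happens before any script runs and is therefore not modelled.

```python
LOAD_LIMIT = 20.0  # 1-minute load average limit quoted above


def alarm_status(db_ok, load_1min, load_limit=LOAD_LIMIT):
    """Return the HTTP status code the health check URL should answer with."""
    if not db_ok:
        return 500  # main db connection is down
    if load_1min > load_limit:
        return 500  # server is considered overloaded
    return 200  # everything the script checks looks fine
```

On a Unix host the load value could be taken from `os.getloadavg()[0]`. Note that a single hard limit like this has no memory, so a load hovering around 20 still makes the status toggle between 200 and 500 from one check to the next.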

So what happens if you shut down all your DBs and restart keepalived? What does the ipvsadm -Ln output look like?

I've checked with "curl -I" and get the status I expect in every case.

Ok.

I would like to have tried some kind of "keep the real server disabled
for n seconds when it's detected as down" in order to keep the check
from flip-flopping like this, but there is no such setting in
keepalived AFAICS.

Would it be possible and good enough for you to use the threshold limitation feature by setting an upper and a lower threshold for the number of active + inactive connections?
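
The reason two thresholds help against flip-flopping is hysteresis: a server that went over the upper limit only comes back once it has dropped below the lower one. ipvsadm(8) documents these as --u-threshold and --l-threshold. The sketch below only illustrates that idea in Python; it is not the IPVS kernel code, and the class name `ThresholdGate` and the example numbers are made up.

```python
class ThresholdGate:
    """Hysteresis on a connection count, using an upper and a lower threshold."""

    def __init__(self, upper, lower):
        if lower >= upper:
            raise ValueError("lower threshold must be below upper threshold")
        self.upper = upper
        self.lower = lower
        self.overloaded = False

    def update(self, connections):
        """Feed the current active + inactive connection count.

        Returns True if the server may receive new connections.
        """
        if connections > self.upper:
            self.overloaded = True  # stop scheduling new connections
        elif connections < self.lower:
            self.overloaded = False  # enough connections drained, re-enable
        # Between the two thresholds the previous state is kept.
        return not self.overloaded


gate = ThresholdGate(upper=100, lower=80)
for count in (50, 101, 90, 79, 90):
    print(count, gate.update(count))
```

With these example values a count of 90 is refused while the server is still recovering from 101, but accepted again once the count has been below 80 in between.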

I've got a bit more information after running LVS for the past weeks
(without sending any real traffic to the virtual server IP address,
though, I use the ServerIron's virtual IP address currently). I keep
getting read timeouts from keepalived, so at a higher level it seems
that there already is an issue. The ServerIron reports no similar
timeouts against the same servers, which are running fine.

Health check read timeouts?

Anyhow, this is something I definitely need to fix before digging any
more about the LVS issue I reported initially.

Fair enough. Good luck,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
