Re: Major issue with LVS-DR when a server gets overloaded

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: Major issue with LVS-DR when a server gets overloaded
From: Matthias Saou <thias@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 15 Feb 2007 11:33:01 +0100
Roberto Nibali wrote :

> Either a massive bug in the ServerIron Firmware or a configuration 
> glitch on your side. Care to post the relevant part of the configuration?

In the ServerIron, each of the 6 real servers looks like this :

server real server01.domain x.x.x.41
 port default disable
 weight 10 0
 port http
 port http keepalive
 port http url "GET /alarm/"
 port http status_code  200 299
!

And the virtual server :

server virtual virtual.domain x.x.x.225
 port default dsr
 port http sticky
 port http dsr
 bind http server01.domain http server02.domain http server03.domain http server04.domain http
 bind http server05.domain http server06.domain http
!

The similar configuration with LVS (using keepalived) :

virtual_server x.x.x.229 80 {
    delay_loop 6
    lb_algo rr
    lb_kind DR
    persistence_timeout 30
    protocol TCP

    real_server x.x.x.41 80 {
        weight 10
        HTTP_GET {
            url {
                path /alarm/
                status_code 200
            }
            connect_timeout 5
            nb_get_retry 2
            delay_before_retry 5
        }
    }

! etc. for all other 5

}

> How exactly do you get your RS to dynamically switch from HTTP response 
> code 200 to 500? Have you checked the HTTP response header using a CLI 
> tool like curl, lynx or wget?

Various ways. I'm using lighttpd with PHP as FastCGI, so the check
requests an /alarm/index.php script :
- I get a 500 from lighttpd if the PHP backend is overloaded or dead
And I've now extended this PHP script to send 500s in more situations,
in order to avoid "flip-flopping" :
- I get a 500 from the script if the main db connection is down
- I get a 500 from the script if the server's 1min avg load is > 20

I've checked with "curl -I" and get the status I expect in every case.
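
For illustration, the decision logic of the check described above can be
sketched roughly as follows. This is only a sketch, written in Python
rather than the actual PHP; the names alarm_status, backend_ok, db_ok and
LOAD_LIMIT are hypothetical, and in the real setup the "backend dead"
case is answered by lighttpd itself, not by the script.

```python
import os

# Assumption: same limit as described above (1min avg load > 20 means 500).
LOAD_LIMIT = 20.0


def alarm_status(backend_ok, db_ok, load_1min, load_limit=LOAD_LIMIT):
    """Return the HTTP status the /alarm/ check should answer with."""
    if not backend_ok:
        # In the real setup lighttpd answers 500 here, before any script runs.
        return 500
    if not db_ok:
        # Main db connection is down.
        return 500
    if load_1min > load_limit:
        # Server is considered overloaded.
        return 500
    return 200


def current_load_1min():
    """1-minute load average of this host (Unix only)."""
    return os.getloadavg()[0]
```

Something like alarm_status(True, True, current_load_1min()) would then
give the status to send. Note that a load hovering right around the limit
would still make the result alternate between 200 and 500.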

> > I am really confused. I've tried stopping the web daemon on one of the
> > real servers under production load, and it gets taken out as expected,
> > and all keeps working fine. It seems that only when the web server still
> > responds with 500 status and gets detected as down, then up, then down
> > again etc. does the problem appear. Note that the setup can work fine
> > for hours and hours, the issue only appears when a real server has a
> > problem.
> 
> This however sounds more like a "flapping" or threshold ping-pong issue.
> 
> > I would like to have tried some kind of "keep the real server disabled
> > for n seconds when it's detected as down" in order to keep the check
> > from flip-flopping like this, but there is no such setting in
> > keepalived AFAICS.
> 
> Would it be possible and good enough for you to use the threshold 
> limitation feature by setting an upper and lower threshold for the 
> amount of active + inactive connections?

I've got a bit more information after running LVS for the past few weeks
(without sending any real traffic to its virtual server IP address,
though; I'm currently still using the ServerIron's virtual IP address).
I keep getting read timeouts from keepalived, so there already seems to
be an issue at a higher level. The ServerIron reports no similar
timeouts against the same servers, which are running fine.
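
One way to look into such timeouts independently of keepalived would be
to time the check request by hand. The following is a minimal sketch
only, in Python; the probe function is hypothetical, and the 5 second
default is merely chosen to match the connect_timeout in the keepalived
configuration above. It says nothing about how keepalived itself
implements its timeouts.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Fetch url once and return (status, elapsed_seconds).

    status is the HTTP status code, or None if no HTTP response arrived
    (connection refused, connect timeout or read timeout).
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the code is still a valid answer.
        status = err.code
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        status = None
    return status, time.monotonic() - start
```

Calling probe("http://x.x.x.41/alarm/") in a loop every few seconds and
logging the elapsed time should show whether the responses get anywhere
near the timeout.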

Anyhow, this is something I definitely need to fix before digging any
further into the LVS issue I reported initially.

Thanks for your answers.

Matthias

-- 
Clean custom Red Hat Linux rpm packages : http://freshrpms.net/
Fedora Core release 6 (Zod) - Linux kernel 2.6.19-1.2895.fc6
Load : 0.48 0.39 0.40
