To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: keepalived limits
From: Casey Zacek <cz@xxxxxxxxxxxx>
Date: Fri, 16 Feb 2007 10:27:22 -0600
Alexandre Cassen wrote (at Fri, Feb 16, 2007 at 10:08:54AM +0100):
> Hi Casey,
> 
> On Thu, 2007-02-15 at 22:14 -0600, Casey Zacek wrote:
> > I tried loading up a keepalived.conf with about 1400 real servers
> > total, and the last 200-300 of them didn't get any health checks.
> > There was no error logged that I could find.  All of the healthchecks
> > were "Activated" in the log, but the bottom chunk of them just didn't
> > seem to happen -- all of the RSes in question were not answering HTTP,
> > so they should have failed the checks, but only the first 1200 or so
> > (the number would vary a little with each restart of keepalived)
> > actually had any results show in the log or in 'ipvsadm -ln'.
> > 
> > What are the limits of keepalived?  What things can be adjusted to get
> > more out of it?

These may be worthy of note as well:

Feb 15 01:25:47 lvs1 Keepalived_vrrp: Netlink: skipping nl_cmd msg...

There were lots and lots of them -- sorry for leaving them out
before.  The digging I did suggested they were unrelated, but I could
be wrong; it was late.

> hmmm... provide your conf file please,

Unfortunately, I can't divulge my config, but I can give a general
synopsis.  I started from my working config, which has 211
real_servers total, and added the new configuration on top of it:

First, a VRRP block with one virtual_ipaddress and 297
virtual_ipaddress_excluded entries (a sketch of that block follows
the virtual_server example below).  Then, 298 of these virtual_server
blocks:

virtual_server fwmark 200 {
        delay_loop 20
        lb_algo wlc
        lb_kind TUN
!       persistence_timeout 1200
        protocol TCP
        virtualhost sitename.com
        real_server a.b.c.10 0 {
                weight 50
                HTTP_GET {
                        url {
                                path /
                                status_code 200
                        }
                        connect_port 80
                        connect_timeout 12
                        nb_get_retry 2
                        delay_before_retry 1
                }
        }
        real_server a.b.c.11 0 {
                weight 50
                HTTP_GET {
                        url {
                                path /
                                status_code 200
                        }
                        connect_port 80
                        connect_timeout 12
                        nb_get_retry 2
                        delay_before_retry 1
                }
        }
        real_server a.b.c.12 0 {
                weight 50
                HTTP_GET {
                        url {
                                path /
                                status_code 200
                        }
                        connect_port 80
                        connect_timeout 12
                        nb_get_retry 2
                        delay_before_retry 1
                }
        }
        real_server a.b.c.13 0 {
                weight 50
                HTTP_GET {
                        url {
                                path /
                                status_code 200
                        }
                        connect_port 80
                        connect_timeout 12
                        nb_get_retry 2
                        delay_before_retry 1
                }
        }
}
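
For completeness, the VRRP side looked roughly like this -- a sketch
only, with made-up instance name, interface, and addresses:

vrrp_instance VI_1 {
        state MASTER
        interface eth0
        virtual_router_id 51
        priority 100
        advert_int 1
        virtual_ipaddress {
                a.b.c.1
        }
        virtual_ipaddress_excluded {
                a.b.c.10
                a.b.c.11
                ! ... 297 entries total
        }
}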

> most of the time configuration
> can be cleaned up by using _server_pool keywords (cf.
> keepalived.conf.SYNOPSYS).

cz:keepalived-1.1.13/doc% grep -i pool **/*(.)
cz:keepalived-1.1.13/doc%                       

What is this "_server_pool" of which you speak?

> But I recall a huge working conf with around
> 2000 realservers.

I did try again with 'ulimit -n 4096' on a hunch.  This resulted in
the healthchecker crashing over and over and over again:

Feb 15 02:10:38 lvs1 Keepalived: Healthcheck child process(4548) died: Respawning

> please provide a strace of the select() system call... select is fine
> with a small number of fds, but with around 1000 fds it only just
> copes... beyond that, poll/epoll must be used (not yet implemented)
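
That would square with the crashes above: glibc's fd_set is a fixed
bitmap of FD_SETSIZE (1024) bits, so once 'ulimit -n 4096' lets
descriptors climb past 1023, FD_SET() writes outside the set.  A
minimal sketch of the limit (my illustration, not keepalived code):

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
        fd_set rfds;

        FD_ZERO(&rfds);
        printf("FD_SETSIZE = %d\n", FD_SETSIZE);  /* 1024 on glibc/Linux */

        /* FD_SET(2000, &rfds) would write past the end of the bitmap:
         * undefined behavior, and a plausible way for a select()-based
         * healthchecker to die once fds exceed FD_SETSIZE. */
        return 0;
}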

Do you have a specific strace command-line you'd like me to use?
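If not, I'll guess at something along these lines, attached to the
healthchecker child (pid placeholder mine):

strace -f -tt -e trace=select -o /tmp/keepalived.strace -p <healthchecker-pid>

and watch how select() behaves as the fd count grows.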

-- 
Casey Zacek
Senior Engineer
NeoSpire, Inc.
