Alexandre Cassen wrote (at Fri, Feb 16, 2007 at 10:08:54AM +0100):
> Hi Casey,
>
> On Thu, 2007-02-15 at 22:14 -0600, Casey Zacek wrote:
> > I tried loading up a keepalived.conf with about 1400 real servers
> > total, and the last 200-300 of them didn't get any health checks.
> > There was no error logged that I could find. All of the healthchecks
> > were "Activated" in the log, but the bottom chunk of them just didn't
> > seem to happen -- all of the RSes in question were not answering HTTP,
> > so they should have failed the checks, but only the first 1200 or so
> > (the number would vary a little with each restart of keepalived)
> > actually had any results show in the log or in 'ipvsadm -ln'.
> >
> > What are the limits of keepalived? What things can be adjusted to get
> > more out of it?
These may be worthy of note as well:
Feb 15 01:25:47 lvs1 Keepalived_vrrp: Netlink: skipping nl_cmd msg...
Lots and lots of them. Sorry for leaving them out before. The
digging around I did suggested that they were unrelated. I could be
wrong -- it was late.
> hmmm... provide your conf file please,
Unfortunately, I can't divulge my config. I can give a general
synopsis, though. My working config has 211 real_servers total. To
this, I added the new configuration: a VRRP block with one
virtual_ipaddress and 297 virtual_ipaddress_excluded entries, followed
by 298 virtual_server blocks like this one:
virtual_server fwmark 200 {
    delay_loop 20
    lb_algo wlc
    lb_kind TUN
    ! persistence_timeout 1200
    protocol TCP
    virtualhost sitename.com
    real_server a.b.c.10 0 {
        weight 50
        HTTP_GET {
            url {
                path /
                status_code 200
            }
            connect_port 80
            connect_timeout 12
            nb_get_retry 2
            delay_before_retry 1
        }
    }
    real_server a.b.c.11 0 {
        weight 50
        HTTP_GET {
            url {
                path /
                status_code 200
            }
            connect_port 80
            connect_timeout 12
            nb_get_retry 2
            delay_before_retry 1
        }
    }
    real_server a.b.c.12 0 {
        weight 50
        HTTP_GET {
            url {
                path /
                status_code 200
            }
            connect_port 80
            connect_timeout 12
            nb_get_retry 2
            delay_before_retry 1
        }
    }
    real_server a.b.c.13 0 {
        weight 50
        HTTP_GET {
            url {
                path /
                status_code 200
            }
            connect_port 80
            connect_timeout 12
            nb_get_retry 2
            delay_before_retry 1
        }
    }
}
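As a rough tally of what the synopsis above adds up to, here is a
back-of-the-envelope sketch in Python. The counts are the ones quoted
above; the variable names are made up for illustration, and it assumes
every new virtual_server block carries four real_servers like the
sample does.

```python
# Counts taken from the synopsis above; names are illustrative only.
existing_real_servers = 211   # real_servers in the working config
new_virtual_servers = 298     # virtual_server blocks added
real_servers_per_block = 4    # assumption: a.b.c.10 through a.b.c.13, as in the sample

new_real_servers = new_virtual_servers * real_servers_per_block
total_real_servers = existing_real_servers + new_real_servers

print(new_real_servers)    # 1192
print(total_real_servers)  # 1403
```

That lines up with the "about 1400 real servers" figure, each with its
own HTTP_GET check.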
> most of the time configuration
> can be cleaned up by using _server_pool keywords (cf.
> keepalived.conf.SYNOPSIS).
cz:keepalived-1.1.13/doc% grep -i pool **/*(.)
cz:keepalived-1.1.13/doc%
What is this "_server_pool" of which you speak?
> But I recall a huge working conf with around
> 2000 realservers.
I did try again with 'ulimit -n 4096' on a hunch. This resulted in
the healthchecker crashing over and over and over again:
Feb 15 02:10:38 lvs1 Keepalived: Healthcheck child process(4548) died: Respawning
> please provide a strace of the select() system call... select is great
> with not a lot of fds, and with around 1000 fds it's still OK... beyond
> that, poll/epoll must be used (not yet implemented)
Do you have a specific strace command-line you'd like me to use?
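
On the select() point, a small sketch of the kind of limit being
discussed. It assumes a Linux/glibc-style build where FD_SETSIZE is
1024; on such builds Python's select.select() rejects any descriptor
number at or above that value before it even makes the system call.
The function name is hypothetical, and whether keepalived's
healthchecker runs into exactly this limit is a guess, not something
confirmed here.

```python
import select


def fits_in_fd_set(fd: int) -> bool:
    """Return True if select() will accept this descriptor number.

    Assumes a POSIX build of Python, where descriptors >= FD_SETSIZE
    (typically 1024) are rejected with ValueError.
    """
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        # Descriptor number does not fit in an fd_set.
        return False
    except OSError:
        # Number is in range, but no such descriptor is open right now.
        return True
    return True


# A descriptor number well past the usual 1024 limit.
print(fits_in_fd_set(4000))
```

In C nothing performs that range check for you: POSIX leaves FD_SET()
undefined for a descriptor at or above FD_SETSIZE. That might be
related to the healthchecker dying after 'ulimit -n 4096', but that is
speculation on my part.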
--
Casey Zacek
Senior Engineer
NeoSpire, Inc.