On Sat, 2008-11-01 at 06:39 -0700, Robinson, Eric wrote:
> > ipvsadm -L -n --stats
> > ipvsadm -L -n --rate
>
> The --stats switch gives totals for packets in/out.
> The --rate switch shows pps, but the man page does not say what the
> averaging period is. I'll try it during a busy production day and see
> what I get.
From ip_vs_est.c:

/*
  This code is to estimate rate in a shorter interval (such as 8
  seconds) for virtual services and real servers. For measuring rate
  over a long interval, it is easy to implement a user-level daemon
  which periodically reads those statistical counters and measures the
  rate.

  Currently, the measurement is activated by the slow timer handler.
  Hope this measurement will not introduce too much load.

  We measure rate during the last 8 seconds every 2 seconds:

    avgrate = avgrate*(1-W) + rate*W

    where W = 2^(-2)

  NOTES.

  * The stored value for average bps is scaled by 2^5, so that the
    maximal rate is ~2.15 Gbit/s; average pps and cps are scaled by
    2^10.

  * A lot of code is taken from net/sched/estimator.c
*/
I don't believe that's changed much recently, if at all.
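To make the smoothing concrete, here's a minimal sketch of that
estimator in Python (my own variable names; the kernel's fixed-point
scaling is ignored):

W = 0.25  # 2^(-2), the kernel's smoothing weight

def update_avgrate(avgrate, rate):
    # Fold one 2-second sample into the running average -- exactly the
    # avgrate = avgrate*(1-W) + rate*W update from the comment above.
    return avgrate * (1 - W) + rate * W

# Example: traffic steps from 0 to a steady 1000 pps.
avg = 0.0
for tick in range(8):
    avg = update_avgrate(avg, 1000.0)
    print("after %2ds: %7.1f pps" % (2 * (tick + 1), avg))

After a step change the average reaches about 68% of the new rate
within 8 seconds and about 90% within 16, which is the sense in which
the man page's unstated averaging period is "about 8 seconds".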
> I changed it to 5 seconds, but no significant change was apparent. Then
> I changed it to 10 seconds and there was a definite, observable drop in
> CPU utilization. A graph of the past 6 hours shows that usage has
> flattened out and is now averaging less than 10%.
Can you please put the graphs you sent separately up on a webserver, so
we can see what you're describing? [From a list-admin perspective, the
second message encoded to just over a megabyte, which is pretty
expensive when sending to over 1200 recipients...]
It's good, though, that it made a difference because we have now
narrowed the cause of the problem down to health checks.
> I don't know. These are medical applications, and doctors are often
> grumpy about transient glitches in their applications while trying to
> document patient encounters. I'm thinking something like this might
> work for the short term:
>
> checkinterval=10
> checktimeout=5
> negotiatetimeout=8
> checkcount=1
I guess if this is acceptable then... it's acceptable.
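For reference, my back-of-envelope reading of what those numbers buy
you, assuming one check per service every checkinterval seconds and
checkcount failures to take a realserver out (verify that against the
man page):

checkinterval = 10    # seconds between check sweeps
checktimeout = 5      # seconds before a plain connect check fails
negotiatetimeout = 8  # seconds before a negotiate check fails
checkcount = 1        # failed checks needed to mark a realserver down

# Worst case: a server dies just after a check passes, a full interval
# elapses, and the next check must itself time out before acting.
print(checkcount * checkinterval + checktimeout)      # ~15s (connect)
print(checkcount * checkinterval + negotiatetimeout)  # ~18s (negotiate)

So a dead realserver stays in rotation for roughly 15-18 seconds at
worst, which may or may not count as a transient glitch to a grumpy
doctor.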
> But it would still be a temporary solution and this raises a general
> question about load-balancer scaling. Right now I have 120 VS, but in a
> year or two it will be 240. In 4 years it could be 500+. I can't just
> keep increasing the checkinterval. Ultimately, I'm going to have to try
> multiple instances of ldirectord.
Or try one of the alternatives, like keepalived. I've no idea how that
scales out to 187 checks (or more), mind you.
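The real scaling limit, as I understand ldirectord, is that it runs its
checks serially in one process, so a sweep over N services can overrun
checkinterval badly once a few servers stop answering. A rough model,
with illustrative numbers rather than measurements:

healthy = 0.05     # seconds for a quick successful connect check
checktimeout = 5   # seconds burned on each unresponsive service

def sweep_seconds(n_services, n_down=0):
    # Assumes fully serial checking, one service after another.
    return (n_services - n_down) * healthy + n_down * checktimeout

print(sweep_seconds(120))     # ~6s:  already near a 10s checkinterval
print(sweep_seconds(120, 3))  # ~21s: three dead servers overrun it
print(sweep_seconds(500, 3))  # ~40s: at the projected 500+ scale

Splitting across multiple instances, or parallelising the checks,
changes that picture completely: the sweep cost drops to roughly the
slowest single check.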
> Which raises a question about LVS. Could it get confused with multiple
> ldirectord instances constantly forking ipvsadm?
As long as they are managing discrete pools of virtual & real servers,
then no, I don't think it will *unless* you hit the problem someone else
reported very recently, where realservers seem to migrate between
virtuals at random. Horms was going to try to work on that, but it might
be tricky to isolate.
For such a large number of realservers I think you may need to get
creative with your healthchecking. You could use ldirectord's
"checkcommand" setting to read a value from a file that is kept updated
by a separate script which does its checks in parallel. Unfortunately I
can't pull one of those out of a hat right now... :)
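But to sketch the shape of it -- everything below is illustrative (the
paths, the realserver list, and my assumption that checkcommand treats
exit status 0 as healthy; check the man page before trusting that):

#!/usr/bin/env python
# Hypothetical out-of-band checker: probe every realserver in parallel
# and write one status file per host:port, so the command ldirectord
# runs per check is reduced to a near-instant file read.
import socket
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

STATUS_DIR = Path("/var/run/rs-status")               # made-up path
REALSERVERS = [("10.0.0.11", 80), ("10.0.0.12", 80)]  # ...and the rest

def probe(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except OSError:
        return "down"

def sweep():
    STATUS_DIR.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=50) as pool:
        futures = {pool.submit(probe, h, p): (h, p) for h, p in REALSERVERS}
        for fut, (host, port) in futures.items():
            path = STATUS_DIR / ("%s:%d" % (host, port))
            path.write_text(fut.result() + "\n")

if __name__ == "__main__":
    sweep()

Run that every few seconds from cron or a loop; the whole sweep costs
roughly one probe timeout no matter how many realservers there are, and
the checkcommand itself just reads the matching file and exits 0 if it
says "up".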
Thinking about it laterally, how does something like Nagios cope with a
very large number of service checks? It does them in parallel by
running multiple threads. So do OpenNMS, Zabbix, and in fact pretty
much every one of the decent (fsvo "decent") NMS apps I've ever used.
Making ldirectord threaded and parallel, however, isn't likely to start
working straight away! Anyone fancy a stab at that?
Graeme