> I've set up a test set of servers in our lab to play with the LVS stuff
> to see why we're running into file-handle issues like we have been and
> I've discovered something really interesting. (At least to me...) For
> our lab setup, we only have a primary LVS server, but still have a lot
> of services and whatnot the same way we have at the client site and we
> don't have a problem at all with the LVS box using up file-handles. It
> seems to be related to the communication between the primary and backup
> LVS boxes. Since we're using the RedHat HA code, I'm going to guess
> that there's something in 'pulse' for doing monitoring of primary/backup
> LVS boxes that has a leak somewhere. I'll try to dig up another box to
> set up as a secondary LVS server and see if I can recreate the situation
> exactly in our lab.
Do an lsof and see what pulse is doing with all of those handles.
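
If lsof isn't handy on the box, roughly the same picture can be pulled
straight out of /proc. What follows is only a sketch, not lsof itself: it
assumes a Linux kernel that exposes /proc/<pid>/fd and /proc/<pid>/comm,
that it is run as root (otherwise another user's fd directory is
unreadable), and that the daemon's command name really is "pulse". The
function names are made up for this example.

```python
import os
from collections import Counter


def classify(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def fd_summary(pid: int) -> Counter:
    """Count the open descriptors of one process by category (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        counts[classify(target)] += 1
    return counts


def find_pids(command: str) -> list:
    """Return the PIDs whose /proc/<pid>/comm equals `command`."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as fh:
                if fh.read().strip() == command:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited, or we lack permission
    return pids


if __name__ == "__main__":
    # "pulse" is an assumption about the process name; adjust as needed.
    for pid in find_pids("pulse"):
        print(pid, dict(fd_summary(pid)))
```

If nearly everything shows up as sockets, that would point at the
director-to-director chatter; if it's mostly plain files, the leak is
probably somewhere else.
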
I don't talk much with the RedHat-HA folks -- but does anyone who might be
reading this know if much testing was done with respect to redundant load
balancers using the RedHat-HA model?
> pulse/LVS/nanny start up and allocate a buttload of file-handles on both
> the primary and backup servers (~20000 in our case)
> over the course of a day or two the primary server slowly releases
> handles until the number drops to something reasonable (in our case,
> it's gone from ~20000 to 47 in about 36 hours)
Maybe it has something to do with the communication between the two
directors. Perhaps the directors aggressively open channels to each other,
most of those attempts time out, and only the channels that are actually in
use survive the timeouts.
> the backup still has waaaaay too many file-handles allocated
> if the machines switch duties, the behaviour stays consistent with
> functionality (i.e. the machine that's now primary slowly releases
> handles, etc)
Again -- perhaps when the backup server switches over, it tries to open up
communication with the primary again to find out when it should shut up and
be quiet.
Back before heartbeat was as advanced as it is now, I developed a small
"etherbeat" script that I run on both of my directors. It's designed so that
both LinuxDirectors run as peers. The two hold an election, and the winner
becomes the master. When the master dies, it fails to vote in the next
election, so the slave wins and takes over. This allows me to put the exact
same setup on both LinuxDirectors and simply turn one on a few moments after
the other. Both come up... and when one goes down, the other one goes into
master mode. It requires very few resources and operates in-band.
Some people like an out-of-band solution, or a mixture of the two. In my
setup, where I only have one NIC in each director, if the network goes down
on one, the other will detect it and take over immediately... It works.
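
To make the peer election idea concrete, here is a rough sketch of the
logic. This is not the etherbeat script itself, just an illustration of
the behaviour described above. Everything specific in it is an assumption
for the example: the Peer class, the numeric node_id used as a
tie-breaker (higher wins), and the 3-second timeout. Getting the votes
onto the wire is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional

MASTER, SLAVE = "master", "slave"


@dataclass
class Peer:
    node_id: int                        # hypothetical tie-breaker: higher ID wins
    timeout: float = 3.0                # silence (seconds) before the peer is presumed dead
    started_at: float = 0.0             # when this director came up
    role: str = SLAVE
    last_heard: Optional[float] = None  # time of the last vote received
    peer_id: Optional[int] = None       # ID carried in that vote

    def receive_vote(self, peer_id: int, now: float) -> None:
        """Record a vote (heartbeat) from the other director."""
        self.peer_id = peer_id
        self.last_heard = now

    def elect(self, now: float) -> str:
        """Run one election round and return the resulting role."""
        if self.last_heard is None:
            # Never heard from the peer: wait out one timeout, then take over.
            if now - self.started_at > self.timeout:
                self.role = MASTER
        elif now - self.last_heard > self.timeout:
            # The peer failed to vote in time, so it is presumed dead.
            self.role = MASTER
        else:
            # Both directors are voting: the higher ID wins.
            self.role = MASTER if self.node_id > self.peer_id else SLAVE
        return self.role


if __name__ == "__main__":
    a, b = Peer(node_id=1), Peer(node_id=2)
    a.receive_vote(b.node_id, now=1.0)
    b.receive_vote(a.node_id, now=1.0)
    print("both up:", a.elect(now=1.5), b.elect(now=1.5))
    # b goes silent; once the timeout passes, a promotes itself.
    print("b silent:", a.elect(now=10.0))
```

Because both directors run identical code and only the ID differs, the
same setup can go on both boxes, which is the point of running them as
peers.
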
Anyway -- it'd be interesting to see what lsof says pulse is using those
filehandles for.
All the best --
Ted