Our LVS backends typically freeze after 5-7 days of normal operation.
These freezes are not system crashes, but it seems like all new
TCP connections towards the servers hang forever. It is impossible
to log on or perform an su (they hang), but existing sessions
function fine as long as you don't issue 'critical commands' (commands
which open a TCP connection?).
Such as logging in remotely, but not from the console...?
Remote logon via ssh does not work (it hangs, or you get a "connection
closed by remote server" type of message). Logging in via the console
does not work either (it hangs after the password has been typed in),
and su produces the same result. Existing logged-in sessions continue
to function, however. The various system status commands (top, netstat,
vmstat, ...) work fine. Commands which are known to log via syslog
(su, ssh, ...) hang, and syslogd stops writing to the syslog.
Looking at the servers' activity with top reveals nothing abnormal
-- there is no swapping, CPU usage is low, etc.
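Since everything that hangs here (ssh, console login, su) also logs via
syslog, it may be worth checking whether syslogd itself is blocked the
next time a backend freezes. A rough sketch, assuming a classic syslogd
setup (the PID and config path are placeholders for your system):

    # does a hand-written log message get through, or does it hang too?
    logger -t freezetest "test message while frozen"

    # is syslogd stuck in a system call?
    ps -C syslogd -o pid,stat,wchan
    strace -p <syslogd pid>

    # is syslogd configured to log to a remote host over the network?
    grep '@' /etc/syslog.conf

If the logger call hangs as well, that would at least narrow the problem
down to the logging path rather than the applications themselves.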
What about the various outputs from ipvsadm on the director?
Is anything monotonically increasing there (I know the problem
is on the realservers)?
The number of active connections tends to climb during the day and
decrease at night. Each backend is "taken out of business" (but not
rebooted) each night and the Tomcats restarted (all done in a graceful
way). Nothing monotonically increasing as far as we have seen, no.
We are running the "weighted round robin" load balancing algorithm, and
the director sets weight 0 (= no traffic) on a backend once it has
frozen (i.e. stopped responding to the director's status requests).
Then the number of active connections sloooowly drops.
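For reference, the director-side numbers being asked about above can be
collected with something like the following (the 60 second interval is
only an example):

    # current connection table and per-service counters
    ipvsadm -L -n
    ipvsadm -L -n --stats
    ipvsadm -L -n --rate

    # re-run periodically and look for counters that only ever go up
    watch -n 60 'ipvsadm -L -n --stats'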
Is anything monotonically increasing on the realservers (e.g. look
with netstat to see if you are running out of ports or all connections
are in FIN_WAIT)?
Netstat typically reports a normal number of connections (~100) in
various states.
"Running out of ports" sounds interesting. It sure looks like we are
running out of something. How would one go about detecting whether we
are running out of ports? Remember, it looks like we are running out of
something which causes applications to wait instead of bailing out with
an error message. Would running out of ports cause applications to wait?
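One way to look for port exhaustion on a realserver is to count sockets
per TCP state and compare the totals against the ephemeral port range.
A sketch, assuming the usual Linux netstat output (the awk field number
may need adjusting):

    # number of connections in each TCP state
    netstat -tan | awk 'NR > 2 {print $6}' | sort | uniq -c

    # the local port range available for outgoing connections
    cat /proc/sys/net/ipv4/ip_local_port_range

    # kernel-wide socket usage (in use, orphaned, time-wait, memory)
    cat /proc/net/sockstat

If the number of sockets in use gets anywhere near the size of the
local port range, that would point at port exhaustion.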
The cluster is hosting 14 websites, typically serving 1.5 million
requests on a normal day. As I said, the cluster may run fine for a
week and then suddenly the backends freeze. The funny thing is that
both backends usually freeze at roughly the same time.
Presumably they've been evenly balanced.
Yes, all backends are equally weighted. This is not a real issue; we
can handle all the traffic using one backend only (which we do from
time to time).
I take it that this has something to do with LVS, i.e. you
don't get the same behaviour with a bare single server?
You are correct. As we are in the middle of moving all our sites onto
the LVS cluster, we still have sites running on our old single-server
platform (the same hardware and software as the backends of the
cluster). Today the single server is serving about the same number of
requests as the cluster.
The things I find most odd about this are 1) it affects both backends
at the same time, and 2) applications do not die with an error message
but hang. My conclusion is that we are running out of a resource which
is consumed by the IP traffic from the director (or possibly by the
Tomcat cross-backend session-replication communication), and that it is
a resource you wait for (you don't bail out and complain).
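If that theory holds, snapshotting a few kernel resource counters on a
realserver every hour or so and comparing them across the week might
show what is being consumed. A sketch (the Tomcat PID is a placeholder):

    # file handles: allocated / free / maximum
    cat /proc/sys/fs/file-nr

    # kernel memory used by socket and TCP structures
    grep -E 'sock|tcp|skbuff' /proc/slabinfo

    # descriptors held by a Tomcat JVM (replace <tomcat pid>)
    ls /proc/<tomcat pid>/fd | wc -l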
The only cure we have come up with so far is to reboot the servers.
Once rebooted, a server will run for days again. It has occasionally
happened that the second (frozen) server has recovered once the first
server was rebooted.
Do the realservers talk to each other (e.g. have a common disk
system)?
Well, there is no disk or file sharing. The Tomcats exchange HTTP
session-replication data between backends, but our HTTP sessions
contain very little data, so that traffic should not be big.
Does anyone out there have a good idea where to look for clues as to
what may be wrong?