Re: Our LVS/DR backends freezes

To: "Joseph Mack NA3T" <jmack@xxxxxxxx>, <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: Our LVS/DR backends freezes
From: Olle Östlund <olle@xxxxxxxxxxx>
Date: Mon, 27 Nov 2006 23:04:16 +0100

Our LVS backends typically freeze after 5-7 days of normal operation.
These freezes are not system crashes, but it seems like all new
TCP connections towards the servers will hang forever. It is impossible
to log on or perform an su (they will hang), but existing sessions will
function fine as long as you don't issue 'critical commands' (commands
which open a TCP connection?).

such as logging in remotely, but not from the console...?

Remote logon via ssh does not work (it hangs, or you get a "connection closed by remote server" type of message). Logging in via the console does not work (it hangs after the password has been typed in), and using su produces the same result. Existing logged-in sessions continue to function, however. The various system status commands (top, netstat, vmstat, ...) work fine. Commands which are known to log via syslog (su, ssh, ...) will hang.

The syslogd stops writing the syslog,
etc. Looking at the servers' activity using top reveals nothing abnormal
-- there is no swapping, CPU usage is low, etc.
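One pattern worth noting in the above: the commands that hang (su, ssh, console login) all call syslog(3), and a wedged syslogd whose /dev/log socket buffer has filled up can block every such caller. A quick hedged probe, assuming the stock logger utility is installed (the 5-second timeout is an arbitrary choice):

```shell
# Try to send a test message to syslogd; if syslogd has wedged and the
# /dev/log buffer is full, logger blocks and the timeout fires.
if timeout 5 logger "syslog liveness probe" >/dev/null 2>&1; then
    echo "syslog accepting messages"
else
    echo "logger blocked or failed -- suspect syslogd"
fi
```

On a healthy box this returns instantly; on a frozen backend the timeout firing would point the finger at syslogd rather than the network stack.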

what about various outputs from ipvsadm on the director?
Anything monotonically increasing there (I know the problem
is on the realservers)?

The number of active connections tends to climb during the day and decrease at night. Each backend is "taken out of business" (but not rebooted) each night and its Tomcats restarted (all done in a graceful way). Nothing monotonically increasing as far as we have seen, no.

We are running a "weighted round robin" load-balancing algorithm, and the director will set weight 0 (= no traffic) on a backend once it has frozen (not responding to the director's status requests). Then the number of active connections sloooowly drops.
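To make "nothing monotonically increasing" easy to verify after the fact, the director's counters can be snapshotted periodically and the files diffed later. A minimal sketch (the log path and one-shot-from-cron cadence are assumptions, and it is guarded so it degrades where ipvsadm is absent):

```shell
# Append a timestamped IPVS counter snapshot; run this from cron every
# minute or so on the director, then diff successive snapshots to spot
# ActiveConn/InActConn or packet counts climbing without bound.
snapshot() {
    date
    if command -v ipvsadm >/dev/null 2>&1; then
        ipvsadm -L -n --stats
    else
        echo "ipvsadm not installed on this host"
    fi
}
snapshot >> /tmp/ipvs-stats.log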

Anything monotonically increasing on the realservers (eg
look with netstat to see if running out of ports or all
connections in FIN_WAIT)?

Netstat typically reports a normal number of connections (~ 100) in various states.

"Runnig out of ports" sounds interesting. It sure looks like we are running out of something. How would one go about detecting if we are running out of ports? Remeber, if looks like we are running out of something which causes applications to wait instead of bailing out with an error message. Would running out of ports cause applications to wait?

The cluster is hosting 14 websites, typically serving 1.5 million
requests on a normal day. As I said, the cluster may run fine for a week and
then suddenly the backends freeze. The funny thing is that both backends
usually freeze at roughly the same time.

presumably they've been evenly balanced.

Yes, all backends are equally weighted. This is not a real issue; we can handle all the traffic using one backend only (which we do from time to time).

I take it that this has something to do with LVS, ie you
don't get the same behaviour with a bare single server?

You are correct. As we are in the middle of the process of moving all our sites onto the LVS cluster, we still have sites running on our old single-server platform (same hardware and software as the backends of the cluster). Today the single server is serving about the same number of requests as the cluster.

The things I find most odd about this are 1) it affects both backends at the same time, and 2) applications do not die with an error message but hang. My conclusion is that we are running out of a resource which is consumed by the IP traffic from the director (or possibly the Tomcat cross-backend session-replication communication), and the resource is one you wait for (you don't bail out and complain).

The only cure we have come up with so far is to reboot the servers. Once
rebooted a server will run for days again. It has occasionally happened
that the second (frozen) server has recovered once the first server is rebooted.

do the realservers talk to each other (eg have a common disk)?

Well, no disk or file sharing. The Tomcats communicate HTTP session-replication data between backends. Our HTTP sessions contain very little data, so the traffic should not be that big.

Anyone out there having a good idea where to look for clues to what may
be wrong?
