Re: Our LVS/DR backends freezes

To:	"Joseph Mack NA3T" <jmack@xxxxxxxx>, <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject:	Re: Our LVS/DR backends freezes
From:	Olle Östlund <olle@xxxxxxxxxxx>
Date:	Mon, 27 Nov 2006 23:04:16 +0100

Our LVS-backends typically freeze after 5-7 days of normal operation.
These freezes are not system-crashes, but it seems like all new
TCP-connections towards the servers will hang forever. It is impossible
to log on or perform an su (they will hang), but existing sessions will
function fine as long as you don't issue 'critical commands' (commands
which perform a tcp-connection?).

such as logging in remotely, but not from the console...?

Remote logon via ssh does not work (it hangs or gets a "connection closed by remote server" type of message). Logging in via the console does not work (it hangs after the password has been typed in), and using su produces the same result. Existing logged-in sessions continue to function, however. Commands like the various system status commands (top, netstat, vmstat, ...) work fine. Commands which are known to log via syslog (su, ssh, ...) will hang.
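
The pattern described here (commands that log via syslog hang, everything else works) could be checked directly from a session that is still alive. Below is a minimal sketch in Python, assuming syslogd listens on a Unix datagram socket at /dev/log. That is the usual sysklogd setup, but it is an assumption and has not been confirmed for these servers. The function name and the probe message are made up for illustration.

```python
import socket

def syslog_socket_accepts(path="/dev/log", timeout=2.0):
    """Return True if one datagram can be handed to the socket at `path`
    within `timeout` seconds, False on timeout or any other socket error.

    Assumption: the syslog socket is a SOCK_DGRAM Unix socket. On systems
    where it is a stream socket this probe reports False for that reason.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        # <14> is facility "user", severity "info" in the syslog priority encoding.
        s.send(b"<14>probe: syslog socket test")
        return True
    except OSError:
        # Covers timeouts (socket buffer full, nobody reading) and a missing socket.
        return False
    finally:
        s.close()
```

A False result on a frozen backend, with True on a healthy one, would point at syslogd no longer draining its socket. Note that a successful probe against the real /dev/log writes one line into the log.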

The syslogd stops writing the syslog,
etc. Looking at the server's activity using top reveals nothing abnormal
-- there is no swapping, cpu-usage is low, etc.

what about various outputs from ipvsadm on the director?
Anything monotonically increasing there (I know the problem
is on the realservers)?

The number of active connections tends to climb during the day and decrease at night. Each backend is "taken out of business" (but not rebooted) each night and the Tomcats are restarted (all done in a graceful way). Nothing monotonically increasing as far as we have seen, no.

We are running a "weighted round robin" load-balancing algorithm, and the director will set weight 0 (= no traffic) on a backend once it has frozen (not responding to the director's status requests). Then the number of active connections sloooowly drops.
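
To illustrate what weight 0 means for scheduling, here is a toy weighted round robin in Python. It is a simplification and not how the IPVS wrr scheduler is actually implemented. The backend names are hypothetical.

```python
from itertools import cycle

def wrr_schedule(weights):
    """Return an endless iterator of backend names.

    Each backend appears `weight` times per cycle, so a backend with
    weight 0 is never handed a new connection. Existing connections are
    not modelled here; on the real director they simply drain over time.
    """
    slots = [name for name, weight in weights.items() for _ in range(weight)]
    if not slots:
        raise ValueError("no backend with a positive weight")
    return cycle(slots)
```

With weights `{"backend1": 1, "backend2": 0}` every new connection goes to backend1, which matches the slowly dropping connection count on the frozen backend.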

Anything monotonically increasing on the realservers (eg
look with netstat to see if running out of ports or all
connections in FIN_WAIT)?
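
The two checks suggested here (connections piling up in one state, and local ports being used up) could be scripted roughly as follows. This is a sketch in Python that parses text in the format of /proc/net/tcp on Linux. The function names are made up for illustration, and the port range passed to `ports_in_range` is an assumption that would normally be read from /proc/sys/net/ipv4/ip_local_port_range.

```python
from collections import Counter

# State codes as they appear (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def _rows(proc_net_tcp_text):
    """Yield the whitespace-split fields of each socket line, skipping the header."""
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 4:
            yield fields

def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state; unknown codes are kept as raw hex."""
    counts = Counter()
    for fields in _rows(proc_net_tcp_text):
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts

def ports_in_range(proc_net_tcp_text, low, high):
    """Count distinct local ports that fall inside [low, high]."""
    ports = set()
    for fields in _rows(proc_net_tcp_text):
        port = int(fields[1].rsplit(":", 1)[1], 16)  # local_address is HEXIP:HEXPORT
        if low <= port <= high:
            ports.add(port)
    return len(ports)
```

Feeding the functions the contents of /proc/net/tcp at intervals and logging the results would show whether any state count, or the number of ports in use, keeps growing in the days before a freeze.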

Netstat typically reports a normal number of connections (~100) in various states.

"Running out of ports" sounds interesting. It sure looks like we are running out of something. How would one go about detecting if we are running out of ports? Remember, it looks like we are running out of something which causes applications to wait instead of bailing out with an error message. Would running out of ports cause applications to wait?

The cluster is hosting 14 websites, typically serving 1.5 million
requests on a normal day. As I said, the cluster may run fine for a week and
then suddenly the backends freeze. The funny thing is that both backends
usually freeze at roughly the same time.

presumably they've been evenly balanced.

Yes, all backends are equally weighted. This is not a real issue; we can handle all the traffic using one backend only (which we do from time to time).

I take it that this has something to do with LVS, ie you
don't get the same behaviour with a bare single server?

You are correct. As we are in the middle of the process of moving all our sites onto the LVS-cluster, we still have sites running on our old single-server platform (same hardware and software as the backends of the cluster). Today the single-server is serving about the same number of requests as the cluster.

The things I find most odd about this are 1) it affects both backends at the same time, and 2) applications do not die with an error message but hang. My conclusion is that we are running out of a resource which is consumed by the IP-traffic from the director (or possibly the Tomcat cross-backend session-replication communication), and the resource is one you wait for (you don't bail out and complain).

The only cure we have come up with so far is to reboot the servers. Once
rebooted, a server will run for days again. It has occasionally happened
that the second (frozen) server has recovered once the first server is
rebooted.

do the realservers talk to each other (eg have a common disk
system)?

Well, no disk- or file-sharing. The Tomcats are communicating HTTP session-replication data between backends. Our HTTP-sessions contain very little data, so the traffic should not be that big.

Anyone out there having a good idea where to look for clues to what may
be wrong?
