Our LVS backends typically freeze after 5-7 days of normal operation.
These freezes are not system crashes; rather, all new TCP connections
towards the servers hang forever. It is impossible to log on or perform
an su (both hang), but existing sessions keep working as long as you
avoid 'critical commands' (commands which open a TCP connection?).
syslogd stops writing to the syslog, etc. Watching the servers' activity
with top reveals nothing abnormal -- there is no swapping, CPU usage is
low, and so on.
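In case it helps, this is a minimal sketch of what we could capture from
an already-open session the next time a freeze happens. It only reads
local /proc files (plus dmesg), so it should not open any new TCP
connections itself:

```shell
# Sketch: state worth capturing from an existing session during a freeze.
cat /proc/net/sockstat        # sockets in use, orphaned, time-wait, TCP memory
cat /proc/net/tcp | wc -l     # number of TCP sockets (all states) on this box
dmesg 2>/dev/null | tail -n 30  # recent kernel warnings: oom-killer, "table full", etc.
```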
Background on our system
We are running an LVS/DR setup with two frontend servers (one master
director and one heartbeat failover director) and two backend servers,
all based on HP ProLiant hardware running SUSE Linux Enterprise Server 9.
Each backend runs two Tomcat instances, each listening on an
instance-specific IP interface (i.e. the cluster has two interfaces --
one for Tomcat instance 1 and another for Tomcat instance 2). The
frontend load balancer uses the direct-routing method. Everything
LVS-related is built upon the standard packages that ship with SLES9.
The cluster hosts 14 websites, typically serving 1.5 million requests on
a normal day. As I said, the cluster may run fine for a week and then the
backends suddenly freeze. The odd thing is that both backends usually
freeze at roughly the same time.
The only cure we have come up with so far is to reboot the servers. Once
rebooted, a server will run for days again. It has occasionally happened
that the second (frozen) server recovered once the first server was
rebooted.
Does anyone out there have a good idea where to look for clues to what
may be wrong? We have tried to identify an undersized TCP-related
resource, which we suspect is causing our problems, but we haven't found
a good candidate yet.
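For what it's worth, these are the limits we have been eyeballing -- a
sketch using standard 2.6-kernel sysctl paths (the ip_conntrack entries
only exist when the connection-tracking module is loaded, hence the
2>/dev/null):

```shell
# Sketch: TCP-related limits that could plausibly be exhausted.
cat /proc/sys/fs/file-nr                    # allocated / free / max file handles
cat /proc/sys/net/ipv4/tcp_max_syn_backlog  # queue for not-yet-accepted connections
cat /proc/sys/net/ipv4/tcp_mem              # pages the TCP stack may use (low/pressure/high)
# Connection-tracking table, only present if ip_conntrack is loaded:
cat /proc/sys/net/ipv4/ip_conntrack_max 2>/dev/null
cat /proc/net/ip_conntrack 2>/dev/null | wc -l   # currently tracked connections
```

If the tracked-connection count sits near ip_conntrack_max when the
freeze hits, that would match the "new connections hang, old ones work"
symptom.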