Dear all,
On Friday, I enabled the synchronisation daemon on our two-machine
director cluster. We're using heartbeat to fail over from the active
one to the standby.
At 2am on Saturday, the primary failed. The (serial) console wasn't
responding to anything but sysrq, and then only to reboot. (Heartbeat
didn't fail over properly, but that's another story.) The primary was
restarted, but failed again at 8:30am, at which point the secondary
took over... which then failed the same way at 2pm on Sunday. Control
went back to the primary, which failed again at 1:30am and 2:30am on
Monday.
I was away over the weekend and realised something was wrong from all
the mon alerts to my mobile phone :-( - it's all fairly new, but had
happily run for most of last week, so I realised it was the daemon
stuff that I'd put in on Friday...
So today, I'm back at work, with a large number of doughnuts for the
guys who were on call over the weekend, investigating what went wrong.
Running:
Kernel 2.4.18, with LVS kernel patch 1.0.2.
Debian Woody (up to date)
ipvsadm: 1.20release6-2 (from the debian package)
Using LVS-DR to route web (and mail, irc, and https) traffic to two
realservers.
Any thoughts? The logs show nothing interesting, the failures weren't
at highly loaded times (2am sees very little traffic), and one of the
failures was only an hour after one of the previous ones.
The failures also only occurred on the master daemon - the standby had
exactly the same rules, and was receiving the sync data, but stayed up.
Thanks in advance
Chris