Did you monitor swap usage? Is it growing out of control? Also, are you
using a modified ldirectord to monitor a custom service protocol?
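
Since the boxes freeze without leaving anything in the logs, one way to check swap is to record it to disk at short intervals, so the last sample before a freeze survives. Below is a minimal sketch, assuming a Linux /proc/meminfo with SwapTotal and SwapFree fields and a Python 3 interpreter on the director. The function names, the log path and the interval are illustrative choices, not anything from ultramonkey or ldirectord.

```python
import os
import time


def swap_used_kb(meminfo_text):
    """Return swap in use (kB), given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # The first token after the colon is the numeric value.
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


def log_swap(log_path="/var/tmp/swap.log", interval_s=10):
    """Append a timestamped swap sample every interval_s seconds.

    log_path and interval_s are hypothetical defaults; adjust as needed.
    """
    while True:
        with open("/proc/meminfo") as f:
            used = swap_used_kb(f.read())
        with open(log_path, "a") as log:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write("%s swap_used_kb=%d\n" % (stamp, used))
            # Force the sample to disk so it survives a hard freeze.
            log.flush()
            os.fsync(log.fileno())
        time.sleep(interval_s)
```

Calling log_swap() in the background on both directors and reading the tail of the log after the next freeze should show whether swap was climbing beforehand.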
On Tue, 2006-01-03 at 20:51, Bruce Richardson wrote:
> I have a legacy ultramonkey configuration in a production environment
> that is causing bizarre problems. 2 IBM servers running Debian Sarge
> with a 2.6 kernel (custom compiled 2.6.6 kernel), with both servers
> running both the syncmaster and syncbackup processes. Unfortunately,
> the person who set this up didn't leave a source deb or any notes about
> what they did. There are also slight version differences between some
> of the components on the two boxes (I know, it's a mess, I didn't create
> it) due to only one of the boxes having had the ultramonkey repository
> in sources.list.
>
> This pair has been used with one of them as a primary and the other only
> ever briefly taking charge. It seems (this is a set-up that I
> inherited) that the primary was failing every 3 or 4 months. The
> secondary would then fail if left in master mode for more than a week.
>
> To try and fix this mess, I spun up two vanilla Debian Sarge boxes with
> the latest ldirectord and heartbeat packages. When I used one of them to
> replace the secondary, it died only a few minutes after the primary
> failed over to it. It then died again shortly afterwards even on
> standby.
>
> When I say "die", I mean complete and immediate freeze with no
> indications in the logs and a frozen screen (if a console is connected
> at the time). Absolutely no indication of what might be the cause.
>
> I have similar director-pairs in other environments that cause no such
> problems. There are three main differences between those systems and
> this pair: the healthy systems use
>
> 1. Stock Debian 2.6.8 kernels and packages.
> 2. IPaddr2 rather than IPaddr
> 3. Connection syncing only in master->slave mode (as opposed to
> master->master) or simply not at all.
>
> My feeling with this is that the connection tracking/syncing is at the
> root of the problem, possibly the fact that it is doing master->master.
> The very speedy death of the vanilla Sarge box that I tried to put in as
> a secondary tends to reinforce this in my mind.
>
> Can anybody offer any thoughts?
--
Andrei Taranchenko <andrei@xxxxxxxxxxxxx>
TowerData