Peter Mueller <pmueller@xxxxxxxxxxxx> writes:
> > 2am on Saturday, the primary failed. The (serial) console wasn't
> >responding to anything but sysrq, and then only to reboot. (the
> >heartbeat didn't properly failover, but that's another story). The
> >primary was restarted, but failed again at 8:30am, at which point the
> >secondary took over... which then failed the same way at 2pm on
> >Sunday - back to primary, failed again 1:30am, and 2:30am Monday.
>
> what constitutes a failure? are there any logs or oops's etc. with
> relevant information?
>
Nope - it just hung! Nothing in any log files, nothing on the console.
> > Kernel 2.4.18, with LVS kernel patch 1.0.2.
>
> you might want to consider a 2.4.19-pre. check the changelog for
> things of interest to you..
>
I'll have a look...
> > Any thoughts? The logs show nothing interesting, the failures
> >weren't at highly loaded times (2am sees very little traffic), and
> >one of the failures was only an hour after one of the previous ones.
>
> what was the heartbeat problem? although your situation seems
> 'coincidental', are you certain it isn't hardware? you don't indicate
> what has happened since Monday...
Since Monday everything has been working fine. I'm not sure what the
heartbeat problem was: this was the first "real" failover situation,
and it had worked fine in testing (as always). Unfortunately I wasn't
around to look at it; heartbeat was restarted on the standby, which
then took over. All subsequent failovers over the weekend worked fine,
which is why I'm discounting it as relevant.
Certainly, barring a nasty coincidence, it looks like it was the
daemon causing it.
Chris