Hello,
Every time I reboot the active node, it comes back as the backup, as expected,
but then it suddenly declares itself dead, says it has no local heartbeat
(???), and restarts. While it is restarting it happily declares the other node
dead as well and (I guess) starts taking over the resources, which causes
every connected client to disconnect.
Sounds like a timing issue. This is also a typical question for the
linux-ha mailing list, where people can usually give you appropriate
answers faster than here.
I also see that it says somewhere "Deadtime value may be too small", but in
normal production I don't see any 'late heartbeats' or the like, so I did
not change the values. My ha.cf:
udpport 694
logfacility local0
keepalive 75ms
deadtime 300ms
warntime 200ms
Your timings are absolutely crazy. This will only work in the lab. Also,
there's no point in having such a snappy system when deploying LVS,
especially if you configure template synchronisation.
http://www.linux-ha.org/ha.cf/DeadtimeDirective
http://www.linux-ha.org/FAQ#heavy_load
initdead 60
mcast eth1 224.1.2.3 694 1 0
auto_failback off
node rpzlvs05 rpzlvs06
My question is, should I really go experiment with the *time values again,
or is it something else?
In my opinion you should set those values to something more sane.
Also note that even though the kernel timer ticks at between 100Hz and
1000Hz, there is no guarantee user-space gets assigned 10ms-1ms slices,
especially during boot up, where we have a fork-bomb situation with all
the daemons starting and writing their stuff to the platter. Unless you
run a hard RT-enabled kernel, you get blocking I/O peaks in the hundreds
of milliseconds.
I would be surprised if setting a higher deadtime does not fix your
issues; then again, the experts are next door on the linux-ha mailing list.
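
To put rough numbers on the above, here is a minimal Python sketch of the
arithmetic. It only handles bare numbers (seconds) and the "ms" suffix. The
helper names are made up for illustration, the 500ms stall is an assumed
figure for a boot-time I/O peak, and the deadtime of 30 seconds is merely an
example of a more relaxed value, not something measured on your cluster.

```python
# Sketch: compare ha.cf timing values against an assumed scheduling/I/O stall.
# keepalive/deadtime come from the ha.cf quoted above; everything else is assumed.

def to_ms(value: str) -> float:
    """Convert an ha.cf time value to milliseconds (bare numbers are seconds)."""
    value = value.strip()
    if value.endswith("ms"):
        return float(value[:-2])
    return float(value) * 1000.0

def missed_beats_allowed(keepalive: str, deadtime: str) -> int:
    """Number of whole keepalive intervals that fit into deadtime."""
    return int(to_ms(deadtime) // to_ms(keepalive))

def survives_stall(deadtime: str, stall_ms: float) -> bool:
    """True if a stall of stall_ms milliseconds stays below deadtime."""
    return stall_ms < to_ms(deadtime)

if __name__ == "__main__":
    # Posted config: keepalive 75ms, deadtime 300ms.
    print(missed_beats_allowed("75ms", "300ms"))
    # Assumed 500 ms stall during boot against the posted deadtime.
    print(survives_stall("300ms", 500))
    # Same stall against an assumed, more relaxed deadtime of 30 seconds.
    print(survives_stall("30", 500))
```

With the posted values only four heartbeat intervals fit into deadtime, so a
single stall longer than 300ms is enough for a node to be declared dead.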
HTH and best regards,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc