
Re: heartbeat node taking over resources upon reboot

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: heartbeat node taking over resources upon reboot
From: Roberto Nibali <ratz@xxxxxxxxxxxx>
Date: Fri, 10 Nov 2006 19:55:00 +0100
Hello,

Every time I reboot the active node, it comes back as the backup as normal,
but then it suddenly declares itself dead, says it has no local heartbeat (???), and restarts. While it's restarting it happily declares the other node
dead as well and (I guess) starts taking over the resources, which
disconnects every connected client.

Sounds like a timing issue. This is also a typical question for the linux-ha mailing list, where people can usually give you appropriate answers faster than here.

I also see that it says somewhere "Deadtime value may be too small", but in
normal production I don't see any 'late heartbeats' or such, so I did
not change them. My ha.cf:

udpport 694
logfacility local0
keepalive 75ms
deadtime 300ms
warntime 200ms

Your timings are absolutely crazy. This will only work in the lab. Also, there's no point in having such a snappy system when deploying LVS, especially if you configure template synchronisation.

http://www.linux-ha.org/ha.cf/DeadtimeDirective
http://www.linux-ha.org/FAQ#heavy_load

initdead 60
mcast eth1 224.1.2.3 694 1 0
auto_failback off
node rpzlvs05 rpzlvs06

My question is, should I really go experiment with the *time values again,
or is it something else?

In my opinion you should set those values to something more sane. Also note that even though the kernel timer ticks at between 100 Hz and 1000 Hz, there is no guarantee that user space gets assigned 10 ms to 1 ms slices, especially during boot-up, where we have a fork-bomb situation with all the daemons starting and writing their shit on the platter. Unless you run a hard-RT-enabled kernel, you get blocking I/O peaks in the high hundreds of milliseconds.

I would be surprised if setting a higher deadtime did not fix your issues; then again, the experts are next door on the linux-ha mailing list.
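
As a rough sketch of what more relaxed timings could look like: the numbers below are illustrative assumptions in the order of seconds, not values tuned for this cluster, and every other directive is copied unchanged from the ha.cf quoted above. As far as I know, values without a unit suffix are read as seconds, and initdead should be at least twice deadtime.

```
# ha.cf sketch with timings in seconds instead of milliseconds.
# The keepalive, warntime and deadtime values are assumptions for
# illustration; tune them against the "late heartbeat" warnings in your logs.
udpport 694
logfacility local0
keepalive 2      # send a heartbeat every 2 seconds
warntime 10      # log a late-heartbeat warning after 10 seconds of silence
deadtime 30      # declare the peer dead after 30 seconds of silence
initdead 60      # grace period at startup, unchanged; twice the deadtime
mcast eth1 224.1.2.3 694 1 0
auto_failback off
node rpzlvs05 rpzlvs06
```

If failover this slow is not acceptable, one way to narrow it down is to lower deadtime step by step while keeping warntime well below it, and to stop as soon as late-heartbeat warnings start to show up under load or during boot.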

HTH and best regards,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
