Hi folks,
This is my first post here, so don't be cruel ;)
I have a simple 2-node heartbeat (version 1.2.3) topology that has been
working without failures for more than 2 years, so let me first congratulate
everyone who contributed to this nice piece of software.
It's a simple active-passive topology with an IP and just one service
takeover.
Yesterday we faced a problem where the passive node (node2) wrongly detected
that node1 had failed (80% probability due to CPU load, but we are still
checking connectivity) and started acquiring the resources (public IP and
service).
Node2 therefore held the public IP (thanks to its gratuitous ARP) and the
service. After the load decreased on node1, both nodes could "speak" again
and node2 realized that node1 was still alive. They started to move both the
IP and the service back to node1, but since node1 already had them running
(IP and service), it did not send a gratuitous ARP.
The problem was that, since node2 had sent a gratuitous ARP, the router
between the two nodes and the rest of the network kept the public IP bound
to node2's MAC, so it was impossible to access the service until the
router's ARP cache entry expired (after an hour and a half) and the router
refreshed the binding.
Our first action was to increase deadtime in heartbeat so that a momentary
load spike does not trigger the same issue again, but I'm afraid that's not
enough. I would like to ask whether it would be safe to add a feature to the
ip_start function of the IPaddr script so that every time it is called, no
matter whether the node already holds the IP, it sends the gratuitous ARP.
This way our problem would not happen again, because the router's ARP table
would be refreshed automatically whenever ip_start is called.
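To make the idea concrete, here is a rough sketch of the behaviour I have in
mind. It is NOT the real IPaddr code: the function names (announce_ip,
address_is_configured, ip_start_sketch) and the DRY_RUN switch are made up for
illustration, and I am assuming the iputils "arping" tool and the iproute2
"ip" tool are available instead of whatever helper the script really uses.
The address and interface in the usage comment are just examples.

```shell
#!/bin/sh
# Hypothetical sketch only; not taken from the heartbeat 1.2.3 IPaddr script.

# Send (or, with DRY_RUN=1, just print) an unsolicited ARP announcement.
# Assumes iputils arping: -U = unsolicited ARP, -I = interface, -c = count.
announce_ip() {
    ip_addr=$1
    iface=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "arping -U -I $iface -c 3 $ip_addr"
    else
        arping -U -I "$iface" -c 3 "$ip_addr"
    fi
}

# Return 0 if the address is already configured on the interface.
# Assumes the iproute2 "ip" command; any failure counts as "not configured".
address_is_configured() {
    ip -o -4 addr show dev "$2" 2>/dev/null | grep -qF "inet $1/"
}

# Stand-in for ip_start: bring the address up only if needed, but ALWAYS
# announce it, even when the node already holds the IP.
ip_start_sketch() {
    ip_addr=$1
    iface=$2
    if ! address_is_configured "$ip_addr" "$iface"; then
        # The real script would add the address here; omitted in this sketch.
        echo "would configure $ip_addr on $iface here"
    fi
    announce_ip "$ip_addr" "$iface"
}

# Example (dry run, nothing is sent):
#   DRY_RUN=1
#   ip_start_sketch 192.0.2.10 eth0
```

The only point of the sketch is that the announcement sits outside the
"already configured" check, so a node that still holds the IP would also
refresh the router's ARP entry.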
I know I have an older version of heartbeat, but it has worked pretty well
so far and I would prefer to stay with it because our topology is simple
enough (2 nodes, active-passive) to be properly managed by this version.
Thanks in advance,
Samuel Osorio.