Hi List!
We have set up a 4-node LVS cluster with 2 real nodes and redundant
directors using RH AS 2.1 and Piranha (yeah, yeah, I know). Everything
works fine until we migrate the director role from the primary to the
backup server.
Each director essentially has only one network interface, with all
LVS-related addresses configured as aliased interfaces. Let's call the
primary machine A and the backup machine B. The network configuration
looks something like this:
eth0   - public address of the machine
eth0:0 - public LVS address
eth0:1 - NAT router private address
In a failover event, the :0 and :1 interfaces migrate from machine A to
machine B and vice versa.
Now, this is where the fun starts. The real nodes have absolutely no
problem with the failover event, everything just keeps working fine. The
client machines are taken care of by the gratuitous ARPs sent by the
pulse daemon, so this keeps working, too.
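
For readers unfamiliar with the mechanism, here is a minimal sketch of
what a gratuitous ARP frame looks like on the wire. It is an
illustration only: whether pulse sends the request form (shown here) or
the reply form is an assumption I have not checked, and the MAC and IP
values are made-up example values, not our real addresses.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_gratuitous_arp(sender_mac: str, ip: str) -> bytes:
    """Build a gratuitous ARP request as a raw Ethernet frame.

    'Gratuitous' means sender IP and target IP are the same address,
    and the frame is broadcast so every host can update its ARP cache.
    """
    smac = mac_to_bytes(sender_mac)
    ip_bytes = socket.inet_aton(ip)
    # Ethernet header: destination, source, EtherType
    eth_header = BROADCAST_MAC + smac + struct.pack("!H", ETHERTYPE_ARP)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, op 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # sender MAC, sender IP, target MAC (zeroed), target IP (same as sender)
    arp += smac + ip_bytes + b"\x00" * 6 + ip_bytes
    return eth_header + arp


# Example values only (192.0.2.0/24 is a documentation range).
frame = build_gratuitous_arp("00:11:22:33:44:55", "192.0.2.10")
```

Such a frame only refreshes ARP caches at the moment of the failover; it
does not affect how later ARP requests get answered.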
*BUT*
If some client machine has to send a new ARP request, sometimes the
now-secondary machine answers it! Meaning: machine B is the director,
having taken over service from machine A, but both are still running.
This happens e.g. during maintenance. Machine B has both the :0 and :1
addresses; machine A no longer has them (verifiable with ifconfig).
Using tcpdump, we could see machine A still answering ARP requests for
the public LVS address, even though it is now assigned to machine B,
which *should* be answering. Huh?
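
In case someone wants to check a capture for the same symptom, here is a
rough sketch of the test we did by eye in tcpdump. It assumes you
already have the raw Ethernet frames as bytes (getting them out of a
capture file is not shown), and the function names are hypothetical
helpers, not part of any LVS or Piranha tool.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ARP_REPLY = 2


def format_mac(raw: bytes) -> str:
    """Render 6 raw bytes as 'aa:bb:cc:dd:ee:ff'."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_arp(frame: bytes):
    """Return the interesting ARP fields, or None if not Ethernet/IPv4 ARP."""
    if len(frame) < 42:
        return None
    if struct.unpack("!H", frame[12:14])[0] != ETHERTYPE_ARP:
        return None
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[14:22])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    return {
        "oper": oper,
        "sender_mac": format_mac(frame[22:28]),
        "sender_ip": socket.inet_ntoa(frame[28:32]),
        "target_ip": socket.inet_ntoa(frame[38:42]),
    }


def is_stale_reply(frame: bytes, vip: str, active_mac: str) -> bool:
    """True if the frame is an ARP reply claiming `vip` from a MAC other
    than the currently active director's MAC."""
    arp = parse_arp(frame)
    return (
        arp is not None
        and arp["oper"] == ARP_REPLY
        and arp["sender_ip"] == vip
        and arp["sender_mac"] != active_mac.lower()
    )
```

Here `vip` would be the public LVS address and `active_mac` the MAC of
eth0 on machine B; any frame flagged this way is a reply like the ones
we saw coming from machine A.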
Any ideas, anyone? If there is any info missing, please don't hesitate
to ask!
Dipl. Chem. Dr. Stephan Wonczak
Institut fuer Angewandte Informatik (ZAIK)
Regionales Rechenzentrum der Universitaet zu Koeln (RRZK)
Universitaet zu Koeln, Robert-Koch-Strasse 10, 50931 Koeln
Tel: ++49/(0)221/478-5577, Fax: ++49/(0)221/478-5590