Hello,
With something clear to
fix, we installed the latest version of keepalived on the latest RHEL4
kernel.
And lo, nothing changed.
lo also does arp probes.
When we added a new host, it became a Fatal Attractor
within 6 minutes of operation (note that this is NOT the Thundering
Herd problem; things were relatively well balanced for a minute or 6).
arp cache? Did you run an ip -s -s route show cache; ip neigh show, ...?
Stranger yet, ipvsadm on the director revealed that the Attractor was
getting NO hits. So it wasn't that the LVS was sending all hits to one
machine. You guessed it. The new machine was arping for the shared ip,
and connections were coming directly to it.
;).
We had arptables set up as follows:
*filter
:IN ACCEPT [0:0]
:OUT ACCEPT [0:0]
-A IN -d 192.168.0.12 -j DROP
COMMIT
And in desperation, started arptables at runlevel 1. This didn't help,
because it wasn't responding to an inbound arp request, but was
instead generating its OWN arp request, and broadcasting the response
it made to itself.
This could be seen with:
tcpdump -i any arp > file
And then pawing through the file for the shared ip (name). So there
lies the smoking gun. Arptables was NOT working as advertised. So we
added:
-A OUT -d 192.168.0.12 -j mangle --mangle-ip-s 192.168.0.104
???
This still did not do the trick; apparently arptables implicitly
operates on the interface owning the ip (lo:1, in our case), if no
interface is specified. That left eth0 leaking arps.
Specifying the interface did the trick:
-A OUT -s 192.168.0.12 -o eth0 -j mangle --mangle-ip-s 192.168.0.104
And here is the whole filter:
*filter
:IN ACCEPT [0:0]
:OUT ACCEPT [0:0]
-A IN -d 192.168.0.12 -j DROP
-A OUT -s 192.168.0.12 -o eth0 -j mangle --mangle-ip-s 192.168.0.104
COMMIT
Why the heck do people go through such pain when there is proper
kernel support in proc-fs for arp probe handling?
arps are now properly squelched, and fatal attractor behavior has vanished.
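One way to re-check this, along the lines of the tcpdump capture above, is
to filter the capture for the shared ip and see which hosts still mention
it. This is only a sketch: the function name vip_arps is made up for
illustration, and the address is the shared ip from the filter above.

```shell
# Shared ip taken from the arptables rules above.
VIP=192.168.0.12

# Hypothetical helper: read tcpdump text output on stdin and keep only
# the lines that mention the shared ip.
vip_arps() {
    # -F: treat the address as a fixed string, so the dots are literal.
    # -w: whole-word match, so 192.168.0.120 is not mistaken for the VIP.
    grep -F -w -- "$VIP"
}

# Typical use on a realserver (needs root, so left as a comment):
#   tcpdump -l -n -i any arp | vip_arps
```

If the realserver keeps showing up as the sender on lines naming the shared
ip, arps are still leaking from it.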
I'm posting this because I longed for google to return such a message in
response to many searches.
I'm sorry you had to go through all this trouble, but what kept you from
using the standard arp_* settings in proc-fs? I'm just interested whether
it's a documentation issue within the LVS project or something else.
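For reference, a rough sketch of the arp_* settings meant here, assuming
the kernel in use exposes arp_ignore and arp_announce under
/proc/sys/net/ipv4/conf/. The function name arp_sysctl_lines and the
choice of eth0 are assumptions for illustration. The values 1 and 2 are
the ones commonly used on realservers that hold the shared ip on lo.

```shell
# Hypothetical helper: print sysctl.conf lines for one interface so they
# can be reviewed before being applied.
arp_sysctl_lines() {
    iface="$1"
    # The kernel takes the larger of conf/all and conf/<iface>,
    # so both scopes are set.
    for scope in all "$iface"; do
        # arp_ignore=1: reply only if the target ip is configured
        # on the interface the request arrived on.
        printf 'net.ipv4.conf.%s.arp_ignore = 1\n' "$scope"
        # arp_announce=2: use the best local source address for the
        # target, not the shared ip held on lo.
        printf 'net.ipv4.conf.%s.arp_announce = 2\n' "$scope"
    done
}

# Typical use (needs root, so left as a comment):
#   arp_sysctl_lines eth0 >> /etc/sysctl.conf && sysctl -p
```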
Regards,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc