Re: Redirector project for FreeBSD

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: Redirector project for FreeBSD
Cc: srompf@xxxxxx
From: "Alexandre CASSEN" <alexandre.cassen@xxxxxxxxxxxxxx>
Date: Fri, 29 Mar 2002 16:19:01 +0100
>> This side effect is very noisy and can sometimes be
>> worked around with some algorithmic logic, but some
>> side effects remain (for me, "sync_instance" without link state
>> reporting introduces a noisy loop into the state machine :/).
>What do you mean by noisy loop? For me this is a nop in defined time

In short:
the situation I describe in the PDF file on VRRP (on cheetah) explains the
"synchronization_instance" mechanism. VRRP is an interface-specific
protocol, and this mechanism is useful for us with LVS because the routing
path must be preserved during takeover.

In detail:
In an LVS-NAT environment using VRRP (LVS1 & LVS2), the realservers'
default GW is the eth1 IP of LVS1 and the virtual services are exposed to
the Internet on eth0 of LVS1. All is working => all inbound/outbound
traffic goes through LVS1. But, but, but: if LVS1(eth1) fails, all
associated IP addresses become unavailable there and are taken over by the
VRRP BACKUP LVS2(eth1) => outbound traffic will go through LVS2(eth1) and
inbound through LVS1(eth0) => this asymmetric routing will break the LVS
environment.

=> So in our VRRP software we must add a "sync_instance" capability to
protect against that (a VRRP extension for our specific needs). This
functionality works using the axiom: if LVS1(eth1) fails, then LVS2(eth1)
takes over and LVS2(eth0) becomes MASTER (to own the IP addresses).

=> This is the need. If we now use this VRRP extension and we are not able
to detect link state, we introduce the "noisy loop" => this means:

In init state : LVS1(eth1) = MASTER, LVS1(eth0) = MASTER
                LVS2(eth1) = BACKUP, LVS2(eth0) = BACKUP
                LVS1 & LVS2 are using our axiom "sync_instance"

Now, for some reason, someone unplugs the wire on LVS2(eth1). So the VRRP
instance on LVS2(eth1) will time out and become MASTER. The axiom will then
force LVS2(eth0) to transition to MASTER, so LVS1(eth0) will be BACKUP. But
the axiom will force the symmetric transition on LVS1(eth1) (because all is
going fine on LVS1(eth1)) => so LVS1(eth1) will become BACKUP. Here is the
loop => LVS1(eth1) will time out waiting for remote VRRP adverts since the
wire is still unplugged! So it transitions to MASTER, then forces
LVS1(eth0) to MASTER and so LVS2(eth0) to BACKUP, and finally LVS2(eth1) to
BACKUP....... and an infinite protocol loop :).... grrr...
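
The loop described above can be reproduced with a small toy simulation.
This is only a sketch under stated assumptions, not the keepalived code:
the function name simulate and the dictionary layout are made up for
illustration, the eth1 segment is assumed to stay cut for the whole run,
and the peer on the healthy eth0 segment is assumed to back off as soon
as the other side claims MASTER.

```python
MASTER, BACKUP = "MASTER", "BACKUP"
PEER = {"LVS1": "LVS2", "LVS2": "LVS1"}


def simulate(rounds):
    """Replay the sync_instance flip-flop with no link-state detection.

    Hypothetical model: state[(node, iface)] holds the VRRP state, the
    eth1 wire is unplugged (no adverts cross it), eth0 is healthy.
    """
    state = {
        ("LVS1", "eth1"): MASTER, ("LVS1", "eth0"): MASTER,
        ("LVS2", "eth1"): BACKUP, ("LVS2", "eth0"): BACKUP,
    }
    history = [dict(state)]
    for _ in range(rounds):
        # 1. The BACKUP on the cut eth1 segment hears no adverts: timeout.
        node = next(n for n in ("LVS1", "LVS2")
                    if state[(n, "eth1")] == BACKUP)
        state[(node, "eth1")] = MASTER
        # 2. sync_instance drags the same node's eth0 instance to MASTER.
        state[(node, "eth0")] = MASTER
        # 3. eth0 works, so the peer sees the takeover and becomes BACKUP.
        peer = PEER[node]
        state[(peer, "eth0")] = BACKUP
        # 4. sync_instance on the peer forces its eth1 to BACKUP as well,
        #    which sets up the next timeout on the cut segment.
        state[(peer, "eth1")] = BACKUP
        history.append(dict(state))
    return history


if __name__ == "__main__":
    for i, snapshot in enumerate(simulate(4)):
        print(i, sorted(snapshot.items()))
```

Every second snapshot is identical to the init state, which is the
oscillation: nothing in the protocol itself ever stops it.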

The only way to break this noisy loop is to introduce a low-level MII
checker for probing the physical state of the NIC.

>> => VRRP RFC spec must be completed with a FAULT state driven according to
>> the NIC availability.
>Ugh, how is this possible? Do I understand you correctly that you would
>like to put in a policy for handling FAULT state that every NIC driver
>then must be able to handle?

no no :) my poor english again :) => This is just an extension of the VRRP
code to add a new state into the state machine, driven by the NIC link
beat state => a VRRP instance in MASTER or BACKUP will transition to
FAULT_STATE if MII reports bad things... and stay stuck there until MII is
OK.

Only in the VRRP code => FAULT_STATE is specific to the VRRP FSM and is
updated from the low-level MII register values.
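
As a rough illustration of that extra state, here is a minimal sketch of
the transition rule. The names State and next_state are hypothetical, and
re-entering as BACKUP once the link comes back is my assumption, the
mail only says the instance stays in FAULT_STATE until MII is OK.

```python
from enum import Enum


class State(Enum):
    BACKUP = "BACKUP"
    MASTER = "MASTER"
    FAULT = "FAULT"


def next_state(current, link_ok, advert_timeout=False):
    """Return the next VRRP state for one instance.

    link_ok        -- what the MII checker reports for the NIC
    advert_timeout -- True when no remote advert arrived in time
    """
    if not link_ok:
        # MASTER or BACKUP goes to FAULT, and FAULT sticks while MII is bad.
        return State.FAULT
    if current is State.FAULT:
        # Assumption: a recovered instance restarts as BACKUP.
        return State.BACKUP
    if current is State.BACKUP and advert_timeout:
        # Normal VRRP behaviour: a silent master means we take over.
        return State.MASTER
    return current
```

The point is the first branch: an instance whose own wire is unplugged
goes to FAULT instead of timing out into MASTER, so it never triggers
the sync_instance flip-flop.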

> When we develop userspace network software like BGP, VRRP, ... (zebra in
> short), what is the best way to handle NIC state activity and state
> transitions?
>I've the same questions especially since `ip link set dev eth0 down`
>should trigger the failover but of course it doesn't since the PHY
>register isn't updated, only the routing.

Yes, we agree :) Two events drive our VRRP FAULT_STATE:

2. MII registers values.

>> For me the ideal solution would be a netlink broadcast message on
>> IFF_RUNNING validity. Nice, but it needs a patched/rebuilt kernel, and we
>> need to wait for the patch to be integrated officially.... But for me it
>> is the final functionality for NIC state notification.
>Well, patching the kernel is not as bad as it sounds, LVS has done it
>with success for years now. No one is complaining, except maybe Joe and
>me ;). That's why I would like to see Stefan's patch clean and
>completely independent.

Yes, but we can handle it in userspace in our VRRP framework until it is
added to the stable kernel branches, to keep compatibility.

>> If we look at the MII code, MII is present on most NICs, so monitoring
>> the MII registers is IMHO the right way to handle NIC state
>> notification. The MII transceiver can be probed from userspace using a
>> specific ioctl, SIOCGMIIPHY. This userspace tool can be generic and
>> portable to other kernels, to support this for kernel 2.2 users.
>This is not at all implemented on all NICs but you could make a tradeoff
>which would probably address 95% of the people which would deploy
>keepalived/vrrpd: Take the 3-5 most common NICs and add support there.
>You might want to check the status of pollable SIOCGMIIPHY && getting
>the right information of various NICs from Jeff Garzik.

Yes, I have tested Garzik's ethtool and it is not working properly on most
of the MII-enabled NICs :/
So I am starting with the Donald Becker code, which is generic and working.

But the use of an MII-enabled NIC can be a requirement for running VRRP. If
people are warned, it can be acceptable.
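
For reference, a userspace MII link probe on Linux can look roughly like
the sketch below. It is not the Donald Becker code and not what
keepalived ships. It assumes a driver that implements the SIOCGMIIPHY and
SIOCGMIIREG ioctls, a 64-bit struct ifreq layout (16-byte name plus
union), and enough privileges to issue the ioctl. The helper names
link_up_from_bmsr, _mii_ioctl and mii_link_up are made up for the
example.

```python
import fcntl
import socket
import struct

SIOCGMIIPHY = 0x8947   # ask the driver which PHY address is in use
SIOCGMIIREG = 0x8948   # read one PHY register
MII_BMSR = 0x01        # basic mode status register
BMSR_LSTATUS = 0x0004  # link status bit in the BMSR

# ifreq name, then phy_id, reg_num, val_in, val_out (u16 each), then padding
IFREQ_MII = "16sHHHH16x"


def link_up_from_bmsr(bmsr):
    """Decode the link status bit of a BMSR value."""
    return bool(bmsr & BMSR_LSTATUS)


def _mii_ioctl(sock, ifname, request, phy_id, reg_num):
    """Issue one MII ioctl, return (phy_id, reg_num, val_in, val_out)."""
    req = struct.pack(IFREQ_MII, ifname.encode(), phy_id, reg_num, 0, 0)
    res = fcntl.ioctl(sock.fileno(), request, req)
    return struct.unpack(IFREQ_MII, res)[1:]


def mii_link_up(ifname):
    """Return True if the PHY behind ifname reports link; may raise OSError."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        phy_id, _, _, _ = _mii_ioctl(sock, ifname, SIOCGMIIPHY, 0, 0)
        # The link bit is latched low, so read the BMSR twice and keep
        # the second value to get the current state.
        _mii_ioctl(sock, ifname, SIOCGMIIREG, phy_id, MII_BMSR)
        _, _, _, bmsr = _mii_ioctl(sock, ifname, SIOCGMIIREG, phy_id, MII_BMSR)
    return link_up_from_bmsr(bmsr)
```

A caller would poll mii_link_up("eth1") on a timer and treat an OSError
as "this NIC has no usable MII", which is where the warning to users
comes in.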

>Well, you've got Easter time now. Send your wife to some nice holiday
>trip and start coding.

:) I really do not know if she will agree :))) : "Wait darling this is for
MII register....blabla :))))"

>> => Will try today :) or this week end
>You can contact me offline about the status of your development.

Ok thanks, I have started today with a netlink poller over IFF_UP/DOWN.
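
A netlink listener for link flag changes can be sketched as follows. This
is an illustration only, not the poller mentioned above: it assumes a
Linux kernel that broadcasts RTM_NEWLINK messages on the RTMGRP_LINK
group when interface flags change, and the function names
parse_link_events and watch_links are hypothetical.

```python
import socket
import struct

RTMGRP_LINK = 1      # multicast group for link events
RTM_NEWLINK = 16     # message type carrying interface info
IFF_UP = 0x1         # interface is administratively up
IFF_RUNNING = 0x40   # driver reports the interface as operational

NLMSGHDR = struct.Struct("=IHHII")    # len, type, flags, seq, pid
IFINFOMSG = struct.Struct("=BxHiII")  # family, pad, type, index, flags, change


def parse_link_events(data):
    """Return [(ifindex, is_up, is_running)] for each RTM_NEWLINK in data."""
    events = []
    offset = 0
    while offset + NLMSGHDR.size <= len(data):
        length, msg_type, _flags, _seq, _pid = NLMSGHDR.unpack_from(data, offset)
        if length < NLMSGHDR.size or offset + length > len(data):
            break  # truncated or malformed message, stop parsing
        if msg_type == RTM_NEWLINK and length >= NLMSGHDR.size + IFINFOMSG.size:
            _family, _type, index, flags, _change = IFINFOMSG.unpack_from(
                data, offset + NLMSGHDR.size)
            events.append((index, bool(flags & IFF_UP), bool(flags & IFF_RUNNING)))
        offset += (length + 3) & ~3  # netlink messages are 4-byte aligned
    return events


def watch_links():
    """Block forever, printing link events as they arrive."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    sock.bind((0, RTMGRP_LINK))
    while True:
        for event in parse_link_events(sock.recv(65535)):
            print(event)
```

A VRRP daemon would feed each event into its FSM instead of printing it,
for example moving the instance bound to that ifindex to FAULT_STATE
when is_up is False.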

>> :) I need to obtain agreements from my employer before... Is there a
>Give me his phone number and I talk to him.


> up deadline ? signup is needed for OLS passport ?
>There is no explicit deadline but OLS is(was?) _the_ linux kernel hacker
>event. It is no tradeshow, pure technical talks and BOFs. You certainly
>don't wanna miss it, plus since in 2000 Jerome Etienne was a speaker
>there and talked about ARPsec and VRRP. You could show up and tell the
>people how much you've improved the code and the framework :)

Yes :) not really easy for me. And if Jerome Etienne has not replied to my
email I assume he will drop me... And really don't like that kind of

c u later,
