Re: Redirector project for FreeBSD

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: Redirector project for FreeBSD
Cc: srompf@xxxxxx
From: Roberto Nibali <ratz@xxxxxxxxxxxx>
Date: Fri, 29 Mar 2002 15:09:35 +0100

Hello Alexandre,

> La pêche ? :)

Qui pêche avec ratz pêche honorablement!

> a :) nice, and exactly what is needed for Linux! When developing
> networking code such as zebra, VRRP, ... link state monitoring is really
> needed and must be included in the state machine code so as not to disturb
> the network protocol's native functionality.

the problem is that not all NICs support this phy->state checking.

> This means that without NIC link state/hardware state monitoring it can
> introduce a side effect in many protocols. For example in VRRP, the
> instance in BACKUP state <=> waiting for remote MASTER adverts. If we
> unplug the BACKUP NIC while the MASTER is still active, the BACKUP will not
> receive MASTER adverts and will deduce that the MASTER is down... so it
> will transition to MASTER state ....

This is a common problem with HA frameworks. Either make use of a STONITH capability or try to write intelligent server/client software :). By intelligent I mean that you implement this phy->state checking in both the MASTER and the BACKUP state. If you unplug the network cable and the situation you described occurs, make it the policy that the BACKUP always goes up if it can (provided its own phy->state is not down as well) and that the MASTER always shuts down, independent of the phy->state. The MASTER becomes active again once it gets at least two solicitation link beats from the BACKUP (now MASTER) telling it that everything is OK again (the idiot who removed the cable has had his coffee by now and decided that the cable was vital to getting the $2M e-commerce project online).
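
The policy described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names (`struct node`, `on_failover_suspected`, `on_peer_beat`, `BEATS_TO_REJOIN`) are made up for the example, a "solicitation link beat" is modelled as a plain function call rather than a real packet, and the step where the new MASTER hands the role back is not modelled.

```c
#include <stdbool.h>

#define BEATS_TO_REJOIN 2       /* "at least two" beats (hypothetical name) */

enum role { ROLE_BACKUP, ROLE_MASTER, ROLE_DOWN };

struct node {
        enum role role;
        int beats;              /* beats seen since this node went down */
};

/* Called on a node when the failover situation is suspected.
 * phy_up is that node's own phy->state. */
void on_failover_suspected(struct node *n, bool phy_up)
{
        if (n->role == ROLE_MASTER) {
                /* the MASTER always shuts down, whatever its phy->state says */
                n->role = ROLE_DOWN;
                n->beats = 0;
        } else if (n->role == ROLE_BACKUP && phy_up) {
                /* the BACKUP goes up only if its own link is alive */
                n->role = ROLE_MASTER;
        }
}

/* Called on the old MASTER for every beat received from the new MASTER. */
void on_peer_beat(struct node *n)
{
        if (n->role == ROLE_DOWN && ++n->beats >= BEATS_TO_REJOIN) {
                n->role = ROLE_MASTER;
                n->beats = 0;
        }
}
```

Requiring more than one beat before rejoining is a simple form of hysteresis: a single stray packet on a flapping link is not enough to bring the old MASTER back.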

> This side effect is very noisy and can be sometime worked around with some
> logics algo but still some side effect (for me "sync_instance" without link
> state reporting introduce a noisy loop into the state machine :/).

What do you mean by noisy loop? For me this is a nop in defined time intervals.

> Link state like MII-register monitoring is really needed and must be used
> in routing soft (like zebra :) as Stefan mention in his LKML mail).

Yes. But as you can read in the follow-up emails of that thread, people tend to disagree to a certain extent. I suggest that Stefan adjust the rest of the patch so that it is completely independent (2 ifdef parts are missing) and doesn't change the semantics if one chooses not to enable CONFIG_LINKWATCH. The parts are:


--- linux-2.4.18ac2/net/core/dev.c      Wed Mar 27 00:06:54 2002
+++ linux-2.4.18ac2-stefan/net/core/dev.c       Wed Mar 27 00:32:17 2002
@@ -812,7 +818,7 @@
         *      Device is now down.

-       dev->flags &= ~IFF_UP;
+       dev->flags &= ~(IFF_UP | IFF_RUNNING);


--- linux-2.4.18ac2/net/core/Makefile   Wed Mar 27 00:06:54 2002
+++ linux-2.4.18ac2-stefan/net/core/Makefile    Mon Mar 25 21:54:26 2002
@@ -27,4 +27,6 @@
 obj-$(CONFIG_NET_DIVERT) += dv.o
 obj-$(CONFIG_NET_PROFILE) += profile.o

+obj-$(CONFIG_LINKWATCH) += link_watch.o
 include $(TOPDIR)/Rules.make


> => The VRRP RFC spec must be completed with a FAULT state driven by
> the NIC availability.

Ugh, how is this possible? Do I understand you correctly that you would like to put in a policy for handling FAULT state that every NIC driver then must be able to handle?

> => This FAULT_STATE is really needed in the VRRP FSM => It places the VRRP
> instance in a "waiting for advert" state without the timeout handling for
> BACKUP_STATE. That way we are clean and effective :)


> When we develop a network software userspace like BGP, VRRP, ... (zebra in
> short) what is the best way to handle NIC state activity and state
> transition ?

I have the same questions, especially since `ip link set dev eth0 down` should trigger the failover, but of course it doesn't, since only the routing is updated and not the PHY register.

> For me the ideal solution would be a netlink broadcast message on
> IFF_RUNNING validity. Nice, but it needs a patched/rebuilt kernel. And we
> need to wait for official integration of the patch.... But for me it is the
> final wanted functionality for NIC state notification.

Well, patching the kernel is not as bad as it sounds; LVS has done it with success for years now. No one is complaining, except maybe Joe and me ;). That's why I would like to see Stefan's patch clean and completely independent.
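
On the userspace side, a listener for such a netlink notification could look roughly like the sketch below. It assumes a kernel that sends an RTM_NEWLINK message to the RTMGRP_LINK multicast group whenever IFF_RUNNING changes, which is what the patch under discussion is after. `RTMGRP_LINK`, `RTM_NEWLINK` and `struct ifinfomsg` are the regular rtnetlink definitions; the function names `open_link_socket`, `link_running` and `watch_links` are invented for this example.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Open a netlink socket subscribed to link notifications; -1 on error. */
int open_link_socket(void)
{
        struct sockaddr_nl sa;
        int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

        if (s < 0)
                return -1;
        memset(&sa, 0, sizeof(sa));
        sa.nl_family = AF_NETLINK;
        sa.nl_groups = RTMGRP_LINK;
        if (bind(s, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
                close(s);
                return -1;
        }
        return s;
}

/* Returns 1 if the message reports IFF_RUNNING set, 0 if cleared,
 * -1 if it is not a usable RTM_NEWLINK message. */
int link_running(const struct nlmsghdr *nh, int *ifindex)
{
        const struct ifinfomsg *ifi;

        if (nh->nlmsg_type != RTM_NEWLINK ||
            nh->nlmsg_len < NLMSG_LENGTH(sizeof(*ifi)))
                return -1;
        ifi = (const struct ifinfomsg *)NLMSG_DATA(nh);
        *ifindex = ifi->ifi_index;
        return (ifi->ifi_flags & IFF_RUNNING) ? 1 : 0;
}

/* Block on the socket and print every link state report. */
void watch_links(int s)
{
        struct nlmsghdr buf[8192 / sizeof(struct nlmsghdr)];
        struct nlmsghdr *nh;
        int len, ifindex, running;

        while ((len = recv(s, buf, sizeof(buf), 0)) > 0) {
                for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
                        running = link_running(nh, &ifindex);
                        if (running >= 0)
                                printf("ifindex %d: link %s\n", ifindex,
                                       running ? "running" : "not running");
                }
        }
}
```

A VRRP daemon would call `link_running()` from its event loop instead of printing, and feed the result into its state machine.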

> If we look at the MII code, MII is present on most NICs, so monitoring
> MII registers is the right way IMHO to handle NIC state notification. The
> MII transceiver can be probed from userspace using the SIOCGMIIPHY ioctl.
> This userspace tool can be generic and portable to other
> kernels to permit support of this for kernel 2.2 users.

This is by no means implemented on all NICs, but you could make a tradeoff that would probably cover 95% of the people who would deploy keepalived/vrrpd: take the 3-5 most common NICs and add support there. You might want to check with Jeff Garzik on the status of pollable SIOCGMIIPHY and on getting the right information for the various NICs.

> IMHO MII polling (in the VRRP code) can be done through a MII probe before
> each VRRP advert is sent. That way the soft will monitor the MII
> transceiver every second (since in MASTER state adverts are sent every
> second). And in BACKUP state, if the VRRP state machine wants to transition
> to MASTER state (no remote MASTER adverts received), the MII probe will
> detect whether the remote MASTER is down or the local link is lost. => So
> the MII states will condition the new VRRP FAULT_STATE. => That way
> takeover will be quicker, even with sync_instance, because the probe will
> be done in both states MASTER & BACKUP.

You have to make sure that if the MASTER detects link failure you shut it down, since the BACKUP is about to come up.
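
Put together, the probe-before-transition idea and the proposed FAULT state amount to a small transition function. The sketch below is a simplification under assumptions: the identifiers (`enum vrrp_state`, `vrrp_next_state`) are hypothetical, priorities and preemption are ignored, and it assumes that an instance whose link comes back re-enters as BACKUP, which the thread does not actually specify.

```c
#include <stdbool.h>

enum vrrp_state { VRRP_BACKUP, VRRP_MASTER, VRRP_FAULT };

/* link_up:        result of the local MII probe
 * advert_timeout: no MASTER advert was seen within the master-down interval */
enum vrrp_state vrrp_next_state(enum vrrp_state cur, bool link_up,
                                bool advert_timeout)
{
        /* Local link lost: a MASTER must stand down and a BACKUP must not
         * mistake its own failure for a dead MASTER. Both go to FAULT. */
        if (!link_up)
                return VRRP_FAULT;

        switch (cur) {
        case VRRP_FAULT:
                /* assumption: rejoin as BACKUP and wait for adverts */
                return VRRP_BACKUP;
        case VRRP_BACKUP:
                /* link is fine, so missing adverts really mean MASTER is gone */
                return advert_timeout ? VRRP_MASTER : VRRP_BACKUP;
        case VRRP_MASTER:
                return VRRP_MASTER;
        }
        return cur;
}
```

Because the link check comes first, the unplugged-BACKUP case from the beginning of this mail ends in FAULT instead of a second MASTER.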

> For the MII I need to export mii-diag code into a layer1.c lib in
> keepalived. Basically functions probing the MII transceiver. The functions
> will be :
> o int mii_tranceiver_present(int ifindex); => checking MII availability
> through SIOCGMIIPHY ioctl call

The problem is that SIOCGMIIPHY is not supported by all devices. For a driver that lacks it, you need to add a 'struct mii_if_info mii', including its mdio_read() calls, to the driver. Example:

laphish:~ # ./iftest eth0
Interface [eth0] is up
ioctl(SIOCGMIIPHY): Operation not supported
ioctl(SIOCGMIIREG): Operation not supported
laphish:~ # cat iftest.c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if.h>
#include <linux/sockios.h>
#include <sys/ioctl.h>

int main(int argc, char *argv[]){
        struct ifreq ifr;
        char *device="lo";
        int s;

        if (argc>1){
                device=argv[1];
        } else {
                printf("Please set a device.\n");
                return 1;
        }

        /* PF_PACKET sockets need root privileges */
        if ((s=socket(PF_PACKET, SOCK_DGRAM, 0))<0){
                perror("socket");
                return 1;
        }
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, device, IFNAMSIZ-1);
        if (ioctl(s, SIOCGIFINDEX, &ifr) < 0) {
                printf("Unknown interface %s\n", device);
                return 1;
        }
        if (ioctl(s, SIOCGIFFLAGS, &ifr)) {
                perror("ioctl(SIOCGIFFLAGS)");
                return 1;
        }
        if (!(ifr.ifr_flags&IFF_UP)) {
                printf("Interface [%s] is down\n", device);
        } else if (ifr.ifr_flags&(IFF_NOARP|IFF_LOOPBACK)){
                printf("Loopback interface which is not ARPable\n");
        } else {
                printf("Interface [%s] is up\n", device);
        }
        /* try both MII ioctls so that each failure gets reported */
        if (ioctl(s, SIOCGMIIPHY, &ifr) < 0) {
                perror("ioctl(SIOCGMIIPHY)");
        }
        if (ioctl(s, SIOCGMIIREG, &ifr) < 0) {
                perror("ioctl(SIOCGMIIREG)");
        }
        return 0;
}
laphish:~ # ip link set dev eth0 down
laphish:~ # ./iftest eth0
Interface [eth0] is down
ioctl(SIOCGMIIPHY): Operation not supported
ioctl(SIOCGMIIREG): Operation not supported
laphish:~ # ip link set dev eth0 up
laphish:~ # ./iftest eth0
Interface [eth0] is up
ioctl(SIOCGMIIPHY): Operation not supported
ioctl(SIOCGMIIREG): Operation not supported
laphish:~ #

> o struct MII *mii_probe(int ifindex); => probing and fetching MII infos =>
> during VRRP bootstrap

And these are not present on a wide range of NICs

> o int mii_linkup(int ifindex); => does MII report a properly functional
> link beat ?

On certain NICs I think so, but I'm not sure there.

> that is all :)

Well, you've got Easter time now. Send your wife to some nice holiday trip and start coding.

> => Will try today :) or this week end

You can contact me offline about the status of your development.

> :) I need to obtain agreements from my employer before... Is there a sign

Give me his phone number and I'll talk to him.

> up deadline ? signup is needed for OLS passport ?

There is no explicit deadline, but OLS is (was?) _the_ Linux kernel hacker event. It is no tradeshow, just pure technical talks and BOFs. You certainly don't want to miss it, especially since Jerome Etienne was a speaker there in 2000 and talked about ARPsec and VRRP. You could show up and tell people how much you've improved the code and the framework :)

Best regards,
Roberto Nibali, ratz
