Re: VRRP and the kernel

To: Julian Anastasov <ja@xxxxxx>
Subject: Re: VRRP and the kernel
Cc: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
From: "Alexandre CASSEN" <alexandre.cassen@xxxxxxxxxxxxxx>
Date: Fri, 23 Nov 2001 14:29:23 +0100

>> The problem: the Linux kernel doesn't permit working with many MACs on a
>> NIC (only one at a time).
>> For the moment we have some ways to solve this problem:
>> 1. Do not handle the VMAC; just send gratuitous ARP during VRRP VIP takeover
>    So, it is not recommended for use because we don't reply to
>valid ARP probes with our VMAC?

=> Yes. The VMAC<->VIP mapping changes, and if we use a switch, a VMAC can
remain stale in the CAM entry of another port...

>> 2. Use a userspace ARP daemon replying to ARP requests => At this point we
>> think that userspace is probably not the right place to handle this
>> issue...
>> 3. Apply a kernel space patch to reply to ARP requests for VRRP VIP => so we
>> will be able to handle as many VMACs as VRRP Virtual Routers => That way
>> will be RFC compliant.
>> 4....
>    Is VRRP really working with LVS? I ask because I see that LVS
>works only with PACKET_HOST packets.

Currently it works with gratuitous ARP updating the remote caches (5
gratuitous ARPs on each VIP during VIP takeover). But it is not RFC
compliant. PACKET_HOST, yes... For LVS, VRRP VIP(s) are simple secondary
IP address(es). Simulating a VRRP VIP takeover consists of:

1. The BACKUP router doesn't receive the MASTER multicast VRRP advert
2. It then transitions to MASTER state
3. So the BACKUP router becomes MASTER and sets the VIP using a netlink call
(the same as: ip address add dev ethX)

=> And in our LVS configuration LVS VIP(s) = VRRP VIP(s)
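
The three takeover steps above can be sketched as a small state check. This is only an illustration under stated assumptions: the names (vrrp_instance, vrrp_check_takeover, master_down_ms) are hypothetical and not taken from keepalived, and the netlink "ip address add" call is replaced by a flag so the logic can be followed on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the keepalived data structures. */
enum vrrp_state { VRRP_BACKUP, VRRP_MASTER };

struct vrrp_instance {
    enum vrrp_state state;
    uint64_t last_advert_ms;  /* time the last MASTER advert was seen */
    uint64_t master_down_ms;  /* silence tolerated before taking over */
    int vip_set;              /* 1 once the VIP has been added locally */
};

/* Record an advert received from the current MASTER. */
void vrrp_advert_seen(struct vrrp_instance *v, uint64_t now_ms)
{
    assert(v != NULL);
    v->last_advert_ms = now_ms;
}

/* Periodic check. Returns 1 if this call performed a takeover, else 0. */
int vrrp_check_takeover(struct vrrp_instance *v, uint64_t now_ms)
{
    assert(v != NULL);
    if (v->state != VRRP_BACKUP)
        return 0;                    /* already MASTER, nothing to do */
    if (now_ms - v->last_advert_ms < v->master_down_ms)
        return 0;                    /* step 1 not met: adverts still seen */

    v->state = VRRP_MASTER;          /* step 2: transition to MASTER */
    v->vip_set = 1;                  /* step 3: stand-in for the netlink
                                        "ip address add" call */
    return 1;
}
```

A caller would invoke vrrp_advert_seen() from the advert receive path and vrrp_check_takeover() from a timer.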

>What I understand from the kernel
>sources is that by using VRRP we will receive packets to VIP marked
>as skb->pkt_type = PACKET_MULTICAST (due to the VRRP VMAC prefix) but
>rt_type = RTN_LOCAL (VIP is local).

The source IP in the IP header of a VRRP advert is the primary IP address of
the remote MASTER.

VRRP Instance advert socket creation is :

  fd = socket(AF_INET, SOCK_RAW, proto); /* proto = IPPROTO_VRRP or ... */
  /* inbound binding */
  ret = setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                   ifname, strlen(ifname)+1);
  /* outbound binding */
  memset(&req_add, 0, sizeof (req_add));
  req_add.imr_multiaddr.s_addr = htonl(INADDR_VRRP_GROUP);
  req_add.imr_address.s_addr = htonl(index_to_ip(index));
  req_add.imr_ifindex = index;
  ret = setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                   (char *)&req_add, sizeof(struct ip_mreqn));

So skb->pkt_type = PACKET_MULTICAST will be set only for traffic that is not
for a VRRP VIP, i.e. for the MASTER interface primary IP address. Only the
interface primary IP address joins the VRRP multicast group,
so for the VIP the skb packet type will be PACKET_HOST, and as the VIP is
RTN_LOCAL, LVS can handle traffic on the VRRP VIP. Do you agree?
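
To make the reasoning above concrete, here is a simplified model of how a received frame gets its packet type from the destination MAC alone. It is a sketch, not kernel code: the enum values are local stand-ins for the kernel's PACKET_* constants, the function names are hypothetical, and 224.0.0.18 is assumed as the VRRP group address from the RFC.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins for PACKET_HOST, PACKET_BROADCAST, PACKET_MULTICAST and
 * PACKET_OTHERHOST; the real constants live in linux/if_packet.h. */
enum pkt_type { PKT_HOST, PKT_BROADCAST, PKT_MULTICAST, PKT_OTHERHOST };

/* Classify a frame from its destination MAC and the receiving NIC's MAC. */
enum pkt_type classify_frame(const uint8_t dst[6], const uint8_t dev_mac[6])
{
    static const uint8_t bcast[6] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

    if (dst[0] & 0x01) {                  /* group bit set in first octet */
        if (memcmp(dst, bcast, 6) == 0)
            return PKT_BROADCAST;
        return PKT_MULTICAST;
    }
    if (memcmp(dst, dev_mac, 6) == 0)
        return PKT_HOST;                  /* addressed to this NIC */
    return PKT_OTHERHOST;
}

/* Map an IPv4 multicast group (host byte order) to its Ethernet address:
 * 01:00:5E followed by the low 23 bits of the group. */
void mcast_ip_to_mac(uint32_t group, uint8_t mac[6])
{
    assert((group >> 28) == 0xE);         /* must be a class D address */
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (uint8_t)((group >> 16) & 0x7f);
    mac[4] = (uint8_t)((group >> 8) & 0xff);
    mac[5] = (uint8_t)(group & 0xff);
}
```

In this model an advert sent to the VRRP group is classified as multicast, while traffic for a VIP that is reached through the NIC's own MAC is classified as host traffic.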

=> To test, I have built a real network topology for VRRP: 2 LVS Directors
(LD1, LD2), 2 NICs on each LD, and 4 VRRP instances running on each
interface (exactly the topology described in LVS-HA-using-VRRPv2.pdf
on cheetah).

The IP header src field is filled with the IP address associated with the
ifindex on which the VRRP Instance is running.

=> The VRRP VMAC is the same for all IP addresses of a Virtual Router ID, so
a VRID owns its VIPs with only one VMAC:
  /* complete the VMAC address */
  vsrv->hwaddr[0] = 0x00;
  vsrv->hwaddr[1] = 0x00;
  vsrv->hwaddr[2] = 0x5E;
  vsrv->hwaddr[3] = 0x00;
  vsrv->hwaddr[4] = 0x01;
  vsrv->hwaddr[5] = vsrv->vrid;

=> The VRID identifies a VRRP Instance. If we run multiple VRRP Instances,
all VRIDs are different.
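
The byte assignments above amount to a fixed five-byte prefix followed by the VRID. A minimal standalone sketch, with hypothetical helper names (vrrp_vmac_from_vrid, mac_to_str) that are not part of keepalived:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fill mac[] with 00:00:5E:00:01:{VRID}, matching the assignments above. */
void vrrp_vmac_from_vrid(uint8_t vrid, uint8_t mac[6])
{
    static const uint8_t prefix[5] = {0x00, 0x00, 0x5e, 0x00, 0x01};

    assert(vrid != 0);               /* VRIDs run from 1 to 255 */
    memcpy(mac, prefix, sizeof(prefix));
    mac[5] = vrid;
}

/* Format a MAC as "xx:xx:xx:xx:xx:xx"; buf must hold at least 18 bytes. */
void mac_to_str(const uint8_t mac[6], char buf[18])
{
    snprintf(buf, 18, "%02x:%02x:%02x:%02x:%02x:%02x",
             mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
}
```

Because only the last byte varies, two instances with different VRIDs always get different VMACs, and every VIP of one VRID shares a single VMAC.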

>I prefer that we keep the VIP-VMAC table device independent
>or we have to duplicate the VIP->VMAC list for each device.


>Because we
>can receive ARP probe "who-has VIP1 tell UNIQUE_IP" from many devices
>and we have to send same reply through all of them. This is different
>from the normal behavior for PACKET_HOST IPs because we send the ARP
>replies always with the device's MAC. But replying with same VMAC
>through all devices can cause problems with the switch?

Yes, because the switch can place a VMAC into a "blocker_table switch", since
the switch can detect a MAC loop (for example cabletron/enterasys switch
securefast code will do this...)... or must deal with spanning trees....
(but still swiss army knife solution :))

>    If we reply through each device with a different VMAC for
>same IP then we can end up with "ARP race": the remote hosts will
>see how each VIP changes frequently its MAC because all these devices
>in our VRRP router are attached to same hub. Nothing different from
>the current behavior. So, it seems each device needs its own VMAC
>and then a global VIP->VMAC table or may be this is not possible.
>What is the current state in keepalived?

keepalived uses the same MAC for all VRIDs (VRRP Instances); this MAC is the
manufactured physical NIC MAC. So all VRRP Instances use the same MAC address.

>I remember something for
>different instances per device but how is that related to the ARP?
>Do we know with what MAC we should send our ARP reply considering
>the requested IP and the input device where the ARP probes was

Yes we can know... using the VRRP Instance state. If a VRRP Instance is in
BACKUP state it doesn't reply to ARP requests with the VMAC of its VRID. So
symmetrically, in MASTER state it replies to ARP requests using the VMAC of
its VRID.
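
The rule above (reply with the VMAC only while MASTER) could be expressed as a lookup over a device-independent VIP table. This is a sketch of the idea only, not the proposed kernel patch; the struct layout and the function name arp_reply_vmac are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum vrrp_state { VRRP_BACKUP, VRRP_MASTER };

/* One entry per VIP; hypothetical layout for illustration. */
struct vrrp_vip {
    uint32_t vip;              /* VIP in host byte order */
    uint8_t vrid;              /* VRID of the instance owning this VIP */
    enum vrrp_state state;     /* current state of that instance */
};

/*
 * Decide whether to answer an ARP request for target_ip.
 * Returns 1 and fills mac[] with the VMAC of the owning VRID when that
 * instance is MASTER; returns 0 when it is BACKUP or the IP is not a VIP.
 */
int arp_reply_vmac(const struct vrrp_vip *tbl, size_t n,
                   uint32_t target_ip, uint8_t mac[6])
{
    size_t i;

    assert(tbl != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (tbl[i].vip != target_ip)
            continue;
        if (tbl[i].state != VRRP_MASTER)
            return 0;          /* BACKUP stays silent */
        mac[0] = 0x00;
        mac[1] = 0x00;
        mac[2] = 0x5e;
        mac[3] = 0x00;
        mac[4] = 0x01;
        mac[5] = tbl[i].vrid;
        return 1;
    }
    return 0;                  /* not one of our VIPs */
}
```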

>> >> Yes ARP replies... In fact currently I use simple gratuitous ARP to update
>> >> remote caches... but during IP takeover using this technic I have a
>> >> expiration... Do you think that gratuitous ARP using the real NIC and ARP
>> >
>> >    Which TTL expires?
>> Probing the current VRRP implementation during IP takeover, I leave a
>> ping running on a VRRP VIP (from a third party workstation). When takeover
>> happens, I have TTL expiration (I do not really understand why...)... no
>> packets lost, but IP takeover introduces this strange TTL expiration
>> (probably due to gratuitous ARP to update cache, or my switch, ...). If I
>> use VMAC (one at a time), no TTL expiration... because the MAC address
>> stays the same for the takeover.
>    I assume the case is that you receive ICMP_TIME_EXCEEDED with
>ICMP_EXC_TTL from some host? Then it seems nobody wants to accept this
>packet locally and it loops between two routers? Is that the case
>considering your routing topology? When TTL reaches 1 and one of
>the routers replies to you with ICMP?

hmm... will tcpdump it :)

>> => Gratuitous ARP is not really needed during takeover... only if we are
>> using a switch... need to update the switch CAM table (VMAC1 changes from
>> switch port1 to switch port2 for example).
>    Agreed. But may be it is useful to update the expiration timers
>for the remote hosts' ARP entries (I don't know how much takes the
>failover, may be they will mark the VIP as staled, this can be bad
>for setups with passive dead gateway detection).

agreed :)
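
For reference, a gratuitous ARP as discussed above is an ARP frame whose sender IP and target IP are both the VIP, sent to the Ethernet broadcast address. The sketch below only builds the 42-byte frame; sending it (for example on a packet socket, several times per VIP) is left out. The function name and the choices of opcode "request" and an all-zero target MAC are assumptions for this example, not necessarily what keepalived emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GARP_FRAME_LEN 42  /* 14-byte Ethernet header + 28-byte ARP body */

/* Build a gratuitous ARP for vip (host byte order) announced by src_mac.
 * Returns the number of bytes written to frame. */
size_t build_gratuitous_arp(uint8_t frame[GARP_FRAME_LEN],
                            const uint8_t src_mac[6], uint32_t vip)
{
    uint8_t ip[4];
    uint8_t *p = frame;

    assert(frame != NULL && src_mac != NULL);
    ip[0] = (uint8_t)(vip >> 24);
    ip[1] = (uint8_t)(vip >> 16);
    ip[2] = (uint8_t)(vip >> 8);
    ip[3] = (uint8_t)vip;

    memset(p, 0xff, 6);     p += 6;   /* Ethernet dst: broadcast */
    memcpy(p, src_mac, 6);  p += 6;   /* Ethernet src */
    *p++ = 0x08; *p++ = 0x06;         /* EtherType: ARP */

    *p++ = 0x00; *p++ = 0x01;         /* hardware type: Ethernet */
    *p++ = 0x08; *p++ = 0x00;         /* protocol type: IPv4 */
    *p++ = 6;                         /* hardware address length */
    *p++ = 4;                         /* protocol address length */
    *p++ = 0x00; *p++ = 0x01;         /* opcode: request */
    memcpy(p, src_mac, 6);  p += 6;   /* sender MAC */
    memcpy(p, ip, 4);       p += 4;   /* sender IP = VIP */
    memset(p, 0x00, 6);     p += 6;   /* target MAC: left as zeros */
    memcpy(p, ip, 4);       p += 4;   /* target IP = VIP */

    return (size_t)(p - frame);
}
```

The source MAC in the Ethernet header is what lets a switch relearn the port in its CAM table, and the sender MAC/IP pair is what refreshes the remote hosts' ARP entries.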

