LVS
lvs-devel
Google
 
Web LinuxVirtualServer.org

Re: [RFC,PATCH] ipvs: Fix race condition in lblb and lblcr schedulers

To: Sven Wegener <sven.wegener@xxxxxxxxxxx>
Subject: Re: [RFC,PATCH] ipvs: Fix race condition in lblb and lblcr schedulers
Cc: lvs-devel@xxxxxxxxxxxxxxx, netdev@xxxxxxxxxxxxxxx
From: Simon Horman <horms@xxxxxxxxxxxx>
Date: Wed, 13 Aug 2008 15:42:49 +1000
On Tue, Aug 12, 2008 at 03:07:48PM +0200, Sven Wegener wrote:
> On Tue, 12 Aug 2008, Sven Wegener wrote:
> 
> > On Tue, 12 Aug 2008, Simon Horman wrote:
> > 
> > > On Tue, Aug 12, 2008 at 12:57:21AM +0200, Sven Wegener wrote:
> > > > Both schedulers have a race condition that happens in the following 
> > > > situation:
> > > > 
> > > > We have an entry in our table that already has expired according to 
> > > > it's 
> > > > last use time. Then we need to schedule a new connection that uses this 
> > > > entry.
> > > > 
> > > > CPU 1                           CPU 2
> > > > 
> > > > ip_vs_lblc_schedule()
> > > >   ip_vs_lblc_get()
> > > >     lock table for read
> > > >     find entry
> > > >     unlock table
> > > >                                 ip_vs_lblc_check_expire()
> > > >                                   lock table for write
> > > >                                   kfree() expired entry
> > > >                                   unlock table
> > > >     return invalid entry
> > > > 
> > > > Problem is that we assign the last use time outside of our critical 
> > > > region. We can make hitting this race more difficult, if not 
> > > > impossible, 
> > > > if we assign the last use time while still holding the lock for 
> > > > reading. 
> > > > That gives us six minutes during which it's save to use the entry, 
> > > > which 
> > > > should be enough for our use case, as we're going to use it immediately 
> > > > and don't keep a long reference to it.
> > > > 
> > > > We're holding the lock for reading and not for writing. The last use 
> > > > time 
> > > > is an unsigned long, so the assignment should be atomic by itself. And 
> > > > we 
> > > > don't care, if some other user sets it to a slightly different value. 
> > > > The 
> > > > read_unlock() implies a barrier so that other CPUs see the new last use 
> > > > time during cleanup, even if we're just using a read lock.
> > > > 
> > > > Other solutions would be: 1) protect the whole ip_vs_lblc_schedule() 
> > > > with 
> > > > write_lock()ing the lock, 2) add reference counting for the entries, 3) 
> > > > protect each entry with it's own lock. And all are bad for performance.
> > > > 
> > > > Comments? Ideas?
> > > 
> > > Is there a pathological case here if sysctl_ip_vs_lblc_expiration is
> > > set to be very short and we happen to hit ip_vs_lblc_full_check()?
> > 
> > Yes.
> > 
> > > To be honest I think that I like the reference count approach best,
> > > as it seems safe and simple. Is it really going to be horrible
> > > for performance?
> > 
> > Probably not, I guess the sentence was a bit pessimistic.
> > 
> > > If so, I wonder if a workable solution would be to provide a more 
> > > fine-grained
> > > lock on tbl. Something like the way that ct_read_lock/unlock() works.
> > 
> > Also possible. But I guess I was thinking too complicated last night. What 
> > I was after with the "protect the whole ip_vs_lblc_schedule() with 
> > write_lock()ing the lock" was also to simply prevent someone adding 
> > duplicate entries. If we just extend the read_lock() region to cover the 
> > whole usage of the entry and do an additional duplicate check during 
> > inserting the entry under write_lock(), we fix the issue and also fix the 
> > race that someone may add duplicate entries. We have a bit overhead, 
> > because we unlock/lock and also there is a chance of doing one or more 
> > useless __ip_vs_wlc_schedule(), but in the end this should work. More 
> > fine-grained locking could help to lower the impact.
> 
> I wondered if this whole thing can ever be totally race condition free, 
> without changing how destinations are purged from the trash. Initial 
> version of the patch below. Basically it pushes the locking up into 
> ip_vs_lblc_schedule() and rearranges the code to be safe that we have a 
> valid destination.

Hi Sven,

I think that I am happy with this, except for updating lastuse
under a read lock, is it really safe? Do you still have any concerns?

I've annotated the code with my thoughts on how your locking is working
- thinking aloud if you like.

> diff --git a/net/ipv4/ipvs/ip_vs_lblc.c b/net/ipv4/ipvs/ip_vs_lblc.c
> index 7a6a319..67f7b04 100644
> --- a/net/ipv4/ipvs/ip_vs_lblc.c
> +++ b/net/ipv4/ipvs/ip_vs_lblc.c
> @@ -123,31 +123,6 @@ static ctl_table vs_vars_table[] = {
>  
>  static struct ctl_table_header * sysctl_header;
>  
> -/*
> - *      new/free a ip_vs_lblc_entry, which is a mapping of a destionation
> - *      IP address to a server.
> - */
> -static inline struct ip_vs_lblc_entry *
> -ip_vs_lblc_new(__be32 daddr, struct ip_vs_dest *dest)
> -{
> -     struct ip_vs_lblc_entry *en;
> -
> -     en = kmalloc(sizeof(struct ip_vs_lblc_entry), GFP_ATOMIC);
> -     if (en == NULL) {
> -             IP_VS_ERR("ip_vs_lblc_new(): no memory\n");
> -             return NULL;
> -     }
> -
> -     INIT_LIST_HEAD(&en->list);
> -     en->addr = daddr;
> -
> -     atomic_inc(&dest->refcnt);
> -     en->dest = dest;
> -
> -     return en;
> -}
> -
> -
>  static inline void ip_vs_lblc_free(struct ip_vs_lblc_entry *en)
>  {
>       list_del(&en->list);
> @@ -189,10 +164,8 @@ ip_vs_lblc_hash(struct ip_vs_lblc_table *tbl, struct 
> ip_vs_lblc_entry *en)
>        */
>       hash = ip_vs_lblc_hashkey(en->addr);
>  
> -     write_lock(&tbl->lock);
>       list_add(&en->list, &tbl->bucket[hash]);
>       atomic_inc(&tbl->entries);
> -     write_unlock(&tbl->lock);
>  
>       return 1;
>  }
> @@ -209,23 +182,54 @@ ip_vs_lblc_get(struct ip_vs_lblc_table *tbl, __be32 
> addr)
>  
>       hash = ip_vs_lblc_hashkey(addr);
>  
> -     read_lock(&tbl->lock);
> -
>       list_for_each_entry(en, &tbl->bucket[hash], list) {
>               if (en->addr == addr) {
>                       /* HIT */
> -                     read_unlock(&tbl->lock);
>                       return en;
>               }
>       }
>  
> -     read_unlock(&tbl->lock);
> -
>       return NULL;
>  }
>  
>  
>  /*
> + *      new/free a ip_vs_lblc_entry, which is a mapping of a destionation
> + *      IP address to a server.
> + */
> +static inline struct ip_vs_lblc_entry *
> +ip_vs_lblc_new(struct ip_vs_lblc_table *tbl, __be32 daddr,
> +            struct ip_vs_dest *dest)
> +{
> +     struct ip_vs_lblc_entry *en;
> +
> +     en = ip_vs_lblc_get(tbl, daddr);
> +     if (!en) {
> +             en = kmalloc(sizeof(*en), GFP_ATOMIC);
> +             if (!en) {
> +                     IP_VS_ERR("ip_vs_lblc_new(): no memory\n");
> +                     return NULL;
> +             }
> +
> +             INIT_LIST_HEAD(&en->list);
> +             en->addr = daddr;
> +             en->lastuse = jiffies;
> +
> +             atomic_inc(&dest->refcnt);
> +             en->dest = dest;
> +
> +             ip_vs_lblc_hash(tbl, en);
> +     } else if (en->dest != dest) {
> +             atomic_dec(&en->dest->refcnt);
> +             atomic_inc(&dest->refcnt);
> +             en->dest = dest;
> +     }
> +
> +     return en;
> +}
> +
> +
> +/*
>   *      Flush all the entries of the specified table.
>   */
>  static void ip_vs_lblc_flush(struct ip_vs_lblc_table *tbl)
> @@ -484,7 +488,7 @@ is_overloaded(struct ip_vs_dest *dest, struct 
> ip_vs_service *svc)
>  static struct ip_vs_dest *
>  ip_vs_lblc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
>  {
> -     struct ip_vs_dest *dest;
> +     struct ip_vs_dest *dest = NULL;
>       struct ip_vs_lblc_table *tbl;
>       struct ip_vs_lblc_entry *en;
>       struct iphdr *iph = ip_hdr(skb);
> @@ -492,38 +496,43 @@ ip_vs_lblc_schedule(struct ip_vs_service *svc, const 
> struct sk_buff *skb)
>       IP_VS_DBG(6, "ip_vs_lblc_schedule(): Scheduling...\n");
>  
>       tbl = (struct ip_vs_lblc_table *)svc->sched_data;
> +     /* First look in our cache */
> +     read_lock(&tbl->lock);
>       en = ip_vs_lblc_get(tbl, iph->daddr);
> -     if (en == NULL) {
> -             dest = __ip_vs_wlc_schedule(svc, iph);
> -             if (dest == NULL) {
> -                     IP_VS_DBG(1, "no destination available\n");
> -                     return NULL;
> -             }
> -             en = ip_vs_lblc_new(iph->daddr, dest);
> -             if (en == NULL) {
> -                     return NULL;
> -             }
> -             ip_vs_lblc_hash(tbl, en);
> -     } else {
> +     if (en) {
>               dest = en->dest;
> -             if (!(dest->flags & IP_VS_DEST_F_AVAILABLE)
> -                 || atomic_read(&dest->weight) <= 0
> -                 || is_overloaded(dest, svc)) {
> -                     dest = __ip_vs_wlc_schedule(svc, iph);
> -                     if (dest == NULL) {
> -                             IP_VS_DBG(1, "no destination available\n");
> -                             return NULL;
> -                     }
> -                     atomic_dec(&en->dest->refcnt);
> -                     atomic_inc(&dest->refcnt);
> -                     en->dest = dest;
> -             }
> +             en->lastuse = jiffies;

Setting en->lastuse with a read lock held makes me feel uncomfortable.
Are you sure it is safe?

> +
> +             /*
> +              * If the destination is not available, i.e. it's in the trash,
> +              * ignore it, as it may be removed from under our feet
> +              */
> +
> +             if (!(dest->flags & IP_VS_DEST_F_AVAILABLE))
> +                     dest = NULL;
>       }
> -     en->lastuse = jiffies;

At this point we know that dest exists and is not in the trash.
We know that dest can't disapear given that its not already
in the trash and our caller holds a read lock on __ip_vs_svc_lock.
So dest will be safe until after ip_vs_lblc_schedule() returns.

Dest seems ok :-)

Ok, that seems complex but non-racy to me :-)

Perhaps a longer comment would be in order.

> +     read_unlock(&tbl->lock);

After the lock is released we may now lose en, but we don't care
as its not used again inside ip_vs_lblc_schedule().

So far so good :-)

> +
> +     /* If the destination has a weight and is not overloaded, use it */
> +     if (dest && atomic_read(&dest->weight) > 0 && !is_overloaded(dest, svc))
> +             goto out;
> +
> +     /* No cache entry or it is invalid, time to schedule */
> +     dest = __ip_vs_wlc_schedule(svc, iph);
> +     if (!dest) {
> +             IP_VS_DBG(1, "no destination available\n");
> +             return NULL;
> +     }

This new dest must also not be in the trash by virtue of
being returned by __ip_vs_wlc_schedule() and delitions from
svc->destinations being protected by __ip_vs_svc_lock, which
is held by the caller.

Also good :-)

BTW, the name of __ip_vs_wlc_schedule, really ought to be changed
to __ip_vs_lblc_schedule. I'll make a trivial patch for that.

> +
> +     /* If we fail to create a cache entry, we'll just use the valid dest */
> +     write_lock(&tbl->lock);
> +     ip_vs_lblc_new(tbl, iph->daddr, dest);
> +     write_unlock(&tbl->lock);

This write lock looks obviously correct as the destructors
do the same thing.

> +out:
>       IP_VS_DBG(6, "LBLC: destination IP address %u.%u.%u.%u "
>                 "--> server %u.%u.%u.%u:%d\n",
> -               NIPQUAD(en->addr),
> +               NIPQUAD(iph->daddr),
>                 NIPQUAD(dest->addr),
>                 ntohs(dest->port));
>  
--
To unsubscribe from this list: send the line "unsubscribe lvs-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

<Prev in Thread] Current Thread [Next in Thread>