Re: [RFC,PATCH] ipvs: Fix race condition in lblb and lblcr schedulers

To: Sven Wegener <sven.wegener@xxxxxxxxxxx>
Subject: Re: [RFC,PATCH] ipvs: Fix race condition in lblb and lblcr schedulers
Cc: lvs-devel@xxxxxxxxxxxxxxx, netdev@xxxxxxxxxxxxxxx
From: Simon Horman <horms@xxxxxxxxxxxx>
Date: Wed, 13 Aug 2008 17:12:22 +1000
On Wed, Aug 13, 2008 at 08:59:32AM +0200, Sven Wegener wrote:
> On Wed, 13 Aug 2008, Simon Horman wrote:
> 
> > On Tue, Aug 12, 2008 at 03:07:48PM +0200, Sven Wegener wrote:
> > > On Tue, 12 Aug 2008, Sven Wegener wrote:
> > > 
> > > > On Tue, 12 Aug 2008, Simon Horman wrote:
> > > > 
> > > > > On Tue, Aug 12, 2008 at 12:57:21AM +0200, Sven Wegener wrote:
> > > > > > Both schedulers have a race condition that happens in the
> > > > > > following situation:
> > > > > > 
> > > > > > We have an entry in our table that has already expired according
> > > > > > to its last use time. Then we need to schedule a new connection
> > > > > > that uses this entry.
> > > > > > 
> > > > > > CPU 1                           CPU 2
> > > > > > 
> > > > > > ip_vs_lblc_schedule()
> > > > > >   ip_vs_lblc_get()
> > > > > >     lock table for read
> > > > > >     find entry
> > > > > >     unlock table
> > > > > >                                 ip_vs_lblc_check_expire()
> > > > > >                                   lock table for write
> > > > > >                                   kfree() expired entry
> > > > > >                                   unlock table
> > > > > >     return invalid entry
> > > > > > 
> > > > > > The problem is that we assign the last use time outside of our
> > > > > > critical region. We can make hitting this race more difficult, if
> > > > > > not impossible, if we assign the last use time while still holding
> > > > > > the lock for reading. That gives us six minutes during which it's
> > > > > > safe to use the entry, which should be enough for our use case, as
> > > > > > we're going to use it immediately and don't keep a long reference
> > > > > > to it.
> > > > > > 
> > > > > > We're holding the lock for reading and not for writing. The last
> > > > > > use time is an unsigned long, so the assignment should be atomic
> > > > > > by itself. And we don't care if some other user sets it to a
> > > > > > slightly different value. The read_unlock() implies a barrier, so
> > > > > > other CPUs see the new last use time during cleanup, even if
> > > > > > we're just using a read lock.
> > > > > > 
> > > > > > Other solutions would be: 1) protect the whole
> > > > > > ip_vs_lblc_schedule() by write_lock()ing the lock, 2) add
> > > > > > reference counting for the entries, 3) protect each entry with
> > > > > > its own lock. And all are bad for performance.
> > > > > > 
> > > > > > Comments? Ideas?
> > > > > 
> > > > > Is there a pathological case here if sysctl_ip_vs_lblc_expiration is
> > > > > set to be very short and we happen to hit ip_vs_lblc_full_check()?
> > > > 
> > > > Yes.
> > > > 
> > > > > To be honest I think that I like the reference count approach best,
> > > > > as it seems safe and simple. Is it really going to be horrible
> > > > > for performance?
> > > > 
> > > > Probably not, I guess the sentence was a bit pessimistic.
> > > > 
> > > > > If so, I wonder if a workable solution would be to provide a more
> > > > > fine-grained lock on tbl. Something like the way that
> > > > > ct_read_lock/unlock() works.
> > > > 
> > > > Also possible. But I guess I was thinking too complicated last
> > > > night. What I was after with the "protect the whole
> > > > ip_vs_lblc_schedule() with write_lock()ing the lock" was also to
> > > > simply prevent someone adding duplicate entries. If we just extend
> > > > the read_lock() region to cover the whole usage of the entry and do
> > > > an additional duplicate check while inserting the entry under
> > > > write_lock(), we fix the issue and also fix the race that someone
> > > > may add duplicate entries. We have a bit of overhead, because we
> > > > unlock/lock, and there is also a chance of doing one or more useless
> > > > __ip_vs_wlc_schedule() calls, but in the end this should work. More
> > > > fine-grained locking could help to lower the impact.
> > > 
> > > I wondered if this whole thing can ever be totally race-condition
> > > free without changing how destinations are purged from the trash.
> > > Initial version of the patch below. Basically it pushes the locking
> > > up into ip_vs_lblc_schedule() and rearranges the code to make sure
> > > that we have a valid destination.
> > 
> > Hi Sven,
> > 
> > I think that I am happy with this, except for updating lastuse under a
> > read lock. Is it really safe? Do you still have any concerns?
> > 
> > I've annotated the code with my thoughts on how your locking is working
> > - thinking aloud if you like.
> 
> Your locking description is exactly how it was intended by me.

Great, I wanted to make sure that we are both on the same page.

> > > diff --git a/net/ipv4/ipvs/ip_vs_lblc.c b/net/ipv4/ipvs/ip_vs_lblc.c
> > > index 7a6a319..67f7b04 100644
> > > --- a/net/ipv4/ipvs/ip_vs_lblc.c
> > > +++ b/net/ipv4/ipvs/ip_vs_lblc.c
> > > @@ -123,31 +123,6 @@ static ctl_table vs_vars_table[] = {
> > >  
> > >  static struct ctl_table_header * sysctl_header;
> > >  
> > > -/*
> > > - *      new/free a ip_vs_lblc_entry, which is a mapping of a destionation
> > > - *      IP address to a server.
> > > - */
> > > -static inline struct ip_vs_lblc_entry *
> > > -ip_vs_lblc_new(__be32 daddr, struct ip_vs_dest *dest)
> > > -{
> > > - struct ip_vs_lblc_entry *en;
> > > -
> > > - en = kmalloc(sizeof(struct ip_vs_lblc_entry), GFP_ATOMIC);
> > > - if (en == NULL) {
> > > -         IP_VS_ERR("ip_vs_lblc_new(): no memory\n");
> > > -         return NULL;
> > > - }
> > > -
> > > - INIT_LIST_HEAD(&en->list);
> > > - en->addr = daddr;
> > > -
> > > - atomic_inc(&dest->refcnt);
> > > - en->dest = dest;
> > > -
> > > - return en;
> > > -}
> > > -
> > > -
> > >  static inline void ip_vs_lblc_free(struct ip_vs_lblc_entry *en)
> > >  {
> > >   list_del(&en->list);
> > > @@ -189,10 +164,8 @@ ip_vs_lblc_hash(struct ip_vs_lblc_table *tbl, struct ip_vs_lblc_entry *en)
> > >    */
> > >   hash = ip_vs_lblc_hashkey(en->addr);
> > >  
> > > - write_lock(&tbl->lock);
> > >   list_add(&en->list, &tbl->bucket[hash]);
> > >   atomic_inc(&tbl->entries);
> > > - write_unlock(&tbl->lock);
> > >  
> > >   return 1;
> > >  }
> > > @@ -209,23 +182,54 @@ ip_vs_lblc_get(struct ip_vs_lblc_table *tbl, __be32 addr)
> > >  
> > >   hash = ip_vs_lblc_hashkey(addr);
> > >  
> > > - read_lock(&tbl->lock);
> > > -
> > >   list_for_each_entry(en, &tbl->bucket[hash], list) {
> > >           if (en->addr == addr) {
> > >                   /* HIT */
> > > -                 read_unlock(&tbl->lock);
> > >                   return en;
> > >           }
> > >   }
> > >  
> > > - read_unlock(&tbl->lock);
> > > -
> > >   return NULL;
> > >  }
> > >  
> > >  
> > >  /*
> > > + *      new/free a ip_vs_lblc_entry, which is a mapping of a destionation
> > > + *      IP address to a server.
> > > + */
> > > +static inline struct ip_vs_lblc_entry *
> > > +ip_vs_lblc_new(struct ip_vs_lblc_table *tbl, __be32 daddr,
> > > +        struct ip_vs_dest *dest)
> > > +{
> > > + struct ip_vs_lblc_entry *en;
> > > +
> > > + en = ip_vs_lblc_get(tbl, daddr);
> > > + if (!en) {
> > > +         en = kmalloc(sizeof(*en), GFP_ATOMIC);
> > > +         if (!en) {
> > > +                 IP_VS_ERR("ip_vs_lblc_new(): no memory\n");
> > > +                 return NULL;
> > > +         }
> > > +
> > > +         INIT_LIST_HEAD(&en->list);
> > > +         en->addr = daddr;
> > > +         en->lastuse = jiffies;
> > > +
> > > +         atomic_inc(&dest->refcnt);
> > > +         en->dest = dest;
> > > +
> > > +         ip_vs_lblc_hash(tbl, en);
> > > + } else if (en->dest != dest) {
> > > +         atomic_dec(&en->dest->refcnt);
> > > +         atomic_inc(&dest->refcnt);
> > > +         en->dest = dest;
> > > + }
> > > +
> > > + return en;
> > > +}
> > > +
> > > +
> > > +/*
> > >   *      Flush all the entries of the specified table.
> > >   */
> > >  static void ip_vs_lblc_flush(struct ip_vs_lblc_table *tbl)
> > > @@ -484,7 +488,7 @@ is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
> > >  static struct ip_vs_dest *
> > >  ip_vs_lblc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
> > >  {
> > > - struct ip_vs_dest *dest;
> > > + struct ip_vs_dest *dest = NULL;
> > >   struct ip_vs_lblc_table *tbl;
> > >   struct ip_vs_lblc_entry *en;
> > >   struct iphdr *iph = ip_hdr(skb);
> > > @@ -492,38 +496,43 @@ ip_vs_lblc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
> > >   IP_VS_DBG(6, "ip_vs_lblc_schedule(): Scheduling...\n");
> > >  
> > >   tbl = (struct ip_vs_lblc_table *)svc->sched_data;
> > > + /* First look in our cache */
> > > + read_lock(&tbl->lock);
> > >   en = ip_vs_lblc_get(tbl, iph->daddr);
> > > - if (en == NULL) {
> > > -         dest = __ip_vs_wlc_schedule(svc, iph);
> > > -         if (dest == NULL) {
> > > -                 IP_VS_DBG(1, "no destination available\n");
> > > -                 return NULL;
> > > -         }
> > > -         en = ip_vs_lblc_new(iph->daddr, dest);
> > > -         if (en == NULL) {
> > > -                 return NULL;
> > > -         }
> > > -         ip_vs_lblc_hash(tbl, en);
> > > - } else {
> > > + if (en) {
> > >           dest = en->dest;
> > > -         if (!(dest->flags & IP_VS_DEST_F_AVAILABLE)
> > > -             || atomic_read(&dest->weight) <= 0
> > > -             || is_overloaded(dest, svc)) {
> > > -                 dest = __ip_vs_wlc_schedule(svc, iph);
> > > -                 if (dest == NULL) {
> > > -                         IP_VS_DBG(1, "no destination available\n");
> > > -                         return NULL;
> > > -                 }
> > > -                 atomic_dec(&en->dest->refcnt);
> > > -                 atomic_inc(&dest->refcnt);
> > > -                 en->dest = dest;
> > > -         }
> > > +         en->lastuse = jiffies;
> > 
> > Setting en->lastuse with a read lock held makes me feel uncomfortable.
> > Are you sure it is safe?
> 
> It looks suspicious, but word-size writes should be atomic by themselves.
> So even doing it under a read lock should not lead to corrupted values.
> We don't care if the values that are set concurrently are overwritten,
> because they should only differ by a couple of jiffies. Might be good to
> have someone else comment on the issue too.

Ok. I was only worried about the atomicity. I agree that having the
value rewritten and potentially slightly inaccurate is not a concern.

> > > +
> > > +         /*
> > > +          * If the destination is not available, i.e. it's in the trash,
> > > +          * ignore it, as it may be removed from under our feet
> > > +          */
> > > +
> > > +         if (!(dest->flags & IP_VS_DEST_F_AVAILABLE))
> > > +                 dest = NULL;
> > >   }
> > > - en->lastuse = jiffies;
> > 
> > At this point we know that dest exists and is not in the trash. We
> > know that dest can't disappear, given that it's not already in the
> > trash and our caller holds a read lock on __ip_vs_svc_lock. So dest
> > will be safe until after ip_vs_lblc_schedule() returns.
> > 
> > Dest seems ok :-)
> > 
> > Ok, that seems complex but non-racy to me :-)
> > 
> > Perhaps a longer comment would be in order.
> 
> Yeah, that's the point where we prevent the race with the trash purging.
> We could change the purging of destinations from the trash to happen with
> __ip_vs_svc_lock write locked; then we know that all destinations, even
> the ones in the trash, are valid. Might make more sense than duplicating
> this logic in other schedulers.

That sounds like a good way to simplify things.

> > > + read_unlock(&tbl->lock);
> > 
> > After the lock is released we may now lose en, but we don't care, as
> > it's not used again inside ip_vs_lblc_schedule().
> > 
> > So far so good :-)
> > 
> > > +
> > > + /* If the destination has a weight and is not overloaded, use it */
> > > + if (dest && atomic_read(&dest->weight) > 0 && !is_overloaded(dest, svc))
> > > +         goto out;
> > > +
> > > + /* No cache entry or it is invalid, time to schedule */
> > > + dest = __ip_vs_wlc_schedule(svc, iph);
> > > + if (!dest) {
> > > +         IP_VS_DBG(1, "no destination available\n");
> > > +         return NULL;
> > > + }
> > 
> > This new dest must also not be in the trash, by virtue of being
> > returned by __ip_vs_wlc_schedule() and deletions from svc->destinations
> > being protected by __ip_vs_svc_lock, which is held by the caller.
> > 
> > Also good :-)
> > 
> > BTW, the name of __ip_vs_wlc_schedule, really ought to be changed
> > to __ip_vs_lblc_schedule. I'll make a trivial patch for that.
> > 
> > > +
> > > + /* If we fail to create a cache entry, we'll just use the valid dest */
> > > + write_lock(&tbl->lock);
> > > + ip_vs_lblc_new(tbl, iph->daddr, dest);
> > > + write_unlock(&tbl->lock);
> > 
> > This write lock looks obviously correct as the destructors
> > do the same thing.
> > 
> > > +out:
> > >   IP_VS_DBG(6, "LBLC: destination IP address %u.%u.%u.%u "
> > >             "--> server %u.%u.%u.%u:%d\n",
> > > -           NIPQUAD(en->addr),
> > > +           NIPQUAD(iph->daddr),
> > >             NIPQUAD(dest->addr),
> > >             ntohs(dest->port));
> > >  
--
To unsubscribe from this list: send the line "unsubscribe lvs-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
