Re: [RFC PATCH net-next 0/4] ipvs: check scheduler callback return value

To: Julian Anastasov <ja@xxxxxx>
Subject: Re: [RFC PATCH net-next 0/4] ipvs: check scheduler callback return values
Cc: <wensong@xxxxxxxxxxxx>, <horms@xxxxxxxxxxxx>, <lvs-devel@xxxxxxxxxxxxxxx>, <kernel-team@xxxxxx>
From: Alex Gartrell <agartrell@xxxxxx>
Date: Mon, 23 Feb 2015 23:51:05 -0800
Hey Julian,

Thank you for your quick and thoughtful response :)

On 2/23/15 11:30 PM, Julian Anastasov wrote:
We have a custom scheduler that is failing some atomic allocations

        You can use GFP_KERNEL from add_dest method.

That should definitely improve things for us.

        Can a new create_dest(svc, bool for_update) method
solve such problem? Such modifications are serialized, so
create_dest can attach new structure to sched_data just before
the add_dest/upd_dest call. IPVS does not clone dests on
update but scheduler may need this for RCU update.
create_dest should be called early in __ip_vs_update_dest.
Note that kfree(dest) is not enough, see ip_vs_dest_free(),
more fields may be allocated before __ip_vs_update_dest is

More context on the custom scheduler:

It's a consistent hash ring algorithm which requires sorting all of the members and then inserting them one at a time (a entry per weight). It requires a massive allocation to load balance to the number of machines we need to.

Right here I'm going to admit that this is not ideal. This is somewhat "inherited," but I'm currently stuck solving this problem so I'll make the best of it :)

So now we're in this position where adds, deletes, and updates all can force us to recalculate this giant hash ring. We can either simply lock the ring (but this actually takes long enough with a large number of real servers with large weights that it causes soft lock up detection!) or do RCU (a new allocation every time). We're trying out the RCU thing now which is where this hurt us.

One aside, if it's safe to call synchronize_rcu in that path, then it may solve our most painful allocation problem in part.

tl;dr; it's an implementation in which you allocate on add, update, and delete.

Patch 3:
        In the __ip_vs_update_dest change you are missing
        that ip_vs_rs_hash was called. In fact, I think we
        should call __ip_vs_unlink_dest(svc, dest, 0) and
        __ip_vs_del_dest(svc->net, dest, false)
        from ip_vs_add_dest() when both __ip_vs_update_dest(, 1)
        and ip_vs_new_dest() fail. This should work because
        add_dest call is the last thing we do on ADD, so
        the DEL operation but without calling del_dest should
        be ok. Such change invalidates Patch 1.

        Why above solution is preferred:

        At first look, scheduler that can return error from add_dest
        should not provide this dest to RCU readers via its schedule
        method and kfree(dest) looks ok.

        The problem is with the ip_vs_has_real_service() call
        and the ip_vs_rs_hash() call we do before add_dest,
        dest still can be accessed by RCU readers.
        For this reason we can not just kfree(dest).

Pending the rest of your reply I will reimplement it in this way.

Patch 4:
        As for del_dest result, failure should not happen,
we should be able to flush dests on module cleanup. I think, there
should not be any allocations in this case.

As mentioned above, we may need to reallocate in this path. But it does offend my sensibilities that this is the case, so I think it'd be fair to change the del_dest function to return void and we'd just find another way to do it.

Alex Gartrell <agartrell@xxxxxx>
To unsubscribe from this list: send the line "unsubscribe lvs-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at

<Prev in Thread] Current Thread [Next in Thread>