Hello,
On Mon, 24 Nov 2025, Pablo Neira Ayuso wrote:
> On Sun, Oct 19, 2025 at 06:57:00PM +0300, Julian Anastasov wrote:
> > Some places walk the services under mutex but they can just use RCU:
> >
> > * ip_vs_dst_event() uses ip_vs_forget_dev() which uses its own lock
> > to modify dest
> > * ip_vs_genl_dump_services(): ip_vs_genl_fill_service() just fills skb
> > * ip_vs_genl_parse_service(): move RCU lock to callers
> > ip_vs_genl_set_cmd(), ip_vs_genl_dump_dests() and ip_vs_genl_get_cmd()
> > * ip_vs_genl_dump_dests(): just fill skb
> >
> > Signed-off-by: Julian Anastasov <ja@xxxxxx>
> > ---
> > net/netfilter/ipvs/ip_vs_ctl.c | 47 +++++++++++++++++-----------------
> > 1 file changed, 23 insertions(+), 24 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > index 2fb9034b4f53..b18d08d79bcb 100644
> > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> > @@ -1759,23 +1759,21 @@ static int ip_vs_dst_event(struct notifier_block *this, unsigned long event,
> >  	if (event != NETDEV_DOWN || !ipvs)
> >  		return NOTIFY_DONE;
> >  	IP_VS_DBG(3, "%s() dev=%s\n", __func__, dev->name);
> > -	mutex_lock(&ipvs->service_mutex);
> > +	rcu_read_lock();
>
First, thanks for the review!
> The control plane can still add destinations to svc->destinations that
> can then be skipped by the RCU walk. I think it should be possible to
> trigger leaks with a sufficiently large list, given that concurrent
> updates can happen. ip_vs_forget_dev() has its own lock, but this is a
> per-dest lock which is taken during the list walk.
Even if a dest is added, its dest_dst will be allocated
on the next packet, and this is not the real problem.
The race is different: the first flow clears
__LINK_STATE_START and IFF_UP, then notifies ip_vs_dst_event()
of NETDEV_DOWN, and only then notifies FIB. During
ip_vs_dst_event() the device is down but the routes are
not dead yet. A second flow can call do_output_route4() and
create dest->dest_dst again because the routes still
work, see fib_lookup_good_nhc(). Even adding a new dest can
trigger this, and it can happen immediately after the
mutex unlock in ip_vs_dst_event(). The problem is that
during the notification phase there is a gap where we can
attach new routes while the device is already marked down
but this is not yet propagated to FIB.
As we hold an implicit reference to the dev via the
dst, if due to the race we fail to drop a route, the dev
will be replaced with blackhole_netdev on
NETDEV_UNREGISTER. So, the leak lasts only until UNREGISTER
or until the next packet catches the dead route
in __ip_vs_dst_check(). And if the dest is deleted, we
always drop this route in __ip_vs_dst_cache_reset(), so
the leak is gone.
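For reference, this is how the next packet catches a dead
cached route; I am quoting __ip_vs_dst_check() from
ip_vs_xmit.c roughly from memory, so treat it as a sketch:

/* Cached dst is dropped when dst_check() reports it obsolete,
 * e.g. after the routes are flushed on device down/unregister.
 */
static inline struct ip_vs_dest_dst *
__ip_vs_dst_check(struct ip_vs_dest *dest)
{
	struct ip_vs_dest_dst *dest_dst = rcu_dereference(dest->dest_dst);
	struct dst_entry *dst;

	if (!dest_dst)
		return NULL;
	dst = dest_dst->dst_cache;
	if (dst->obsolete &&
	    dst->ops->check(dst, dest_dst->dst_cookie) == NULL)
		return NULL;	/* dead route, caller resolves a new one */
	return dest_dst;
}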
One way to make this rt caching more robust
is to add a netif_running() check next to the
IP_VS_DEST_F_AVAILABLE check that is done under dst_lock
after the do_output_route4() call. This way we synchronize
with both dest deletion and device closing. I'll
probably extend the patch with such a change in the coming
days, including the same for IPv6:
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index d57af95f1ebd..7b356ab8f439 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -295,6 +295,12 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
 	return true;
 }
 
+/* rt has device that is down */
+static bool rt_dev_is_down(const struct net_device *dev)
+{
+	return dev && !netif_running(dev);
+}
+
 /* Get route to destination or remote server */
 static int
 __ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
@@ -335,7 +341,8 @@ __ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb,
 	 * for very short period and it must be checked under
 	 * dst_lock.
 	 */
-	if (dest->flags & IP_VS_DEST_F_AVAILABLE)
+	if (dest->flags & IP_VS_DEST_F_AVAILABLE &&
+	    !rt_dev_is_down(rt->dst.dev))
 		__ip_vs_dst_set(dest, dest_dst, &rt->dst, 0);
 	else
 		noref = 0;
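For the IPv6 path the same check would go next to the
IP_VS_DEST_F_AVAILABLE test in __ip_vs_get_out_rt_v6(),
roughly like this (untested sketch, cookie handling as in the
current tree):

	/* Untested sketch for __ip_vs_get_out_rt_v6(), same idea */
	if (dest->flags & IP_VS_DEST_F_AVAILABLE &&
	    !rt_dev_is_down(rt->dst.dev))
		__ip_vs_dst_set(dest, dest_dst, &rt->dst,
				rt6_get_cookie(rt));
	else
		noref = 0;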
> > @@ -3915,9 +3911,12 @@ static int ip_vs_genl_set_cmd(struct sk_buff *skb, struct genl_info *info)
> >  	if (cmd == IPVS_CMD_NEW_SERVICE || cmd == IPVS_CMD_SET_SERVICE)
> >  		need_full_svc = true;
> >  
> > +	/* We use function that requires RCU lock */
> > +	rcu_read_lock();
>
> This is the ip_vs_genl_set_cmd() path, and the new per-netns mutex is
> held.
>
> I think __ip_vs_service_find() can now just access this mutex to check
> if it is held, using the fourth parameter:
>
> list_for_each_entry_rcu(..., lockdep_is_held(&ipvs->service_mutex))
>
> Then this rcu_read_lock() after mutex_lock(&ipvs->service_mutex) can
> be removed. I suspect you added it to quiet an RCU debugging splat.
Yep, that is why I added the comment above. I'll check this...
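Something like this in __ip_vs_service_find(), I suppose
(untested sketch; the bucket/table naming is assumed from the
current tree and may differ after the per-netns conversion):

	hash = ip_vs_svc_hashkey(ipvs, af, protocol, vaddr, vport);

	hlist_for_each_entry_rcu(svc, &ip_vs_svc_table[hash], s_list,
				 lockdep_is_held(&ipvs->service_mutex)) {
		...
	}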
Regards
--
Julian Anastasov <ja@xxxxxx>