Re: PMTU-D: remember, your load balancer is broken (fwd)

To: Wensong Zhang <wensong@xxxxxxxxxxxx>
Subject: Re: PMTU-D: remember, your load balancer is broken (fwd)
Cc: Kyle Sparger <ksparger@xxxxxxxxxxxxxxxxxxxx>, lvs-users@xxxxxxxxxxxxxxxxxxxxxx
From: Julian Anastasov <uli@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 20 Jun 2000 08:59:11 +0300 (EEST)

On Mon, 19 Jun 2000, Wensong Zhang wrote:

> >     Yes, in 2.2 the packets must be restored, which is
> > not good practice, but this can hurt only masq applications
> > which change data (which may not be fatal, I'm not sure).
> > 
> It depends. We have to make sure that ip_fw_unmasq_icmp doesn't restore
> packets that don't need to be restored. Otherwise, it might hurt some
> other programs.

        Yes, this is already ok.

> > > Yeah, I agree with you. Netfilter probably has problems in calling
> > > icmp_send for already-mangled packets. I think there is a need to
> > > restore the IP header of the packet before calling icmp_send.
> > 
> >     I think with Netfilter many things look well
> > structured, but I'm not sure about the ICMP_FRAG_NEEDED
> > generation. The packet restoring must be avoided, if
> > possible, when it involves changes in the protocol data,
> > i.e. it is not very good for masq apps. I assume this is not
> > planned in the 2.3 world without masq apps.
> > 
> Yup, restoring packets is difficult, especially restoring data (the
> first 64 data bits of the datagram). The packet restoring should be
> avoided as far as possible, but I am not sure that all the errors can be
> detected before mangling packets.

        For now, only PMTU discovery raises this problem.

> I see that mangling packets will introduce many other problems, I now
> like the VS/DR and VS/TUN more. :)


> [snip]
> > 
> >     I think we have to answer these questions for 2.3:
> > 
> > - should we use header restoring or not. It must be planned
> > together with the masq apps support, if any, i.e. whether the
> > data will be changed too. It is very difficult to restore
> > data; for the header it is easy.
> > 
> > - how can we call each hook from PRE_ROUTING to revert its
> > header or data changes if each such hook returns NF_ACCEPT
> > instead of NF_STOLEN. It is not possible.
> > 
> I think that we had better try to avoid restoring packets if possible,
> because if the modified data is in the first 64 data bits and we cannot
> restore it to the original data, the ICMP message still cannot notify
> the client correctly.

        In fact, our address is ignored when sending the packet
to the client. With a little magic icmp_send delivers the packet
back to the client, but with the wrong encapsulated address. The
result: the client may not be able to determine the related
connection. We send raddr instead of maddr, sometimes 192.168.X.Y :)

> If we cannot avoid restoring packets, the only method seems to be
> duplicating the original packet before the mangled packet is sent out.
> But it will introduce too much overhead.


> >     The result:
> > 
> > - don't try to restore the header from icmp_send
> > 
> > - if something is changed, the hook must return NF_STOLEN
> > and process the packet itself: routing, MTU check, mangling
> > and forwarding
> > 
> > - return ICMP_FRAG_NEEDED before mangling. Here is the
> > problem (not for LVS): we must know the output device MTU
> > before mangling. But the kernels call ip_forward() when
> > the packets are ready to send, i.e. after mangling, without a
> > way to restore them.
> > 
> >     At least, these thoughts don't correspond to the
> > current packet filter hooks and the packet forwarding.
> > 
> >     But maybe I'm missing something. If the above is
> > correct, the "ext_mtu > int_mtu" problem can break any
> > design. LVS has to do these steps ignoring the current kernel
> > structure. This will improve the speed, though. The other
> > way is just not to solve this problem. That can be bad for
> > some guys with external gigabits and many internal megabits.
> > Is that true?
> > 
> I agree that we had better locate errors before mangling packets if
> possible. For VS/TUN and VS/DR in IPVS for kernel 2.3, we have already
> skipped some hooks and sent the packet immediately. For LVS/NAT, maybe
> we can try to detect all the errors before mangling the packet; if
> there is no problem, we can also send the mangled packet immediately,
> then the hooks such as ip_forward will not do the error detection
> again. Anyway, we will see.

        I'm not sure whether our in_get() call can replace the
routing cache lookup in ip_route_input for the LVS connections.
We would skip a hash table lookup for the subsequent packets, and
only the first packet would be checked with ip_route_input. Maybe
the time to look up the route cache is the same as the time to
look up with in_get (which we always call), i.e. if we play in
pre_routing we can save (n-1)/n of the calls to ip_route_input.
But we can't call ip_forward without calling ip_route_input, and
skipping it can result in skipping the FORWARD hook.

        I have a very bad idea: a sysctl var, module option or
compile-time define to select a FAST and a SLOW mode for LVS in
the Netfilter framework. FAST: do all the work ourselves (skip
ip_route_input, play as a FORWARD filter, etc.); SLOW: stay in
LOCAL_IN and call ip_output to check the MTU, maybe before
ip_forward. But there must be a flag, because we have to skip the
connection table lookups in the FORWARD chain. Very bad. All
these troubles for the MTU. If we want to forward gigabits we
must make some compromises with the packet filtering.

        Or maybe we can call the NF_HOOKs ourselves? Maybe we
shouldn't register in the Netfilter hooks at all: better to hook
ip_forward just like the dumb NAT, and to change the header after
the MTU check and before the NF_HOOK? That is the only good place
to demasq and masq. And we had better not hook pre_routing but
call ip_route_input with daddr=raddr for our connections. The
result: maybe we can hook ip_rcv_finish() and call ip_route_input
with a different daddr (after scheduling)? We can use skb
flag(s) too.


Julian Anastasov <uli@xxxxxxxxxxxxxxxxxxxxxx>
