Re: [lvs-users] NFCT and PMTU

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [lvs-users] NFCT and PMTU
From: lvs@xxxxxxxxxx
Date: Mon, 10 Sep 2012 23:26:13 +0100 (BST)

Comments inline below.

Thanks
Tim

On Tue, 11 Sep 2012, Julian Anastasov wrote:

>
>       Hello,
>
> On Mon, 10 Sep 2012, lvs@xxxxxxxxxx wrote:
>
>> I have a number of LVS directors running a mixture of CentOS 5 and CentOS
>> 6 (running kernels 2.6.18-238.5.1 and 2.6.32-71.29.1). I have applied the
>> ipvs-nfct patch to the kernel(s).
>>
>> When I set /proc/sys/net/ipv4/vs/conntrack to 1 I have PMTU issues. When
>> it is set to 0 the issues go away. The issue is when a client on a network
>> with a <1500 byte MTU connects. One of my real servers replies to the
>> clients request with a 1500 byte packet and a device upstream of the
>> client will send an ICMP must fragment. When conntrack=0 the director
>> passed the (modified) ICMP packet on to the client. When conntrack=1 the
>> director doesn't send an ICMP to the real server. I can toggle conntrack
>> and watch the PMTU work and not work.
>
>       I can try to reproduce it with recent kernel.
> Can you tell me what forwarding method is used? NAT? Do
> you have a test environment, so that you can see what
> is shown in logs when IPVS debugging is enabled?
>
>       Do you mean that when conntrack=0 ICMP is forwarded
> back to client instead of being forwarded to real server?
>

I am using NAT as the forwarding method. I do have a test environment, 
with a CentOS 6 director.

When conntrack=0 the ICMP must-fragment is NATed and sent to the real 
server, so PMTU discovery works perfectly. Sorry, I meant real server in 
my original message but wrote client.
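
In case it helps with reproducing this, the only knob I touch between test 
runs is the IPVS conntrack sysctl (plus debug_level, which I believe only 
exists when the kernel is built with CONFIG_IP_VS_DEBUG). A minimal sketch 
of the toggle in C, equivalent to echoing the values into the /proc files 
by hand:

/* toggle_nfct.c - flip net.ipv4.vs.conntrack between test runs and,
 * if available, turn up IPVS debugging. */
#include <stdio.h>
#include <stdlib.h>

static int write_proc(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fputs(val, f);
        fclose(f);
        return 0;
}

int main(int argc, char **argv)
{
        /* ./toggle_nfct 1 -> conntrack on, ./toggle_nfct 0 -> off */
        const char *on = (argc > 1 && atoi(argv[1])) ? "1" : "0";

        write_proc("/proc/sys/net/ipv4/vs/conntrack", on);
        /* Only present with CONFIG_IP_VS_DEBUG; failing here is harmless. */
        write_proc("/proc/sys/net/ipv4/vs/debug_level", "6");
        return 0;
}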

>       Now I remember for some problems with ICMP:
>
> - I don't see this change in 2.6.32-71.29.1:
>
> commit b0aeef30433ea6854e985c2e9842fa19f51b95cc
> Author: Julian Anastasov <ja@xxxxxx>
> Date:   Mon Oct 11 11:23:07 2010 +0300
>
>    nf_nat: restrict ICMP translation for embedded header
>
>       Skip ICMP translation of embedded protocol header
>    if NAT bits are not set. Needed for IPVS to see the original
>    embedded addresses because for IPVS traffic the IPS_SRC_NAT_BIT
>    and IPS_DST_NAT_BIT bits are not set. It happens when IPVS performs
>    DNAT for client packets after using nf_conntrack_alter_reply
>    to expect replies from real server.
>
>    Signed-off-by: Julian Anastasov <ja@xxxxxx>
>    Signed-off-by: Simon Horman <horms@xxxxxxxxxxxx>
>
> diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
> index e2e00c4..0047923 100644
> --- a/net/ipv4/netfilter/nf_nat_core.c
> +++ b/net/ipv4/netfilter/nf_nat_core.c
> @@ -462,6 +462,18 @@ int nf_nat_icmp_reply_translation(struct nf_conn *ct,
>                       return 0;
>       }
>
> +     if (manip == IP_NAT_MANIP_SRC)
> +             statusbit = IPS_SRC_NAT;
> +     else
> +             statusbit = IPS_DST_NAT;
> +
> +     /* Invert if this is reply dir. */
> +     if (dir == IP_CT_DIR_REPLY)
> +             statusbit ^= IPS_NAT_MASK;
> +
> +     if (!(ct->status & statusbit))
> +             return 1;
> +
>       pr_debug("icmp_reply_translation: translating error %p manip %u "
>                "dir %s\n", skb, manip,
>                dir == IP_CT_DIR_ORIGINAL ? "ORIG" : "REPLY");
> @@ -496,20 +508,9 @@ int nf_nat_icmp_reply_translation(struct nf_conn *ct,
>
>       /* Change outer to look the reply to an incoming packet
>        * (proto 0 means don't invert per-proto part). */
> -     if (manip == IP_NAT_MANIP_SRC)
> -             statusbit = IPS_SRC_NAT;
> -     else
> -             statusbit = IPS_DST_NAT;
> -
> -     /* Invert if this is reply dir. */
> -     if (dir == IP_CT_DIR_REPLY)
> -             statusbit ^= IPS_NAT_MASK;
> -
> -     if (ct->status & statusbit) {
> -             nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
> -             if (!manip_pkt(0, skb, 0, &target, manip))
> -                     return 0;
> -     }
> +     nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
> +     if (!manip_pkt(0, skb, 0, &target, manip))
> +             return 0;
>
>       return 1;
> }
>
>       If this patch does not help we have to debug it
> somehow.

I will give this a go tomorrow. I just need to find a client on a network 
with a <1500 byte MTU! I guess I will have to make one.
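
In case it is useful to anyone else trying to reproduce this: the plan is 
simply to drop the MTU on the client-facing interface of a Linux box 
sitting between the director and the test client, so it has to emit ICMP 
fragmentation-needed for the real servers' 1500-byte replies. A rough 
sketch, assuming that intermediate box is Linux, doing the same thing as 
"ip link set dev ethX mtu 1400":

/* set_mtu.c - shrink an interface MTU so the hop in front of the test
 * client generates ICMP "fragmentation needed" for 1500-byte packets. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(int argc, char **argv)
{
        struct ifreq ifr;
        int fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <ifname> <mtu>\n", argv[0]);
                return 1;
        }

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) {
                perror("socket");
                return 1;
        }

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, argv[1], IFNAMSIZ - 1);
        ifr.ifr_mtu = atoi(argv[2]);

        if (ioctl(fd, SIOCSIFMTU, &ifr) < 0) {
                perror("SIOCSIFMTU");
                close(fd);
                return 1;
        }

        close(fd);
        return 0;
}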

>
>> I would happily leave conntrack off, but it has a huge performance impact.
>> With my traffic profile the softirq load doubles when I turn off
>> conntrack. My busiest director is doing 2.1Gb of traffic and with
>> conntrack off it can probably only handle 2.5Gb.
>
>       It is interesting to know about such comparison
> for conntrack=0 and 1. Can you confirm again both numbers?
> 2.1 is not better than 2.5.

My busiest director is processing 2.1Gb of traffic in the peak hour. My 
projection is that with conntrack=0, softirq will hit 100% on one or more 
cores at around 2.5Gb and the director will start using ethernet flow 
control Xoffs to limit throughput.

With conntrack=1 I project softirq won't reach 100% until the traffic 
level hits 5-6Gb. These projections are based on the softirq load at 
various traffic levels and assume load scales linearly. Linear scaling is 
a fairly safe assumption: I have used a traffic generator to test my 
directors (up to CentOS 5) and load scaled perfectly linearly (when using 
MSI-X interrupts). The generated traffic matched my production traffic in 
terms of latency, average packet size, number of source IPs, packet loss 
and, most importantly, CPU load per Mb. Sadly my traffic generator tops 
out at 600Mb with this real-world traffic profile, but then it is just a 
Linux server running some very expensive software.

Under CentOS 3 (traditional interrupts), with SMP affinity set to all 
cores (or rather half the cores for the external NIC and half for the 
internal NIC), load scaled linearly until it fell off a cliff: load hit 
100% and more generated traffic resulted in no more throughput (lots of 
Xoffs). I also have some old data showing NFCT improving performance on 
CentOS 3.

Looking at my monitoring graphs for one director, when I flipped conntrack 
from 1 to 0 the overall traffic in the peak hour stayed at 1.4Gb while 
softirq load on the busiest core rose from around 43% to around 62%. 
Average softirq load across all cores rose from 27% to 40%. I realise 
these figures don't tie up with those higher up, but this is a different 
director with a different mix of services. I have another director with no 
email traffic doing 1.1Gb and only 15% softirq on the busiest core. Email 
is expensive to process!
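
To show the arithmetic behind those projections (using the 1.4Gb 
director's busiest-core figures above and assuming softirq load really 
does scale linearly with traffic):

/* project.c - naive linear extrapolation of the saturation point:
 * the traffic level at which the busiest core hits 100% softirq. */
#include <stdio.h>

static double saturation_gb(double traffic_gb, double softirq_pct)
{
        return traffic_gb * 100.0 / softirq_pct;
}

int main(void)
{
        /* 1.4Gb director: 43% busiest-core softirq with conntrack=1,
         * 62% with conntrack=0 (figures from the graphs above). */
        printf("conntrack=1 saturates around %.1fGb\n",
               saturation_gb(1.4, 43.0));  /* ~3.3Gb */
        printf("conntrack=0 saturates around %.1fGb\n",
               saturation_gb(1.4, 62.0));  /* ~2.3Gb */
        return 0;
}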

>
>> I am hoping that this issue has been observed and fixed and someone will
>> be able to point me to the patch so I can back port it to my kernels (or
>> finally get rid of CentOS 5!).
>>
>> Thanks
>> Tim
>
> Regards
>
> --
> Julian Anastasov <ja@xxxxxx>
>

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Send requests to lvs-users-request@xxxxxxxxxxxxxxxxxxxxxx
or go to http://lists.graemef.net/mailman/listinfo/lvs-users
