Large HTTP GET/POST revisited (and solved)

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Large HTTP GET/POST revisited (and solved)
From: Casey Zacek <cz@xxxxxxxxxxxx>
Date: Fri, 11 Mar 2005 00:21:46 -0600
This is an old topic, and there has been much debate about the solution.
For those who aren't aware of it, the description can be found here:

http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-Tun.html

Section 7.7. packets bigger than MTU.

I've emailed about this before, and apparently nothing we came up
with ever really worked.  The real problem I've always had is that
I've never had a means of reproducing it (possibly because I didn't
fully understand the problem -- I can probably reproduce it at will
now), and my customers have eventually either accepted it and moved
on or changed to an LVS-NAT environment.

Well, I finally came across someone whose home network was set up in
such a way as to experience the "problem", so I decided to figure it
out once and for all and hopefully end all the confusion.

tcpdump is my friend.  I started out running tcpdump on the director:

23:13:52.804610 IP (tos 0x0, ttl 116, id 26413, offset 0, flags [DF], length: 48) CLIENT-IP.60964 > VIRTUAL-IP.80: S [tcp sum ok] 3288780265:3288780265(0) win 65535 <mss 1452,nop,nop,sackOK>
23:13:52.810423 IP (tos 0x0, ttl 116, id 26415, offset 0, flags [DF], length: 40) CLIENT-IP.60964 > VIRTUAL-IP.80: . [tcp sum ok] 3288780266:3288780266(0) ack 2303765635 win 65535
23:13:52.813943 IP (tos 0x0, ttl 116, id 26416, offset 0, flags [DF], length: 602) CLIENT-IP.60964 > VIRTUAL-IP.80: P [tcp sum ok] 0:562(562) ack 1 win 65535
23:13:52.820802 IP (tos 0x0, ttl 116, id 26417, offset 0, flags [DF], length: 1492) CLIENT-IP.60964 > VIRTUAL-IP.80: . [tcp sum ok] 562:2014(1452) ack 1 win 65535
23:13:52.820887 IP (tos 0xc0, ttl  64, id 25185, offset 0, flags [none], length: 576) VIRTUAL-IP > CLIENT-IP: icmp 556: VIRTUAL-IP unreachable - need to frag (mtu 1480) for IP (tos 0x0, ttl 116, id 26417, offset 0, flags [DF], length: 1492) CLIENT-IP.60964 > VIRTUAL-IP.80: . 562:2014(1452) ack 1 win 65535
23:13:52.827175 IP (tos 0x0, ttl 116, id 26419, offset 0, flags [DF], length: 1492) CLIENT-IP.60964 > VIRTUAL-IP.80: . [tcp sum ok] 2014:3466(1452) ack 90 win 65446
23:13:52.827251 IP (tos 0xc0, ttl  64, id 25186, offset 0, flags [none], length: 576) VIRTUAL-IP > CLIENT-IP: icmp 556: VIRTUAL-IP unreachable - need to frag (mtu 1480) for IP (tos 0x0, ttl 116, id 26419, offset 0, flags [DF], length: 1492) CLIENT-IP.60964 > VIRTUAL-IP.80: . 2014:3466(1452) ack 90 win 65446
23:13:52.833420 IP (tos 0x0, ttl 116, id 26420, offset 0, flags [DF], length: 1492) CLIENT-IP.60964 > VIRTUAL-IP.80: . [tcp sum ok] 3466:4918(1452) ack 90 win 65446

The client keeps sending TCP packets with [DF] set to the virtual IP
(packet length 1492 -- too big), and IPVS keeps answering each one
with an ICMP response, until the request eventually times out.  This
message is logged every time one of the ICMP responses is sent:

IPVS: ip_vs_tunnel_xmit(): frag needed
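
If you want to spot this condition in a saved capture, something
like the following Python sketch might help.  It only assumes that
the ICMP lines contain the text "need to frag (mtu N)" as they do in
the trace above; other tcpdump versions may word it differently, the
function name is made up, and the sample lines are abbreviated
copies of the trace.

```python
import re

# Matches the wording tcpdump used in the trace above (an assumption
# for other tcpdump versions).
NEED_FRAG = re.compile(r"need to frag \(mtu (\d+)\)")

def frag_needed_mtus(lines):
    """Yield the MTU reported by each 'need to frag' ICMP line."""
    for line in lines:
        match = NEED_FRAG.search(line)
        if match:
            yield int(match.group(1))

# Abbreviated lines from the trace; "(...)" stands for elided fields.
sample = [
    "23:13:52.820802 IP (...) CLIENT-IP.60964 > VIRTUAL-IP.80: . 562:2014(1452) ack 1 win 65535",
    "23:13:52.820887 IP (...) VIRTUAL-IP > CLIENT-IP: icmp 556: VIRTUAL-IP unreachable - need to frag (mtu 1480) for IP (...)",
]
print(list(frag_needed_mtus(sample)))
```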

The problem comes when the ICMP destination-unreachable
(fragmentation needed) packets are ignored or dropped and never
acted upon by the client.  This is a more common situation than I
thought would be the case.

A few hours of debugging later, I realized that the SYN+ACK packet,
the response from the real server that continues the connection
handshake, was missing from my capture (with LVS-Tun the real server
replies directly to the client, bypassing the director).  Duh.  I
moved my tcpdumping to a tap in the network that I knew would see
all of the traffic.  The SYN+ACK packet sets the MSS (max segment
size -- the size of the data segment in each packet on this
connection) to 1452, just as the client machine requests (the first
packet in the earlier trace).

Duh!  I had read all the stuff on the URL above, and this paragraph
comes closest to describing the solution:

    Chris Paul Chris (at) baonline (dot) co (dot) uk 27 May 2004 

    You have to change the mtu value on the end of the IP tunnel that
    initiates the tunnel i.e. the realserver (in this instance, a W2K
    box). This value should be close to the mtu value of the physical
    interface it is going through, but small enough to ensure there is
    enough space left for the ipip header. We use 1400 and have never
    had any reports of it failing. To do this you goto registry and
    add a dword entry called MTU with the decimal value 1400 (safe)
    into 

    hklm\system\currentcontrolset\services\tcpip\parameters\interfaces\{guid of ip tunnel}

    reboot

In reality, it's not "the end of the IP tunnel that initiates the
tunnel" because the tunnel interface on the W2k box doesn't initiate
anything -- it only receives forwarded traffic from the director.
What he really means is "the interface on the real server that is
handshaking the TCP connection with the client."  The goal is to get
the client to send smaller packets so that they'll make it on to the
realserver.

CLIENT sends SYN to DIRECTOR
DIRECTOR encapsulates SYN packet in IPIP tunnel; sends to REALSERVER
REALSERVER receives SYN packet on LOOPBACK interface
REALSERVER sends SYNACK to CLIENT from LOOPBACK interface w/ MSS=1452
CLIENT sends ACK to DIRECTOR, on to REALSERVER
REALSERVER responds to CLIENT from LOOPBACK
repeat until dead

So, we have to change the MSS that gets sent back from realserver to
client.  That is, set the MTU on the loopback interface on the Win2k
box, since that is the interface doing the handshake and the MSS it
advertises follows from its MTU.

The solution is to do exactly what Chris Paul Chris said, except
change:

hklm\system\currentcontrolset\services\tcpip\parameters\interfaces\{guid of ip tunnel}

to:

hklm\system\currentcontrolset\services\tcpip\parameters\interfaces\{guid of MS Loopback Adapter}

Besides, I've found that an MTU set on the IP tunnel interface this
way won't be there after you reboot.

Oh, and 1480 is the magic number.  1400 is safe, but 1480 is the
highest value that works.  Any higher than that, and it doesn't work
as desired.
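
The arithmetic behind those numbers, as a rough Python sketch.  I'm
assuming 20-byte IP and TCP headers without options, a 20-byte IPIP
outer header, and a 1500-byte MTU on the director-to-realserver
path; none of that is measured here, and the function names are made
up.

```python
IP_HEADER = 20      # IPv4 header without options (assumed)
TCP_HEADER = 20     # TCP header without options (assumed)
IPIP_OVERHEAD = 20  # outer IPv4 header added by the tunnel (assumed)

def mss_for_mtu(mtu):
    """MSS a host would advertise for an interface with this MTU."""
    return mtu - IP_HEADER - TCP_HEADER

def fits_tunnel(mtu, link_mtu=1500):
    """True if a full-size packet still fits once IPIP-encapsulated."""
    return mtu + IPIP_OVERHEAD <= link_mtu

for mtu in (1400, 1480, 1492, 1500):
    print(mtu, mss_for_mtu(mtu), fits_tunnel(mtu))
```

An MTU of 1492 gives the MSS of 1452 seen in the trace and does not
fit; 1480 gives an MSS of 1440 and is the largest value that fits.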

So I went to investigate how to do the same thing on my Linux real
servers, only to find that the tunl0 interface, which is the
connection endpoint for Linux realservers, already has an MTU of 1480.
I don't know when that got fixed, but I guess I won't worry about it.
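
If you want to check your own Linux realservers, a small Python
sketch like this reads the MTU from sysfs.  It assumes the usual
/sys/class/net/<interface>/mtu layout; the function name and the
sysfs_root parameter are made up for the example.

```python
from pathlib import Path

def interface_mtu(name, sysfs_root="/sys/class/net"):
    """Return the MTU of a network interface, or None if it is absent."""
    mtu_file = Path(sysfs_root) / name / "mtu"
    try:
        return int(mtu_file.read_text().strip())
    except FileNotFoundError:
        return None

# Prints None if the tunl0 interface does not exist on this machine.
print("tunl0 MTU:", interface_mtu("tunl0"))
```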

What really sucks is that now that I "get it," it seems like I should
have figured this out ages ago.

-- 
Casey Zacek
Senior Engineer
NeoSpire, Inc.
