
MASQ/LVS connection state handling (was Re: help,question)

To: yphu <yphu@xxxxxxxxxxxx>
Subject: MASQ/LVS connection state handling (was Re: help,question)
Cc: <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>, ratz <ratz@xxxxxx>
From: Julian Anastasov <ja@xxxxxx>
Date: Sun, 24 Dec 2000 01:09:38 +0000 (GMT)
        Hello,

On Mon, 11 Dec 2000, yphu wrote:

> Hi,
> I am a Linux learner.
>
> I am learning about the virtual server now, and I have a question I
> hope you can help with. :)
> The question is: as we use a hash table to record an established network
> connection, how do we know when the data transmission of a connection
> is over, and when should we delete it from the hash table? Can you talk
> about it in detail?

        OK, here we'll analyze the LVS and, mostly, the MASQ transition
tables from net/ipv4/ip_masq.c. The LVS support adds some extensions to
the original MASQ code, but the handling is the same.

        First, three protocols are handled: TCP, UDP and ICMP.
The first one (TCP) has many states, each with a different timeout value,
most of them set to reasonable values corresponding to the
recommendations in some TCP-related RFC documents. For UDP and ICMP
there are other timeout values that try to keep both ends connected
for a reasonable time without creating many connection entries for each
packet.
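As a rough sketch of the per-state timeout idea (hypothetical C, not the
actual ip_masq.c code; only the ES, SR and CL values mentioned in this
mail are filled in, the rest of the real table lives in
net/ipv4/ip_masq.c):

```c
#include <assert.h>

/* Hypothetical subset of the MASQ TCP states (sketch, not kernel code):
 * ESTABLISHED, SYN_RECV and CLOSE. */
enum tcp_state { S_ES, S_SR, S_CL, S_NR_STATES };

/* One timeout per state, in seconds. ES (15 min), SR (60 s) and
 * CL (10 s) are the defaults mentioned in this mail; the real table
 * covers every TCP state. */
static const int tcp_timeouts[S_NR_STATES] = {
    [S_ES] = 15 * 60,
    [S_SR] = 60,
    [S_CL] = 10,
};

int timeout_for(enum tcp_state s) { return tcp_timeouts[s]; }
```

The point is only that the timer length is a function of the state, so a
state change also changes how long the entry may stay idle.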

        There are some rules that keep things working:

- When a packet is received for an existing connection, or when a new
connection is created, a timer is started/restarted for this connection.
The timeout used is selected according to the connection state.
If a packet is received for this connection (from either end)
the timer is restarted again (possibly after a state change). If no
packet is received during the selected period of time, the masq_expire()
function is called to try to release the connection entry. It is
possible for masq_expire() to restart the timer again for this connection
if the entry is used by other entries. This is the case for the templates
used to implement the persistent timeout: they occupy one entry
with the timer set to the value of the persistent time interval. There
are other cases, mostly in the MASQ code, where helper
connections are used and masq_expire() can't release the expired
connection because it is used by others.

- According to the direction of the packet we distinguish two cases:
INPUT, where the packet comes in the demasq direction (from the world),
and OUTPUT, where the packet comes from an internal host in the masq
direction.
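The timer rule above can be sketched like this (hypothetical types, not
the kernel's struct ip_masq; "refcnt" stands in for the template/helper
usage counting described above):

```c
#include <assert.h>

/* Sketch of the per-connection timer handling. */
struct conn {
    int state;
    int refcnt;   /* >0 while other entries (templates, helpers) use us */
    long expires; /* absolute expiry time; jiffies in the real code */
};

static long now; /* stands in for the kernel clock */

/* Every packet for the connection restarts the timer with the timeout
 * of the (possibly new) state. */
void conn_packet(struct conn *c, int new_state, int state_timeout)
{
    c->state = new_state;
    c->expires = now + state_timeout;
}

/* Called when the timer fires. An entry that is still referenced
 * (like a persistence template) is not freed; its timer is simply
 * restarted, as masq_expire() does. */
int conn_expire(struct conn *c, int state_timeout)
{
    if (c->refcnt > 0) {
        c->expires = now + state_timeout;
        return 0; /* keep the entry */
    }
    return 1; /* caller may free the entry */
}
```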

        Masq, masq. What does the masq direction mean for packets that
are not translated using NAT (masquerading), for example for
Direct Routing or Tunneling? The short answer is: there is no
masq direction for these two forwarding methods. It is explained
in the LVS docs. In short, we have packets in both directions
when NAT is used, and packets in only one direction (INPUT) when
DR or TUN is used. The packets are not demasqueraded for the DR and TUN
methods. LVS just hooks the LOCAL_IN chain, as the MASQ code is
privileged in Linux 2.2 to inspect the incoming traffic when the
routing decides that the traffic must be delivered locally. After some
hacking the demasquerading is avoided for these two methods, of course
after some changes in the packet and in its next destination: the
real servers. Don't forget that without LVS or MASQ rules these packets
would hit the local socket listeners.

        How are the connection states changed? Let's analyze, for
example, the masq_tcp_states table (yes, we analyze the TCP states,
as for UDP and ICMP it is trivial). The columns specify the current
state. The rows correspond to the TCP flag used to select the next TCP
state and its timeout. The TCP flag is selected by masq_tcp_state_idx().
This function analyzes the TCP header and decides which flag (if many
are set) is meaningful for the transition. The row (flag index) in the
state table is returned. masq_tcp_state() is called to change ms->state
according to the current ms->state and the TCP flag, looking in the
transition table. The transition table is selected according to
the packet direction, INPUT or OUTPUT. This lets us react differently
when packets come from different directions. This is explained later,
but in short the transitions are separated in such a way (between INPUT
and OUTPUT) that transitions to states with longer timeouts are
avoided when they are caused by packets coming from the world.
Everyone understands the reason for this: the world can flood us with
many packets that can eat all the memory in our box. And this is the
reason for this complex scheme of states and transitions. The
ideal case would be no different timeouts for the different
states, one timeout value for all TCP states as in UDP
and ICMP. Why not one for all these protocols? But the world is not
ideal. We try to give more time to the established connections, and
if they are active (they don't expire in the 15 minutes we give them
by default) they can live forever (at least until the next kernel
crash^H^H^H^H^Hupgrade).
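The lookup mechanics can be sketched like this (a toy three-state,
three-flag slice with made-up transitions and a made-up flag priority;
the full tables and the real masq_tcp_state_idx() are in
net/ipv4/ip_masq.c):

```c
#include <assert.h>

enum { S_SR, S_ES, S_CL, NR_STATES }; /* SYN_RECV, ESTABLISHED, CLOSE */
enum { F_SYN, F_ACK, F_RST, NR_FLAGS };

/* Columns: current state; rows: the flag picked from the TCP header. */
static const int input_table[NR_FLAGS][NR_STATES] = {
    /*            SR    ES    CL  */
    [F_SYN] = { S_SR, S_ES, S_SR },
    [F_ACK] = { S_ES, S_ES, S_CL },
    [F_RST] = { S_CL, S_CL, S_CL },
};

/* Mimics masq_tcp_state_idx(): decide which flag matters when several
 * are set (priority here is illustrative only). */
int flag_index(int syn, int ack, int rst)
{
    (void)ack; /* ACK is the fallback row */
    if (rst) return F_RST;
    if (syn) return F_SYN;
    return F_ACK;
}

/* Mimics masq_tcp_state(): look up the next state in the table
 * selected for this packet direction. */
int tcp_next_state(int cur, int syn, int ack, int rst)
{
    return input_table[flag_index(syn, ack, rst)][cur];
}
```

The real code has one such table per direction, which is exactly how the
INPUT and OUTPUT transitions get to differ.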

        How does LVS extend this scheme? For the DR and TUN methods
we have packets coming from the world only. We can't use the OUTPUT
table to select the next state (we don't see the packets coming from the
internal hosts). We are forced to relax our INPUT rules and to
switch to the state required by the external hosts :( No more
transitions driven by the trusted internal hosts. Only a box kept busy
by SYN floods, or by the more dangerous two-packet sequence of a
SYN and an ACK packet switching the connection to the established state,
the state with the longest timeout.

        For these two methods LVS introduces one more transition
table: the INPUT_ONLY table, which is used for the connections created
for the DR and TUN forwarding methods. The main goal: don't enter
the established state too easily - make it harder.
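So the table selection depends on the forwarding method, roughly like
this (a sketch with hypothetical names, not the LVS source):

```c
#include <assert.h>

enum fwd_method { FWD_NAT, FWD_DR, FWD_TUN };
enum table_id   { TBL_INPUT, TBL_OUTPUT, TBL_INPUT_ONLY };

/* NAT sees both directions and picks the table from the packet
 * direction; DR and TUN see client packets only, so every packet is
 * judged by the stricter INPUT_ONLY table. */
int pick_table(enum fwd_method m, int packet_is_output)
{
    if (m == FWD_DR || m == FWD_TUN)
        return TBL_INPUT_ONLY;
    return packet_is_output ? TBL_OUTPUT : TBL_INPUT;
}
```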

        Oh, maybe you're just reading the TCP specifications. There are
sequence numbers that both ends attach to each TCP packet, and you
don't see the masq or LVS code try to filter the packets according to
the sequence numbers. This can be fatal for some connections, as an
attacker can cause a state change by hitting a connection with a RST
packet, for example (ES->CL). The only info needed for this kind of
attack is the source and destination IP addresses and ports. Such
attacks are possible but not always fatal for the active connections.
The MASQ code tries to limit the damage by selecting minimal timeouts
that are enough for the active connections to resurrect. For example,
if the connection is hit by a TCP RST packet from an attacker, this
connection has 10 seconds to give evidence of its existence
by passing an ACK packet through the masq box.

        To make things more complex, and harder for an attacker trying
to block the masq box with many established connections, LVS extends
the NAT mode further (the INPUT and OUTPUT tables) by introducing
internal-server-driven state transitions: the secure_tcp defense
strategy. When it is enabled, the TCP flags in the client's packets can't
cause a switch to the established state without acknowledgement from
the internal end of this connection. secure_tcp changes the
transition tables and the state timeouts to achieve this goal. And
the mechanism is very simple: keep the connection in SR state with
a timeout of 10 seconds instead of the default 60 seconds used when
secure_tcp is not enabled. This trick relies on the different
defense power of the real servers. If they don't implement SYN
cookies, and so sometimes don't send a SYN+ACK because the incoming
SYN is dropped from their full backlog queue, the connection expires
in LVS after 10 seconds, assuming it is a connection created by an
attacker with one SYN packet not followed by another one as part of the
retransmissions provided by a real client's TCP stack.
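The secure_tcp idea in miniature (a deliberately simplified sketch with
hypothetical names; the real implementation is a different set of
transition tables, not special-case code like this):

```c
#include <assert.h>

enum { S_SR, S_ES }; /* SYN_RECV, ESTABLISHED */

struct conn { int state; int timeout; };

/* A client packet alone can move SR->ES only when secure_tcp is off.
 * With secure_tcp on, the connection stays in SR with a short 10 s
 * timeout until the real server confirms it. */
void client_packet(struct conn *c, int secure_tcp)
{
    if (c->state == S_SR && !secure_tcp)
        c->state = S_ES;
    c->timeout = (c->state == S_SR && secure_tcp) ? 10 : 15 * 60;
}

/* The SYN+ACK from the real server (seen in the OUTPUT direction,
 * NAT only) is what really promotes the connection. */
void server_synack(struct conn *c)
{
    c->state = S_ES;
    c->timeout = 15 * 60;
}
```

If the real server's backlog is full and no SYN+ACK ever comes, the
entry simply expires after its short SR timeout.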

        The good news is that all these timeout values can be
changed by the LVS users, but only when the secure_tcp strategy
is enabled. An SR timeout of 2 seconds is a very good value for
LVS clusters with real servers that don't implement SYN cookies:
no SYN+ACK from the real server - drop the entry in the LVS box.
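Assuming the /proc interface of the 2.2 LVS patch (the exact sysctl
names may differ in your version, so check your kernel tree), turning
the strategy on and shrinking the SR timeout could look like:

```shell
# Assumed paths from the Linux 2.2 LVS patch; verify before use.
# Enable the secure_tcp defense strategy:
echo 1 > /proc/sys/net/ipv4/vs/secure_tcp
# Shorten the SYN_RECV timeout (seconds), e.g. for real servers
# without SYN cookies:
echo 10 > /proc/sys/net/ipv4/vs/timeout_synrecv
```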

        The bad news is, of course, for the DR and TUN methods.
They can't benefit from the internal-server-driven mechanism.
There are other defense strategies that help when using these
methods. And, of course, all these defense strategies keep the
LVS box with free memory for more new connections. There is no
known way to pass only the valid requests to the internal servers.
This is because the real servers don't provide information to
the LVS box, and we don't know which packet is dropped or accepted
by the socket listener. We can know this only by receiving an ACK
packet from the internal server when the three-way handshake is
completed and the client is identified by the internal server
as a valid client, not a spoofed one. This is possible only for the
NAT method.

HTH!

> Thanks a lot!
>
> Best regards
>
> ypHu


P.S. Ratz, please surprise me with other wonders ASAP. I just understood
what "HTH" means, and it was just appended to my list of words
downloaded from tux :)

BTW, in my defense I just executed
mv /usr/bin/dc /usr/bin/dc.ratz_can_damage_your_brain, so no more
dc wonders, please :)


Regards

--
Julian Anastasov <ja@xxxxxx>


