Making LVS work with Netfilter's connection tracking
====================================================

The two attached patches modify the kernel and the ipvs modules
in such a way that ipvs NAT connections are correctly tracked by
the Netfilter connection-tracking code.  This means that
firewalling rules can be put in place to allow incoming
connections to a virtual service, and then by allowing
ESTABLISHED and RELATED packets to pass the FORWARD chain, we
achieve stateful firewalling of these connections.

For example, if director 4.3.2.1 is offering a virtual service
on TCP port 8899, we can do

iptables -A INPUT -p tcp -d 4.3.2.1 --dport 8899 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT

and get the desired behaviour.  Note that the second rule (the
one in the FORWARD chain) covers all virtual services offered by
the same director, so if another service is offered on port
9900, the complete set of rules required would be

iptables -A INPUT -p tcp -d 4.3.2.1 --dport 8899 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp -d 4.3.2.1 --dport 9900 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT

i.e., one rule in the INPUT chain is required per virtual
service, but the rule in the FORWARD chain covers all virtual
services.


Patches required
================

Patch to main kernel source
---------------------------

There is a small change required to the kernel patch.  The stock
kernel patch which comes with the ip_vs distribution just adds a
few EXPORT_SYMBOL()s to ksyms.c.  For the Netfilter
connection-tracking functionality, we need a bit more.  The
files affected, and reasons, are:

ip_conntrack_core.c:
  init_conntrack():
    Mark even more clearly that the newly-created
    connection-tracking entry is not in the hash tables.  This
    change isn't strictly necessary but makes assertion-checking
    easier.
  
ip_conntrack_standalone.c:
  Export the symbol __ip_conntrack_confirm().  I didn't really
  like the idea of exporting a symbol starting with
  double-underscore, but nothing too bad seems to have happened.
  The function seems to take care of reference-counting, so I
  think we're OK here.

ip_nat_core.c:
  ip_nat_replace_in_hashes():  (new function)
    Exported wrapper round replace_in_hashes() which deals with
    the locking on ip_nat_lock.

ip_nat_standalone.c:
  Export the new ip_nat_replace_in_hashes() function.

ip_nat.h:
  Declare the new ip_nat_replace_in_hashes() function.

More explanation below.


Patch to ip_vs code
-------------------

ip_vs_app.c:
  skb_replace():
    Copy debugging information across to the new skb, if
    debugging is enabled.  This is a separate issue from the
    main connection-tracking patch, but it was causing spurious
    warnings about which hooks an skb had passed through.

ip_vs_conn.c:
  Include some netfilter header files.

  Declare a new function ip_vs_deal_with_conntrack().

  ip_vs_nat_xmit():
    Code to make sure that Netfilter's connection-tracking entry
    is correct.

  ip_vs_deal_with_conntrack():  (new function)
    The guts of the new functionality.  Changes the data inside
    the Netfilter connection-tracking entry to match the actual
    packet flow.

ip_vs_core.c:
  route_me_harder():  (new function)
    Copied from ip_nat_standalone.c.  Code to re-make the
    routing decision for a packet, treated as locally-generated.

  ip_vs_out():
    Separate from the connection-tracking code changes, don't
    send ICMP unreachable messages.  This has been discussed on
    the list recently and I think the consensus was that this
    change is OK.  The sysctl method would be better though, so
    ignore this bit.

    Also call route_me_harder() to decide whether the outbound
    packet needs to be routed differently now it is supposed to
    be coming from the director machine itself.

  ip_vs_in():
    When checking if a packet might be trying to start a new
    connection, check that it has SYN but not ACK.  Previously,
    the only check was that it had SYN set.

    If there is a new connection being attempted, check for
    consistency between Netfilter's connection-tracking table
    and LVS's.  More explanation of this bit below.

ip_vs_ftp.c:
  Include the Netfilter header files.

  Declare new function ip_vs_ftp_expect_callback().

  ip_vs_ftp_out():
    Once we have noticed that a passive data-transfer connection
    has been negotiated at application level, tell Netfilter to
    expect this connection and so treat it as RELATED.

  ip_vs_ftp_in():
    Once we have noticed that an active data-transfer connection
    has been negotiated at application level, tell Netfilter to
    expect this connection and so treat it as RELATED.

  ip_vs_ftp_expect_callback():  (new function)
    When the RELATED packet arrives (for a data-transfer
    connection), update Netfilter's connection-tracking entry
    for the connection.


Explanation
===========

General connections (i.e., not FTP)
-----------------------------------

Each entry in Netfilter's connection-tracking table has two
tuples describing source and destination addresses and ports.
One of these tuples is the ORIG tuple, and describes the
addressing of packets travelling in the "original" direction,
i.e., from the machine that initiated the connection to the
machine that responded.  The other is the REPLY tuple, which
describes the addressing of packets travelling in the "reply"
direction, i.e., from the responding machine to the initiating
machine.  Normally, the REPLY tuple is just the "inverse" of the
ORIG tuple, i.e., has its source and destination reversed.  But
for LVS connections, this is not the case.  This is one of the
things that causes problems when using the unmodified Netfilter
code with IPVS connections.
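
As a rough illustration of the ORIG/REPLY relationship described
above, here is a small user-space sketch in C.  The struct and
function names (struct tuple, tuple_invert(), tuple_equal()) are
invented for this example and are much simpler than Netfilter's
real ip_conntrack_tuple; the sketch only shows what "inverse"
means for an ordinary connection.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for a connection-tracking
 * tuple: just addresses and ports, no protocol field. */
struct tuple {
    uint32_t src_ip;
    uint16_t src_port;
    uint32_t dst_ip;
    uint16_t dst_port;
};

/* The "inverse" of a tuple: source and destination swapped. */
static struct tuple tuple_invert(struct tuple t)
{
    struct tuple r = { t.dst_ip, t.dst_port, t.src_ip, t.src_port };
    return r;
}

/* Two tuples are equal when all four fields match. */
static int tuple_equal(struct tuple a, struct tuple b)
{
    return a.src_ip == b.src_ip && a.src_port == b.src_port &&
           a.dst_ip == b.dst_ip && a.dst_port == b.dst_port;
}
```

For a normal connection, the REPLY tuple equals
tuple_invert(ORIG), and inverting twice gives back ORIG.  For an
LVS-NAT connection the real reply comes from a different address,
so this simple relationship no longer holds.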

The following is roughly what happens with the unmodified code
for the start of a TCP connection to a virtual service.  Suppose
we have


   +--------+
   | Client |
   +--------+
     (CIP)       <-- Client's IP address
       |
       |
  { internet }  
       |
       |
     (VIP)       <-- Virtual IP address
  +----------+
  | Director |
  +----------+
     (PIP)       <-- Director's Private IP address
       |
       |
     (RIP)       <-- Real (server's) IP address
 +-------------+
 | Real server |
 +-------------+


Then the client sends a packet to VIP:VPORT, say

  CIP:CPORT -> VIP:VPORT

Netfilter on the director makes a note of this packet, and sets
up a temporary connection-tracking entry with tuples as follows:

  ORIG:  CIP:CPORT -> VIP:VPORT
  REPL:  VIP:VPORT -> CIP:CPORT

(the "src-ip:src-port -> dest-ip:dest-port" notation is
hopefully clear enough).  We will call a connection-tracking
entry a "CTE" from now on.

LVS notices (in ip_vs_in(), called as part of the LOCAL_IN
hook) that VIP:VPORT is something it's interested in, grabs the
packet, re-writes it to be addressed

  CIP:CPORT -> RIP:RPORT

and sends it on its way by means of ip_send().  As a result, the
POST_ROUTING hook gets called, and ip_vs_post_routing() gets a
look at the packet.  It notices that the packet has been marked
as belonging to LVS, and calls the (*okfn), sending the packet
to the wire without further ado.

When it has been transmitted, the reference count on the CTE
falls to zero, and it is deleted.  (This is partly a guess, but
I think it is right.)  Normally, CTEs avoid this fate because
__ip_conntrack_confirm() is called for them, either via
ip_confirm() as a late hook in LOCAL_IN, or through ip_refrag()
called as a late hook in POST_ROUTING.  "Confirming" the CTE
involves linking it into some hash tables, and ensuring it isn't
deleted.

So this is the first problem --- the CTE is not "confirmed".

Suppose we confirmed the connection.  Then when the Real Server
replies to this packet, it sends a packet addressed as

  RIP:RPORT -> CIP:CPORT

to the director (because the Director is the router for such
packets, as seen by the Real Server).  Then the
connection-tracking code in Netfilter on the director tries to
look up the CTE for this packet, but can't find one.  The CTE we
/want/ it to match says

  ORIG:  CIP:CPORT -> VIP:VPORT
  REPL:  VIP:VPORT -> CIP:CPORT

with no mention of the RIP:RPORT.  So this reply packet gets
labelled as "NEW", whereas we wanted it to be labelled as
"ESTABLISHED".

So as well as confirming the CTE, we also need to alter the
REPLY tuple so that it will match the

  RIP:RPORT -> CIP:CPORT

packet the Real Server sends back.  Then everything will work.
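
The mismatch, and the effect of altering the REPLY tuple, can be
sketched in user-space C as follows.  This is only a toy model:
struct cte, cte_match() and cte_alter_reply() are names made up
for the example, and the real Netfilter lookup goes through hash
tables rather than comparing against a single entry.

```c
#include <stdint.h>

/* Hypothetical simplified tuple and connection-tracking entry. */
struct tuple {
    uint32_t src_ip;
    uint16_t src_port;
    uint32_t dst_ip;
    uint16_t dst_port;
};

struct cte {
    struct tuple orig;
    struct tuple reply;
};

enum { DIR_NONE = -1, DIR_ORIG = 0, DIR_REPLY = 1 };

static int tuple_equal(const struct tuple *a, const struct tuple *b)
{
    return a->src_ip == b->src_ip && a->src_port == b->src_port &&
           a->dst_ip == b->dst_ip && a->dst_port == b->dst_port;
}

/* Which direction of the CTE, if any, does this packet belong to? */
static int cte_match(const struct cte *c, const struct tuple *pkt)
{
    if (tuple_equal(&c->orig, pkt))
        return DIR_ORIG;
    if (tuple_equal(&c->reply, pkt))
        return DIR_REPLY;
    return DIR_NONE;   /* would be labelled NEW */
}

/* Toy analogue of altering the reply tuple of an entry. */
static void cte_alter_reply(struct cte *c, const struct tuple *new_reply)
{
    c->reply = *new_reply;
}
```

With the default REPLY tuple (VIP:VPORT -> CIP:CPORT), a packet
addressed RIP:RPORT -> CIP:CPORT gives DIR_NONE; after
cte_alter_reply() with the RIP:RPORT -> CIP:CPORT tuple, the same
packet gives DIR_REPLY, while the client's packets still match
the untouched ORIG tuple.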

These two things are what the ip_vs_deal_with_conntrack()
function does.  Luckily there is an ip_conntrack_alter_reply()
function exported by Netfilter, which we can use.  Then we can
also call the newly-exported __ip_conntrack_confirm() to confirm
the connection.  (We need to do the reply altering first because
__ip_conntrack_confirm()ing the CTE uses the addresses in the
ORIG and REPLY tuples to place the CTE in the hash tables, and
we want it placed based on the /new/ reply tuple.)
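
To see why the order matters, here is a toy user-space model of
a hash table keyed on the REPLY tuple.  Everything in it (struct
table, tuple_hash(), cte_confirm(), lookup_reply(), the bucket
count and the hash function) is an assumption made for the
sketch and is not how Netfilter actually hashes its entries.

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 16

struct tuple {
    uint32_t src_ip;
    uint16_t src_port;
    uint32_t dst_ip;
    uint16_t dst_port;
};

struct cte {
    struct tuple orig;
    struct tuple reply;
    int confirmed;
    struct cte *next;          /* chain within one bucket */
};

struct table {
    struct cte *bucket[NBUCKETS];
};

/* Deliberately simple hash, for illustration only. */
static unsigned tuple_hash(const struct tuple *t)
{
    return (t->src_ip ^ t->dst_ip ^ t->src_port ^ t->dst_port) % NBUCKETS;
}

static int tuple_equal(const struct tuple *a, const struct tuple *b)
{
    return a->src_ip == b->src_ip && a->src_port == b->src_port &&
           a->dst_ip == b->dst_ip && a->dst_port == b->dst_port;
}

/* "Confirm": link the entry into the bucket chosen by its
 * reply tuple as it is at this moment. */
static void cte_confirm(struct table *tb, struct cte *c)
{
    unsigned h = tuple_hash(&c->reply);

    c->next = tb->bucket[h];
    tb->bucket[h] = c;
    c->confirmed = 1;
}

/* Look up a reply-direction packet: only its own bucket is searched. */
static struct cte *lookup_reply(const struct table *tb,
                                const struct tuple *pkt)
{
    struct cte *c;

    for (c = tb->bucket[tuple_hash(pkt)]; c != NULL; c = c->next)
        if (tuple_equal(&c->reply, pkt))
            return c;
    return NULL;
}
```

If the reply tuple is altered first and the entry confirmed
afterwards, lookup_reply() finds it.  If the entry is confirmed
first and its reply tuple changed afterwards, it is still linked
in the bucket chosen by the old tuple, so a lookup by the new
tuple generally misses it.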

There is a slight complication in that the NAT code in Netfilter
gets confused if addressing tuples change, so we need to tell
the NAT code to re-place the CTE in its hash tables.  This is
done with the newly-exported ip_nat_replace_in_hashes()
function.

The ip_vs_deal_with_conntrack() function is called from the
ip_vs_nat_xmit() function, since this whole problem only applies
to LVS-NAT.  It is only called if the CTE is unconfirmed.


Hacking round a possible race
-----------------------------

When testing this, we found that very occasionally there would
be a problem when the Netfilter CTE timed out and was deleted.
The code would fail an assertion: the CTE about to be deleted
was not linked into the hash chain it claimed it was.  This
would happen after a few tens of thousands of connections from
the same client to the same virtual service.

We tracked this down to the above ip_vs_deal_with_conntrack()
code being called for a CTE which already existed and was
already confirmed.  Doing this moved the CTE to a different hash
chain and broke things.

The only explanation I could come up with is that there is a
race in the ip_vs code.  The ip_vs code doesn't set up one timer
per connection entry.  Instead, it uses a kernel timer to do
some work every second.  I didn't look into this too deeply, but
it looked like the following is a possibility.

If the slow-timer code decides that an LVS connection should be
expired, there seems to be a window where a packet can arrive
and update that connection, meaning that it should no longer be
expired.  But it is expired anyway.  More details can be
supplied on request.  If somebody who knows the timer code could
check whether the above is a possibility, and fix it if so, that
would be good.

The workaround detects if the CTE is already confirmed, and
deletes it and also drops the packet if so.  Higher levels in
the stack take care of retransmitting so nothing too drastic
goes wrong.
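
The decision made by the workaround can be summarised by the
following user-space sketch.  The names (struct cte with
"confirmed" and "deleted" flags, on_new_lvs_connection(), the
verdict enum) are hypothetical and only model the logic described
above; the real code operates on Netfilter's own structures.

```c
enum verdict { VERDICT_ACCEPT, VERDICT_DROP };

/* Hypothetical minimal model of a connection-tracking entry. */
struct cte {
    int confirmed;   /* already linked into the hash tables? */
    int deleted;     /* has the entry been thrown away? */
};

/* Called when LVS believes a packet starts a new connection. */
static enum verdict on_new_lvs_connection(struct cte *c)
{
    if (c->confirmed) {
        /* Netfilter and LVS disagree: the CTE already exists.
         * Delete it and drop the packet; the sender's
         * retransmission will get a fresh, unconfirmed CTE. */
        c->deleted = 1;
        return VERDICT_DROP;
    }

    /* Normal path: the reply tuple would be altered here
     * (elided in this sketch), then the entry is confirmed. */
    c->confirmed = 1;
    return VERDICT_ACCEPT;
}
```

A stale, already-confirmed entry leads to VERDICT_DROP and is
marked deleted; the retransmitted packet then arrives with a
fresh entry and takes the normal path.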

Later, we noticed the workaround being triggered much more often
than we'd expect, and it turned out that incoming packets with
the SYN and ACK bits both set were being treated as potentially
starting new connections, whereas SYN/ACK packets are in fact a
response to a connection initiated by the director itself.  So
we tightened the test to be

   ((h.th->syn && !h.th->ack) || (iph->protocol != IPPROTO_TCP))

instead of

   (h.th->syn || (iph->protocol!=IPPROTO_TCP))

which is how it is in the original LVS code.  This doesn't seem
to have caused any nasty side effects.
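
The difference between the two tests can be checked in isolation
with a small user-space sketch.  The struct tcp_flags type, the
PROTO_* constants and the two function names are invented for
the example; in the kernel the flags come from the TCP header
(h.th) and the protocol from the IP header (iph).

```c
/* IP protocol numbers, as assigned by IANA. */
#define PROTO_TCP 6
#define PROTO_UDP 17

/* Hypothetical stand-in for the two TCP header flags we care about. */
struct tcp_flags {
    unsigned syn : 1;
    unsigned ack : 1;
};

/* Original LVS test: any TCP packet with SYN set, or any non-TCP packet. */
static int may_start_conn_old(int protocol, struct tcp_flags f)
{
    return f.syn || protocol != PROTO_TCP;
}

/* Tightened test: SYN set and ACK clear, or any non-TCP packet. */
static int may_start_conn_new(int protocol, struct tcp_flags f)
{
    return (f.syn && !f.ack) || protocol != PROTO_TCP;
}
```

A plain SYN passes both tests and a plain ACK fails both; a
SYN/ACK passes the old test but not the new one.  Non-TCP
packets are unaffected by the change.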

Note that this only happened when an FTP virtual service was
configured, because of the code in ip_vs_service_get() which
allows a "wild-card" match for incoming FTP data connections.


FTP connections
---------------

The other main change is to the LVS FTP module.  We add code to
the two functions ip_vs_ftp_out() and ip_vs_ftp_in(), to deal
with passive and active data transfers respectively.  The basic
idea is the same for both types of transfer.

By keeping an eye on the actual traffic going between the client
and the FTP server, we can tell when a data transfer is about to
take place.  For a passive transfer, the ip_vs_ftp module looks
out for the string "227 Entering Passive Mode" followed by the
address and port the server will listen on.  For an active
transfer, the client transmits the "PORT" command followed by
the address and port the client will listen on.
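
As an illustration of what is being looked for, here is a
user-space sketch that extracts the address and port from the two
kinds of line.  It assumes the common "h1,h2,h3,h4,p1,p2"
encoding from the FTP specification, with the passive reply
carrying it in parentheses; the function and struct names are
made up for this example and this is not the parsing code used
in ip_vs_ftp.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for a parsed address and port (host order). */
struct endpoint {
    uint32_t ip;
    uint16_t port;
};

/* Parse "h1,h2,h3,h4,p1,p2"; returns 0 on success, -1 on error. */
static int parse_hostport(const char *s, struct endpoint *ep)
{
    unsigned h1, h2, h3, h4, p1, p2;

    if (sscanf(s, "%u,%u,%u,%u,%u,%u",
               &h1, &h2, &h3, &h4, &p1, &p2) != 6)
        return -1;
    if (h1 > 255 || h2 > 255 || h3 > 255 || h4 > 255 ||
        p1 > 255 || p2 > 255)
        return -1;

    ep->ip = (h1 << 24) | (h2 << 16) | (h3 << 8) | h4;
    ep->port = (uint16_t)((p1 << 8) | p2);
    return 0;
}

/* Server -> client: "227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)". */
static int parse_pasv_reply(const char *line, struct endpoint *ep)
{
    static const char prefix[] = "227 Entering Passive Mode";
    const char *p;

    if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
        return -1;
    p = strchr(line, '(');
    if (p == NULL)
        return -1;
    return parse_hostport(p + 1, ep);
}

/* Client -> server: "PORT h1,h2,h3,h4,p1,p2". */
static int parse_port_cmd(const char *line, struct endpoint *ep)
{
    if (strncmp(line, "PORT ", 5) != 0)
        return -1;
    return parse_hostport(line + 5, ep);
}
```

The port is sent as two decimal bytes, high byte first, so for
example "19,137" means port 19 * 256 + 137 = 5001.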

Once we have detected that a data transfer is about to take
place, we add code to tell Netfilter's connection-tracking code
to /expect/ the data connection.  Then, packets belonging to the
data connection will be labelled "RELATED" and can be allowed by
firewall rules.  There is an exported function
ip_conntrack_expect_related(), which we call.  The only
difference between the set-up for passive and active transfers
is that for passive transfers we don't know the port the client
will connect from, so have to specify the source port as "don't
care" by means of its mask.
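
The "don't care" source port can be pictured with a
tuple-plus-mask comparison like the one below.  This is a
user-space sketch with invented names (struct tuple,
tuple_mask_match()); it is meant to show the idea of masking out
a field, not the layout of the real expectation structure passed
to ip_conntrack_expect_related().

```c
#include <stdint.h>

/* Hypothetical simplified tuple; also used to hold a mask. */
struct tuple {
    uint32_t src_ip;
    uint16_t src_port;
    uint32_t dst_ip;
    uint16_t dst_port;
};

/* A packet matches an expectation if, in every field, the bits
 * selected by the mask agree.  A mask field of 0 means "don't
 * care"; all-ones means the field must match exactly. */
static int tuple_mask_match(const struct tuple *pkt,
                            const struct tuple *want,
                            const struct tuple *mask)
{
    return ((pkt->src_ip   ^ want->src_ip)   & mask->src_ip)   == 0 &&
           ((pkt->src_port ^ want->src_port) & mask->src_port) == 0 &&
           ((pkt->dst_ip   ^ want->dst_ip)   & mask->dst_ip)   == 0 &&
           ((pkt->dst_port ^ want->dst_port) & mask->dst_port) == 0;
}
```

For a passive transfer the mask would have src_port set to 0, so
a connection from any client port to the advertised address and
port matches.  For an active transfer all fields are known, so
every mask field can be all-ones.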

The ip_conntrack_expect_related() function allows us to specify
a callback function; we use ip_vs_ftp_expect_callback() (new
function in this patch).  ip_vs_ftp_expect_callback() works out
whether the new connection is for passive or active, modifies
the REPLY tuple, and confirms the CTE.

I've just noticed that I modify the reply tuple directly instead
of calling ip_conntrack_alter_reply().  Can't see any good
reason for this, so should probably change the code to use
ip_conntrack_alter_reply() instead.  Might not have time to test
that change here, so will leave it alone for now.

So to run a virtual FTP service, load the extra ip_vs_ftp
module, but /not/ the ip_conntrack_ftp or ip_nat_ftp modules.
It is very likely that the ip_vs_ftp module would not cooperate
very well with those two modules, so if you want to run
a non-virtual FTP service /and/ load-balance a virtual FTP
service on the same machine, more work might be required.


route_me_harder()
-----------------

We call this function to possibly re-route the packet, because
we were using policy routing (iproute2).  This allows routing
decisions to depend on more than just the destination IP address
of the packet.  In particular, a routing decision can be
influenced by the source IP address of the packet, and by the
fact that the packet should be treated as originating with the
local machine.  The call to route_me_harder() re-makes the
routing decision in light of the new state of the packet.  It
could be removed (or disabled via a sysctl) if the overhead was
too annoying in an application which didn't require this extra
flexibility.
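
To show why the routing decision can change once LVS has
rewritten the packet, here is a toy model of source-based policy
rules in user-space C.  The struct policy_rule type,
select_table() and the table number 100 are assumptions for the
sketch; only the value 254 for the main routing table is taken
from Linux.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_MAIN 254   /* Linux's main routing table id */

/* Hypothetical policy rule: packets whose source address falls
 * in src_net/src_mask are routed using the given table. */
struct policy_rule {
    uint32_t src_net;
    uint32_t src_mask;
    int table;
};

/* Return the table chosen by the first matching rule, or the
 * main table if no rule matches. */
static int select_table(const struct policy_rule *rules, size_t n,
                        uint32_t src_ip)
{
    size_t i;

    for (i = 0; i < n; i++)
        if ((src_ip & rules[i].src_mask) == rules[i].src_net)
            return rules[i].table;
    return TABLE_MAIN;
}
```

With a rule matching the VIP as source, a reply that still
carries the RIP as its source address selects the main table,
but the same packet after LVS has rewritten its source to the
VIP selects the rule's table.  Re-making the routing decision
after the rewrite is what picks up that difference.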


Additional #defines
-------------------

There are additional #defines available to add
assertion-checking and various amounts of debugging to the
output of the new code.

#define BN_ASSERTIONS to include extra code which checks various
things are as they should be.  This adds a small amount of
overhead (sorry, haven't measured it) but caught some problems
in development.

#define BN_DEBUG_FTP to emit diagnostic and tracing information
from the modified ip_vs_ftp module.  Again, was useful during
development but probably not useful in production.

#define BN_DEBUG_IPVS_CONN to emit diagnostic and tracing
information from the new code which handles Netfilter's CTEs.
Same comments apply: useful while I was working on it, but
probably not in actual use.
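
For readers unfamiliar with this style of compile-time switch,
the following user-space sketch shows the usual pattern.  The
BN_ASSERT() macro and checked_ref_get() are hypothetical names
chosen for the example; the patch's own macros and messages may
well differ, and kernel code would use printk() rather than
fprintf().

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef BN_ASSERTIONS
/* Checking enabled: report the failed condition and stop. */
#define BN_ASSERT(cond)                                          \
    do {                                                         \
        if (!(cond)) {                                           \
            fprintf(stderr, "assertion failed: %s (%s:%d)\n",    \
                    #cond, __FILE__, __LINE__);                  \
            abort();                                             \
        }                                                        \
    } while (0)
#else
/* Checking disabled: the macro compiles to nothing. */
#define BN_ASSERT(cond) do { } while (0)
#endif

/* Example use: take a reference, checking the count is sane. */
static int checked_ref_get(int refcount)
{
    BN_ASSERT(refcount >= 0);
    return refcount + 1;
}
```

Building with -DBN_ASSERTIONS turns the checks on; without it
they cost nothing at run time.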


Copyright and Licence
=====================

    This patch is Copyright (C) 2001--2002
    Antefacto Ltd, 181 Parnell St, Dublin 1, Ireland.

    This code is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA


Contact: ben@xxxxxxxxxxxxx or glen@xxxxxxxxxxxxx

--Ben North, 22 January, 2002.