Making LVS work with Netfilter's connection tracking
====================================================

The two attached patches modify the kernel and the ipvs modules in such
a way that ipvs NAT connections are correctly tracked by the Netfilter
connection-tracking code. This means that firewalling rules can be put
in place to allow incoming connections to a virtual service, and then by
allowing ESTABLISHED and RELATED packets to pass the FORWARD chain, we
achieve stateful firewalling of these connections.

For example, if director 4.3.2.1 is offering a virtual service on TCP
port 8899, we can do

    iptables -A INPUT -p tcp -d 4.3.2.1 --dport 8899 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
    iptables -A FORWARD -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT

and get the desired behaviour. Note that the second rule (the one in
the FORWARD chain) covers all virtual services offered by the same
director, so if another service is offered on port 9900, the complete
set of rules required would be

    iptables -A INPUT -p tcp -d 4.3.2.1 --dport 8899 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -p tcp -d 4.3.2.1 --dport 9900 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
    iptables -A FORWARD -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT

i.e., one rule in the INPUT chain is required per virtual service, but
the rule in the FORWARD chain covers all virtual services.


Patches required
================

Patch to main kernel source
---------------------------

There is a small change required to the kernel patch. The stock kernel
patch which comes with the ip_vs distribution just adds a few
EXPORT_SYMBOL()s to ksyms.c. For the Netfilter connection-tracking
functionality, we need a bit more. The files affected, and reasons,
are:

ip_conntrack_core.c:

    init_conntrack(): Mark even more clearly that the newly-created
    connection-tracking entry is not in the hash tables. This change
    isn't strictly necessary but makes assertion-checking easier.

ip_conntrack_standalone.c:

    Export the symbol __ip_conntrack_confirm(). I didn't really like
    the idea of exporting a symbol starting with double-underscore, but
    nothing too bad seems to have happened. The function seems to take
    care of reference-counting, so I think we're OK here.

ip_nat_core.c:

    ip_nat_replace_in_hashes(): (new function) Exported wrapper round
    replace_in_hashes() which deals with the locking on ip_nat_lock.

ip_nat_standalone.c:

    Export the new ip_nat_replace_in_hashes() function.

ip_nat.h:

    Declare the new ip_nat_replace_in_hashes() function.

More explanation below.

Patch to ip_vs code
-------------------

ip_vs_app.c:

    skb_replace(): Copy debugging information across to the new skb, if
    debugging is enabled. This is a separate issue to the main
    connection-tracking patch, but was causing spurious warnings about
    which hooks a skb had passed through.

ip_vs_conn.c:

    Include some netfilter header files.

    Declare a new function ip_vs_deal_with_conntrack().

    ip_vs_nat_xmit(): Code to make sure that Netfilter's
    connection-tracking entry is correct.

    ip_vs_deal_with_conntrack(): (new function) The guts of the new
    functionality. Changes the data inside the Netfilter
    connection-tracking entry to match the actual packet flow.

ip_vs_core.c:

    route_me_harder(): (new function) Copied from ip_nat_standalone.c.
    Code to re-make the routing decision for a packet, treated as
    locally-generated.

    ip_vs_out(): Separate from the connection-tracking code changes,
    don't send ICMP unreachable messages. This has been discussed on
    the list recently and I think the consensus was that this change is
    OK. The sysctl method would be better though, so ignore this bit.
    Also call route_me_harder() to decide whether the outbound packet
    needs to be routed differently now it is supposed to be coming from
    the director machine itself.

    ip_vs_in(): When checking if a packet might be trying to start a
    new connection, check that it has SYN but not ACK. Previously, the
    only check was that it had SYN set.

    If there is a new connection being attempted, check for consistency
    between Netfilter's connection-tracking table and LVS'. More
    explanation of this bit below.

ip_vs_ftp.c:

    Include the Netfilter header files.

    Declare new function ip_vs_ftp_expect_callback().

    ip_vs_ftp_out(): Once we have noticed that a passive data-transfer
    connection has been negotiated at application level, tell Netfilter
    to expect this connection and so treat it as RELATED.

    ip_vs_ftp_in(): Once we have noticed that an active data-transfer
    connection has been negotiated at application level, tell Netfilter
    to expect this connection and so treat it as RELATED.

    ip_vs_ftp_expect_callback(): (new function) When the RELATED packet
    arrives (for a data-transfer connection), update Netfilter's
    connection-tracking entry for the connection.


Explanation
===========

General connections (i.e., not FTP)
-----------------------------------

Each entry in Netfilter's connection-tracking table has two tuples
describing source and destination addresses and ports. One of these
tuples is the ORIG tuple, and describes the addressing of packets
travelling in the "original" direction, i.e., from the machine that
initiated the connection to the machine that responded. The other is
the REPLY tuple, which describes the addressing of packets travelling in
the "reply" direction, i.e., from the responding machine to the
initiating machine.

Normally, the REPLY tuple is just the "inverse" of the ORIG tuple, i.e.,
has its source and destination reversed. But for LVS connections, this
is not the case. This is what causes the problem when using the
unmodified Netfilter code with IPVS connections. Actually, it's one of
the things that causes trouble.

The following is roughly what happens with the unmodified code for the
start of a TCP connection to a virtual service.

Suppose we have

        +--------+
        | Client |
        +--------+
          (CIP)       <-- Client's IP address
            |
            |
        { internet }
            |
            |
          (VIP)       <-- Virtual IP address
       +----------+
       | Director |
       +----------+
          (PIP)       <-- (Director's Private IP address)
            |
            |
          (RIP)       <-- Real (server's) IP address
      +-------------+
      | Real server |
      +-------------+

Then the client sends a packet to the VIP:VPORT; say

    CIP:CPORT -> VIP:VPORT

Netfilter on the director makes a note of this packet, and sets up a
temporary connection-tracking entry with tuples as follows:

    ORIG: CIP:CPORT -> VIP:VPORT
    REPL: VIP:VPORT -> CIP:CPORT

(the "src-ip:src-port -> dest-ip:dest-port" notation is hopefully clear
enough). We will call a connection-tracking entry a "CTE" from now on.

LVS notices (in ip_vs_in(), called as part of the LOCAL_IN hook) that
VIP:VPORT is something it's interested in, grabs the packet, re-writes
it to be addressed

    CIP:CPORT -> RIP:RPORT

and sends it on its way by means of ip_send(). As a result, the
POST_ROUTING hook gets called, and ip_vs_post_routing() gets a look at
the packet. It notices that the packet has been marked as belonging to
LVS, and calls the (*okfn), sending the packet to the wire without
further ado. When it has been transmitted, the reference count on the
CTE falls to zero, and it is deleted. (This is a mild guess but I think
it is right.)

Normally, CTEs avoid this fate because __ip_conntrack_confirm() is
called for them, either via ip_confirm() as a late hook in LOCAL_IN, or
through ip_refrag() called as a late hook in POST_ROUTING. "Confirming"
the CTE involves linking it into some hash tables, and ensuring it isn't
deleted.

So this is the first problem: the CTE is not "confirmed".

Suppose we confirmed the connection. Then when the Real Server replies
to this packet, it sends a packet addressed as

    RIP:RPORT -> CIP:CPORT

to the director (because the Director is the router for such packets, as
seen by the Real Server).
Then the connection-tracking code in Netfilter on the director tries to
look up the CTE for this packet, but can't find one. The CTE we /want/
it to match says

    ORIG: CIP:CPORT -> VIP:VPORT
    REPL: VIP:VPORT -> CIP:CPORT

with no mention of the RIP:RPORT. So this reply packet gets labelled as
"NEW", whereas we wanted it to be labelled as "ESTABLISHED".

So as well as confirming the CTE, we also need to alter the REPLY tuple
so that it will match the RIP:RPORT -> CIP:CPORT packet the Real Server
sends back. Then everything will work.

These two things are what the ip_vs_deal_with_conntrack() function does.
Luckily there is an ip_conntrack_alter_reply() function exported by
Netfilter, which we can use. Then we can also call the newly-exported
__ip_conntrack_confirm() to confirm the connection. (We need to do the
reply altering first because __ip_conntrack_confirm()ing the CTE uses
the addresses in the ORIG and REPLY tuples to place the CTE in the hash
tables, and we want it placed based on the /new/ reply tuple.)

There is a slight complication in that the NAT code in Netfilter gets
confused if addressing tuples change, so we need to tell the NAT code to
re-place the CTE in its hash tables. This is done with the
newly-exported ip_nat_replace_in_hashes() function.

The ip_vs_deal_with_conntrack() function is called from the
ip_vs_nat_xmit() function, since this whole problem only applies to
LVS-NAT. It is only called if the CTE is unconfirmed.

Hacking round a possible race
-----------------------------

When testing this, we found that very occasionally there would be a
problem when the Netfilter CTE timed out and was deleted. The code
would fail an assertion: the CTE about to be deleted was not linked into
the hash chain it claimed it was. This would happen after a few tens of
thousands of connections from the same client to the same virtual
service.
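
As an aside, the ordering used in ip_vs_deal_with_conntrack() (alter the
reply tuple, then confirm) and the hash-chain property checked by the
failing assertion can be pictured with a small toy model. This is only
an illustrative sketch in user-space C: every toy_* name is
hypothetical, the table hashes only the reply direction, and none of it
is the kernel's ip_conntrack code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: all toy_* names are hypothetical, not kernel API. */
#define TOY_BUCKETS 8

struct toy_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

struct toy_cte {
    struct toy_tuple orig;   /* initiator -> responder */
    struct toy_tuple reply;  /* addressing expected on reply packets */
    bool confirmed;          /* true once linked into toy_table */
    struct toy_cte *next;    /* hash-chain link */
};

static struct toy_cte *toy_table[TOY_BUCKETS];

static size_t toy_hash(const struct toy_tuple *t)
{
    uint32_t h = t->src_ip ^ (t->dst_ip * 31u)
               ^ t->src_port ^ ((uint32_t)t->dst_port << 16);
    return h % TOY_BUCKETS;
}

static bool toy_tuple_equal(const struct toy_tuple *a,
                            const struct toy_tuple *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip
        && a->src_port == b->src_port && a->dst_port == b->dst_port;
}

/* New, unconfirmed entry whose REPLY is the plain inverse of ORIG. */
static struct toy_cte toy_new_cte(uint32_t cip, uint16_t cport,
                                  uint32_t vip, uint16_t vport)
{
    struct toy_cte c = { { cip, vip, cport, vport },
                         { vip, cip, vport, cport }, false, NULL };
    return c;
}

/* Step 1: make REPLY read RIP:RPORT -> CIP:CPORT.  Refused once the
 * entry is linked, because its chain was chosen from the old tuple. */
static bool toy_alter_reply(struct toy_cte *c, uint32_t rip, uint16_t rport)
{
    if (c->confirmed)
        return false;
    c->reply.src_ip = rip;
    c->reply.src_port = rport;
    return true;
}

/* Step 2: link the entry into the chain selected by its reply tuple. */
static void toy_confirm(struct toy_cte *c)
{
    size_t b = toy_hash(&c->reply);
    c->next = toy_table[b];
    toy_table[b] = c;
    c->confirmed = true;
}

/* Find the entry whose reply tuple matches an incoming reply packet. */
static struct toy_cte *toy_lookup_reply(const struct toy_tuple *pkt)
{
    struct toy_cte *c;
    for (c = toy_table[toy_hash(pkt)]; c != NULL; c = c->next)
        if (toy_tuple_equal(&c->reply, pkt))
            return c;
    return NULL;
}

/* Is the entry on the chain that its current reply tuple hashes to? */
static bool toy_chain_consistent(const struct toy_cte *c)
{
    const struct toy_cte *p;
    for (p = toy_table[toy_hash(&c->reply)]; p != NULL; p = p->next)
        if (p == c)
            return true;
    return false;
}
```

In this model a reply from the real server is found only once the reply
tuple has been altered and the entry then confirmed. toy_alter_reply()
refuses to touch an entry that is already linked, since changing the
tuple at that point would leave the entry on a chain that no longer
matches it.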

We tracked this down to the above ip_vs_deal_with_conntrack() code being
called for a CTE which already existed and was already confirmed. Doing
this moved the CTE to a different hash chain and broke things.

The only explanation I could come up with is that there is a race in the
ip_vs code. The ip_vs code doesn't set up one timer per connection
entry. Instead, it uses a kernel timer to do some work every second. I
didn't look into this too deeply, but it looked like the following is a
possibility. If the slow-timer code decides that an LVS connection
should be expired, there seems to be a window where a packet can arrive
and update that connection, meaning that it should no longer be expired.
But it is expired anyway. There are more details, available on request.
But if somebody who knows the timer code could check whether the above
is a possibility, and fix it if so, that would be good.

The workaround detects if the CTE is already confirmed, and if so
deletes it and also drops the packet. Higher levels in the stack take
care of retransmitting so nothing too drastic goes wrong.

Later, we noticed the workaround being triggered much more often than
we'd expect, and it turned out that incoming packets with the SYN and
ACK bits both set were being treated as potentially starting new
connections, whereas SYN/ACK packets are in fact a response to a
connection initiated by the director itself. So we tightened the test
to be

    ((h.th->syn && !h.th->ack) || (iph->protocol != IPPROTO_TCP))

instead of

    (h.th->syn || (iph->protocol!=IPPROTO_TCP))

which is how it is in the original LVS code. This doesn't seem to have
caused any nasty side effects. Note that this only happened when an FTP
virtual service was configured, because of the code in
ip_vs_service_get() which allows a "wild-card" match for incoming FTP
data connections.

FTP connections
---------------

The other main change is to the LVS FTP module.
We add code to the two functions ip_vs_ftp_out() and ip_vs_ftp_in(), to
deal with passive and active data transfers respectively.

The basic idea is the same for both types of transfer. By keeping an
eye on the actual traffic going between the client and the FTP server,
we can tell when a data transfer is about to take place. For a passive
transfer, the ip_vs_ftp module looks out for the string "227 Entering
Passive Mode" followed by the address and port the server will listen
on. For an active transfer, the client transmits the "PORT" command
followed by the address and port the client will listen on.

Once we have detected that a data transfer is about to take place, we
add code to tell Netfilter's connection-tracking code to /expect/ the
data connection. Then, packets belonging to the data connection will be
labelled "RELATED" and can be allowed by firewall rules. There is an
exported function ip_conntrack_expect_related(), which we call. The
only difference between the set-up for passive and active transfers is
that for passive transfers we don't know the port the client will
connect from, so have to specify the source port as "don't care" by
means of its mask.

The ip_conntrack_expect_related() function allows us to specify a
callback function; we use ip_vs_ftp_expect_callback() (new function in
this patch). ip_vs_ftp_expect_callback() works out whether the new
connection is for passive or active, modifies the REPLY tuple, and
confirms the CTE. I've just noticed that I modify the reply tuple
directly instead of calling ip_conntrack_alter_reply(). Can't see any
good reason for this, so should probably change the code to use
ip_conntrack_alter_reply() instead. Might not have time to test that
change here, so will leave it alone for now.

So to run a virtual FTP service, load the extra ip_vs_ftp module, but
/not/ the ip_conntrack_ftp or ip_nat_ftp modules.
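
To make the detection and "don't care" source port ideas above more
concrete, here is a minimal user-space sketch in C. It assumes the
usual "h1,h2,h3,h4,p1,p2" encoding of address and port used by the PORT
command and the 227 reply (port = p1 * 256 + p2). All toy_* names are
hypothetical; this is not the ip_vs_ftp code and does not use the real
ip_conntrack_expect_related() interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical toy types; not the kernel's expectation structures. */
struct toy_endpoint {
    uint32_t ip;    /* host byte order, e.g. 0xc0a80002 = 192.168.0.2 */
    uint16_t port;
};

/* Parse "h1,h2,h3,h4,p1,p2"; trailing text such as ")." is ignored. */
static bool toy_parse_hostport(const char *s, struct toy_endpoint *out)
{
    unsigned v[6];
    int i;

    if (sscanf(s, "%u,%u,%u,%u,%u,%u",
               &v[0], &v[1], &v[2], &v[3], &v[4], &v[5]) != 6)
        return false;
    for (i = 0; i < 6; i++)
        if (v[i] > 255)
            return false;
    out->ip = (v[0] << 24) | (v[1] << 16) | (v[2] << 8) | v[3];
    out->port = (uint16_t)((v[4] << 8) | v[5]);
    return true;
}

/* Passive transfer: server says where it will listen. */
static bool toy_parse_227(const char *line, struct toy_endpoint *out)
{
    static const char prefix[] = "227 Entering Passive Mode";
    const char *p;

    if (strncmp(line, prefix, sizeof prefix - 1) != 0)
        return false;
    p = strchr(line, '(');
    return p != NULL && toy_parse_hostport(p + 1, out);
}

/* Active transfer: client says where it will listen. */
static bool toy_parse_port_cmd(const char *line, struct toy_endpoint *out)
{
    if (strncmp(line, "PORT ", 5) != 0)
        return false;
    return toy_parse_hostport(line + 5, out);
}

/* An expected data connection.  src_port_mask == 0 means the source
 * port is "don't care"; 0xffff means it must match exactly. */
struct toy_expect {
    struct toy_endpoint src, dst;
    uint16_t src_port_mask;
};

static bool toy_expect_matches(const struct toy_expect *e,
                               struct toy_endpoint src,
                               struct toy_endpoint dst)
{
    return src.ip == e->src.ip
        && dst.ip == e->dst.ip
        && dst.port == e->dst.port
        && ((src.port ^ e->src.port) & e->src_port_mask) == 0;
}
```

For a passive transfer the expectation would be built with
src_port_mask set to 0, so that a data connection from any client port
to the advertised server address and port matches.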

The ip_vs_ftp module would very likely not cooperate well with the
ip_conntrack_ftp and ip_nat_ftp modules, so if you want to run a
non-virtual FTP service /and/ load-balance a virtual FTP service on the
same machine, more work might be required.

route_me_harder()
-----------------

We call this function to possibly re-route the packet, because we were
using policy routing (iproute2). This allows routing decisions to
depend on more than just the destination IP address of the packet. In
particular, a routing decision can be influenced by the source IP
address of the packet, and by the fact that the packet should be treated
as originating with the local machine. The call to route_me_harder()
re-makes the routing decision in light of the new state of the packet.
It could be removed (or disabled via a sysctl) if the overhead was too
annoying in an application which didn't require this extra flexibility.

Additional #defines
-------------------

There are additional #defines available to add assertion-checking and
various amounts of debugging to the output of the new code.

#define BN_ASSERTIONS to include extra code which checks various things
are as they should be. This adds a small amount of overhead (sorry,
haven't measured it) but caught some problems in development.

#define BN_DEBUG_FTP to emit diagnostic and tracing information from the
modified ip_vs_ftp module. Again, was useful during development but
probably not useful in production.

#define BN_DEBUG_IPVS_CONN to emit diagnostic and tracing information
from the new code which handles Netfilter's CTEs. Same comments apply:
useful while I was working on it, but probably not in actual use.


Copyright and Licence
=====================

This patch is Copyright (C) 2001--2002 Antefacto Ltd, 181 Parnell St,
Dublin 1, Ireland.

This code is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Contact: ben@xxxxxxxxxxxxx or glen@xxxxxxxxxxxxx

--Ben North, 22 January, 2002.