
ipvs: handle outgoing messages in SIP persistence engine

To: lvs-devel@xxxxxxxxxxxxxxx
Subject: ipvs: handle outgoing messages in SIP persistence engine
From: Marco Angaroni <marcoangaroni@xxxxxxxxx>
Date: Tue, 08 Mar 2016 18:17:25 +0100

Hello,

I’m trying to use IPVS as a high-performance load-balancer with minimal
SIP-protocol awareness (routing based on the Call-ID header).
I started from the patch by Simon Horman described here:
https://lwn.net/Articles/399571/.
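
For context, the Call-ID is the SIP header that uniquely identifies a
call/dialog; a purely illustrative example (not taken from my traffic):

    INVITE sip:bob@example.com SIP/2.0
    Call-ID: a84b4c76e66710@pc33.example.com

The SIP persistence engine keys on this value, so that all messages
belonging to the same call are sent to the same real-server.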

However, I found the following problems / limitations (I’m using LVS-NAT
and SIP over UDP; an example configuration is sketched after the list):

  1) To actually have load-balancing based on the Call-ID header, you need
     to use one-packet-scheduling (see Simon’s statement in the above
     article: “It is envisaged that the SIP persistence engine will be used
     in conjunction with one-packet scheduling”). But with one-packet-
     scheduling the connection is deleted just after the packet is
     forwarded, so SIP responses coming from real-servers do not match any
     connection and SNAT is not applied.

  2) If you do not use "-o", IPVS behaves as a normal UDP load balancer,
     so different SIP calls (each one identified by a different Call-ID)
     coming from the same IP address/port go to the same RS. So basically
     you don’t have load-balancing based on the Call-ID as intended.

  3) The Call-ID is not learned when a new SIP call is started by a
     real-server (inside-to-outside direction), but only in the
     outside-to-inside direction (see also my comment on the LVS-users
     mailing list:
     http://archive.linuxvirtualserver.org/html/lvs-users/2016-01/msg00000.html).
     This is not specific to my deployment, but would be a general problem
     for all SIP servers acting as a B2BUA
     (https://en.wikipedia.org/wiki/Back-to-back_user_agent).

  4) One-packet-scheduling is the most expensive mode in IPVS from a
     performance point of view: for each packet to be processed, a new
     connection data structure is created and, after the packet is sent,
     deleted again by starting a timer set to expire immediately.
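
For reference, the kind of setup I’m talking about looks roughly like this
(addresses, ports and weights below are just examples, and the sip
persistence-engine module must be loaded):

    ipvsadm -A -u 192.0.2.1:5060 -s rr -o -p 120 --pe sip
    ipvsadm -a -u 192.0.2.1:5060 -r 10.0.0.10:5060 -m -w 1
    ipvsadm -a -u 192.0.2.1:5060 -r 10.0.0.11:5060 -m -w 1

Without "-o" you run into 2); with "-o" you run into 1) and 4).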

Below you can find the two patches that I used to solve these problems.
At the moment I would just like to have your opinion as IPVS experts, and
to understand whether these modifications can be considered a viable way
to solve the problems listed above.
If you consider the implementation fine, I can submit the patches properly
later. Otherwise I would be really happy to receive suggestions about
alternative implementations.
And if I simply misunderstood something, please let me know.

  p1) The basic idea is to make packets that do not match any existing
      connection but come from real-servers create new connections, instead
      of letting them pass without any effect. This is the opposite of the
      behaviour enabled by sysctl_nat_icmp_send, where packets that do not
      match a connection but come from an RS generate an ICMP message back.

      When such packets pass through ip_vs_out(), if their source IP
      address and source port match a configured real-server, a new
      connection is automatically created, in the same way as it would have
      happened if the packet had come from the outside-to-inside direction.
      A new connection template is created too, if the virtual-service is
      persistent and no matching connection template is found.
      If the service has the "-o" option, the automatically created
      connection is an OPS connection that lasts only the time needed to
      forward the packet, just as it happens on the ingress side.

      This behavior should obviously be made configurable by adding a
      specific sysctl (not implemented yet; a possible shape is sketched
      after this list).
      This fixes problems 1) and 3) and keeps OPS mode mandatory for
      SIP-UDP, so 2) would not be a problem anymore.

      The following prerequisites are needed for automatic connection
      creation; if any is missing, the packet simply follows the usual path.
      -  The Virtual-Service is not fwmark based (fwmark services do not
         store the address and port of the Virtual-Service, which are
         required to build the connection data).
      -  The Virtual-Service and real-servers must not have been configured
         with an omitted port (again, so that all the data needed to create
         the connection is available).

  p2) A costly operation done by OPS for every packet is starting the timer
      that frees the connection data. Instead of starting a timer set to
      expire immediately, I found it more efficient to call the expire
      callback directly (under certain conditions). In my tests this more
      than halved CPU usage at high loads on a virtual machine with a
      single CPU, and it seemed a good improvement for the issue described
      in 4).
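
Regarding the sysctl mentioned in p1), just to make the idea concrete, the
accessor could mirror the existing sysctl_nat_icmp_send() helper. The sketch
below is only an illustration: the sysctl name and the netns field are
hypothetical placeholders and are NOT part of the patches that follow.

    /* hypothetical accessor and field, modelled on sysctl_nat_icmp_send() */
    #ifdef CONFIG_SYSCTL
    static inline int sysctl_conn_out_create(struct netns_ipvs *ipvs)
    {
            return ipvs->sysctl_conn_out_create;
    }
    #else
    static inline int sysctl_conn_out_create(struct netns_ipvs *ipvs)
    {
            return 0;
    }
    #endif

The "if (1 && ...)" in patch 1 would then become
"if (sysctl_conn_out_create(ipvs) && ...)".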

Thanks in advance,
Marco Angaroni

Subject: [PATCH 1/2] handle connections started by real-servers

Signed-off-by: Marco Angaroni <marcoangaroni@xxxxxxxxx>
---
 include/net/ip_vs.h             |   4 ++
 net/netfilter/ipvs/ip_vs_core.c | 142 ++++++++++++++++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_ctl.c  |  31 +++++++++
 3 files changed, 177 insertions(+)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 0816c87..28db660 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1378,6 +1378,10 @@ ip_vs_service_find(struct netns_ipvs *ipvs, int af, __u32 fwmark, __u16 protocol
 bool ip_vs_has_real_service(struct netns_ipvs *ipvs, int af, __u16 protocol,
                            const union nf_inet_addr *daddr, __be16 dport);
 
+struct ip_vs_dest *
+ip_vs_get_real_service(struct netns_ipvs *ipvs, int af, __u16 protocol,
+                      const union nf_inet_addr *daddr, __be16 dport);
+
 int ip_vs_use_count_inc(void);
 void ip_vs_use_count_dec(void);
 int ip_vs_register_nl_ioctl(void);
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index f57b4dc..e3f5a70 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1099,6 +1099,132 @@ static inline bool is_new_conn_expected(const struct ip_vs_conn *cp,
        }
 }
 
+/* Creates a new connection for outgoing packets which are considered
+ * requests initiated by the real server, so that subsequent responses from
+ * external client are routed to the right real server.
+ *
+ * Pre-requisites:
+ * 1) The Real Server is identified by looking up the packet's source
+ *    IP address and source port in the RS table.
+ * 2) Virtual Service is NOT fwmark based.
+ *    In fwmark-virtual-services actual vaddr and vport are unknown until
+ *    packets are received from external network.
+ * 3) One RS is associated with only one VS.
+ *    Otherwise the first match found is used.
+ * 4) Virtual Service and Real Server must not have an omitted port.
+ *    This is because all parameters to create the connection must be known.
+ *
+ * This is an outgoing packet, so:
+ * the source ip-address of the packet is the address of the real-server,
+ * the dest ip-address of the packet is the address of the external client.
+ */
+static struct ip_vs_conn *__ip_vs_new_conn_out(struct netns_ipvs *ipvs, int af,
+                                              struct sk_buff *skb,
+                                              const struct ip_vs_iphdr *iph)
+{
+       struct ip_vs_service *svc;
+       struct ip_vs_conn_param pt, pc;
+       struct ip_vs_conn *ct = NULL, *cp = NULL;
+       struct ip_vs_dest *dest;
+       __be16 _ports[2], *pptr;
+       const union nf_inet_addr *vaddr, *daddr;
+       union nf_inet_addr snet;
+       __be16 vport, dport;
+       unsigned int flags;
+
+       EnterFunction(12);
+       /* get net and L4 ports
+        */
+       pptr = frag_safe_skb_hp(skb, iph->len, sizeof(_ports), _ports, iph);
+       if (!pptr)
+               return NULL;
+       /* verify packet comes from a real-server and get service record
+        */
+       dest = ip_vs_get_real_service(ipvs, af, iph->protocol,
+                                     &iph->saddr, pptr[0]);
+       if (!dest)
+               return NULL;
+       /* check we have all pre-requisites
+        */
+       rcu_read_lock();
+       svc = rcu_dereference(dest->svc);
+       if (!svc)
+               goto out_no_new_conn;
+       if (svc->fwmark)
+               goto out_no_new_conn;
+       vaddr = &svc->addr;
+       vport = svc->port;
+       daddr = &dest->addr;
+       dport = dest->port;
+       if (!vport || !dport)
+               goto out_no_new_conn;
+       /* for persistent service first create connection template
+        */
+       if (svc->flags & IP_VS_SVC_F_PERSISTENT) {
+               /* apply netmask the same way ingress-side does
+                */
+#ifdef CONFIG_IP_VS_IPV6
+               if (af == AF_INET6)
+                       ipv6_addr_prefix(&snet.in6, &iph->daddr.in6,
+                                        (__force __u32)svc->netmask);
+               else
+#endif
+                       snet.ip = iph->daddr.ip & svc->netmask;
+               /* fill params and create template if not existent
+                */
+               if (ip_vs_conn_fill_param_persist(svc, skb, iph->protocol,
+                                                 &snet, 0, vaddr,
+                                                 vport, &pt) < 0)
+                       goto out_no_new_conn;
+               ct = ip_vs_ct_in_get(&pt);
+               if (!ct) {
+                       ct = ip_vs_conn_new(&pt, dest->af, daddr, dport,
+                                           IP_VS_CONN_F_TEMPLATE, dest, 0);
+                       if (!ct) {
+                               kfree(pt.pe_data);
+                               goto out_no_new_conn;
+                       }
+                       ct->timeout = svc->timeout;
+               } else {
+                       kfree(pt.pe_data);
+               }
+       }
+       /* connection flags
+        */
+       flags = ((svc->flags & IP_VS_SVC_F_ONEPACKET) &&
+                iph->protocol == IPPROTO_UDP) ? IP_VS_CONN_F_ONE_PACKET : 0;
+       /* create connection
+        */
+       ip_vs_conn_fill_param(svc->ipvs, svc->af, iph->protocol,
+                             &iph->daddr, pptr[1], vaddr, vport, &pc);
+       cp = ip_vs_conn_new(&pc, dest->af, daddr, dport, flags, dest, 0);
+       if (!cp) {
+               if (ct)
+                       ip_vs_conn_put(ct);
+               goto out_no_new_conn;
+       }
+       if (ct) {
+               ip_vs_control_add(cp, ct);
+               ip_vs_conn_put(ct);
+       }
+       ip_vs_conn_stats(cp, svc);
+       rcu_read_unlock();
+       /* return connection (will be used to handle outgoing packet)
+        */
+       IP_VS_DBG_BUF(6, "New connection RS-initiated:%c c:%s:%u v:%s:%u "
+                     "d:%s:%u conn->flags:%X conn->refcnt:%d\n",
+                     ip_vs_fwd_tag(cp),
+                     IP_VS_DBG_ADDR(svc->af, &cp->caddr), ntohs(cp->cport),
+                     IP_VS_DBG_ADDR(svc->af, &cp->vaddr), ntohs(cp->vport),
+                     IP_VS_DBG_ADDR(svc->af, &cp->daddr), ntohs(cp->dport),
+                     cp->flags, atomic_read(&cp->refcnt));
+       LeaveFunction(12);
+       return cp;
+
+out_no_new_conn:
+       rcu_read_unlock();
+       return NULL;
+}
+
 /* Handle response packets: rewrite addresses and send away...
  */
 static unsigned int
@@ -1244,6 +1370,22 @@ ip_vs_out(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, in
 
        if (likely(cp))
                return handle_response(af, skb, pd, cp, &iph, hooknum);
+       if (1 && /* TODO: test against a specific sysctl */
+           (pp->protocol == IPPROTO_UDP)) {
+               /* Connection oriented protocols should not need this.
+                * Outgoing TCP / SCTP connections can be handled separately
+                * with specific iptables rules.
+                *
+                * Instead with UDP transport all packets (incoming requests +
+                * related responses, outgoing requests + related responses)
+                * might use the same set of UDP ports and pass through the LB,
+                * so we must create connections that allow all responses to be
+                * directed to the right RS and prevent them from being balanced.
+                */
+               cp = __ip_vs_new_conn_out(ipvs, af, skb, &iph);
+               if (likely(cp))
+                       return handle_response(af, skb, pd, cp, &iph, hooknum);
+       }
        if (sysctl_nat_icmp_send(ipvs) &&
            (pp->protocol == IPPROTO_TCP ||
             pp->protocol == IPPROTO_UDP ||
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index e7c1b05..c8ad6f1 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -567,6 +567,37 @@ bool ip_vs_has_real_service(struct netns_ipvs *ipvs, int af, __u16 protocol,
        return false;
 }
 
+/* Get real service record by <proto,addr,port>.
+ * In case of multiple records with the same <proto,addr,port>, only
+ * the first found record is returned.
+ */
+struct ip_vs_dest *ip_vs_get_real_service(struct netns_ipvs *ipvs, int af,
+                                         __u16 protocol,
+                                         const union nf_inet_addr *daddr,
+                                         __be16 dport)
+{
+       unsigned int hash;
+       struct ip_vs_dest *dest;
+
+       /* Check for "full" addressed entries */
+       hash = ip_vs_rs_hashkey(af, daddr, dport);
+
+       rcu_read_lock();
+       hlist_for_each_entry_rcu(dest, &ipvs->rs_table[hash], d_list) {
+               if (dest->port == dport &&
+                   dest->af == af &&
+                   ip_vs_addr_equal(af, &dest->addr, daddr) &&
+                   (dest->protocol == protocol || dest->vfwmark)) {
+                       /* HIT */
+                       rcu_read_unlock();
+                       return dest;
+               }
+       }
+       rcu_read_unlock();
+
+       return NULL;
+}
+
 /* Lookup destination by {addr,port} in the given service
  * Called under RCU lock.
  */

Subject: [PATCH 2/2] optimize release of connections in one-packet-scheduling mode

Signed-off-by: Marco Angaroni <marcoangaroni@xxxxxxxxx>
---
 net/netfilter/ipvs/ip_vs_conn.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 85ca189..550fe3f 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -104,6 +104,9 @@ static inline void ct_write_unlock_bh(unsigned int key)
        spin_unlock_bh(&__ip_vs_conntbl_lock_array[key&CT_LOCKARRAY_MASK].l);
 }
 
+/* forward declaration, needed by __ip_vs_conn_put_notimer() below
+ */
+static void ip_vs_conn_expire(unsigned long data);
 
 /*
  *     Returns hash value for IPVS connection entry
@@ -453,10 +456,16 @@ ip_vs_conn_out_get_proto(struct netns_ipvs *ipvs, int af,
 }
 EXPORT_SYMBOL_GPL(ip_vs_conn_out_get_proto);
 
+static void __ip_vs_conn_put_notimer(struct ip_vs_conn *cp)
+{
+       __ip_vs_conn_put(cp);
+       ip_vs_conn_expire((unsigned long)cp);
+}
+
 /*
  *      Put back the conn and restart its timer with its timeout
  */
-void ip_vs_conn_put(struct ip_vs_conn *cp)
+static void __ip_vs_conn_put_timer(struct ip_vs_conn *cp)
 {
        unsigned long t = (cp->flags & IP_VS_CONN_F_ONE_PACKET) ?
                0 : cp->timeout;
@@ -465,6 +474,22 @@ void ip_vs_conn_put(struct ip_vs_conn *cp)
        __ip_vs_conn_put(cp);
 }
 
+void ip_vs_conn_put(struct ip_vs_conn *cp)
+{
+       if ((cp->flags & IP_VS_CONN_F_ONE_PACKET) &&
+           (atomic_read(&cp->refcnt) == 1) &&
+           !timer_pending(&cp->timer))
+               /* one-packet-scheduling connection and we hold the last
+                * reference: try to free the connection data directly,
+                * avoiding the overhead of starting a new timer.
+                * If someone else takes a reference just after the
+                * atomic_read, ip_vs_conn_expire will postpone the expiry
+                * and call __ip_vs_conn_put_timer as usual.
+                */
+               __ip_vs_conn_put_notimer(cp);
+       else
+               __ip_vs_conn_put_timer(cp);
+}
 
 /*
  *     Fill a no_client_port connection with a client port number
@@ -850,7 +875,7 @@ static void ip_vs_conn_expire(unsigned long data)
        if (ipvs->sync_state & IP_VS_STATE_MASTER)
                ip_vs_sync_conn(ipvs, cp, sysctl_sync_threshold(ipvs));
 
-       ip_vs_conn_put(cp);
+       __ip_vs_conn_put_timer(cp);
 }
 
 /* Modify timer, so that it expires as soon as possible.
--