LVS
lvs-devel
Google
 
Web LinuxVirtualServer.org

Re: [PATCHv3 ipvs-next 2/3] netfilter: ipvs: Add Maglev consistent hashi

To: Inju Song <inju.song@xxxxxxxxxxxxx>
Subject: Re: [PATCHv3 ipvs-next 2/3] netfilter: ipvs: Add Maglev consistent hashing scheduler
Cc: lvs-devel@xxxxxxxxxxxxxxx, Simon Horman <horms@xxxxxxxxxxxx>, Wensong Zhang <wensong@xxxxxxxxxxxx>
From: Julian Anastasov <ja@xxxxxx>
Date: Sat, 24 Mar 2018 13:26:04 +0200 (EET)
        Hello,

        Only few notes below. If all goes well, v4 can be
the last version... :)

On Wed, 21 Mar 2018, Inju Song wrote:

> Implements the Google's Maglev hashing algorithm as a IPVS scheduler.
> 
> Basically it provides consistent hashing but offers some special
> features about disruption and load balancing.
> 
>   1) minimal disruption: when the set of destinations changes,
>      a connection will likely be sent to the same destination
>      as it was before.
> 
>   2) load balancing: each destination will receive an almost
>      equal number of connections.
> 
> Seel also for detail: [3.4 Consistent Hasing] in
> https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf

        Above commit message is properly indented but comments in
patches 1 and 3 should not be indented, there should not be
spaces at left.

> Signed-off-by: Inju Song <inju.song@xxxxxxxxxxxxx>
> ---
>  net/netfilter/ipvs/ip_vs_mh.c | 532 
> ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 532 insertions(+)
>  create mode 100644 net/netfilter/ipvs/ip_vs_mh.c
> 
> diff --git a/net/netfilter/ipvs/ip_vs_mh.c b/net/netfilter/ipvs/ip_vs_mh.c
> new file mode 100644
> index 0000000..e531100
> --- /dev/null
> +++ b/net/netfilter/ipvs/ip_vs_mh.c
> @@ -0,0 +1,532 @@

> +             IP_VS_DBG_BUF(6,
> +                           "MH: selected unavailable server %s:%d (offset 
> %d), reselecting",
> +                           IP_VS_DBG_ADDR(dest->af, &dest->addr),
> +                           ntohs(dest->port), roffset);

        As some vars are unsigned int, make sure all %d in
debug messages are changed to %u, for consistency. This
also includes all ports. I guess, these %d come from the
SH scheduler.

> +static void ip_vs_mh_state_free(struct rcu_head *head)
> +{
> +     struct ip_vs_mh_state *s;
> +
> +     s = container_of(head, struct ip_vs_mh_state, rcu_head);
> +     kfree(s->lookup);
> +     kfree(s);
> +}
> +
> +static int ip_vs_mh_init_svc(struct ip_vs_service *svc)
> +{
> +     struct ip_vs_mh_state *s;
> +
> +     /* allocate the MH table for this service */
> +     s = kzalloc(sizeof(*s), GFP_KERNEL);
> +     if (!s)
> +             return -ENOMEM;
> +
> +     s->lookup = kcalloc(IP_VS_MH_TAB_SIZE, sizeof(struct ip_vs_mh_lookup),
> +                         GFP_KERNEL);
> +     if (!s->lookup) {
> +             kfree(s);
> +             return -ENOMEM;
> +     }
> +
> +     generate_hash_secret(&s->hash1, &s->hash2);
> +     s->gcd = ip_vs_mh_gcd_weight(svc);
> +     s->rshift = ip_vs_mh_shift_weight(svc, s->gcd);
> +
> +     svc->sched_data = s;
> +     IP_VS_DBG(6,
> +               "MH lookup table (memory=%zdbytes) allocated for current 
> service\n",
> +               sizeof(struct ip_vs_mh_lookup) * IP_VS_MH_TAB_SIZE);
> +
> +     /* permutate & populate with current dests */
> +     return ip_vs_mh_reassign(s, svc);

        Looks like we must not set svc->sched_data on error.
We can do it like this:

        int ret;

        ...
        IP_VS_DBG()
        ret = ip_vs_mh_reassign(s, svc);
        if (ret < 0) {
                ip_vs_mh_reset(s); /* This is here for completeness */
                ip_vs_mh_state_free(&s->rcu_head);
                return ret;
        }
        /* No more failures, attach state */
        svc->sched_data = s;
        return 0;

        It should be ok because we use svc->sched_data only
from entry points, the scheduler does not use it internally.

> +}
> +
> +static void ip_vs_mh_done_svc(struct ip_vs_service *svc)
> +{
> +     struct ip_vs_mh_state *s = svc->sched_data;
> +
> +     /* got to clean up hash buckets here */
> +     ip_vs_mh_reset(s);
> +
> +     /* release the table itself */
> +     ip_vs_mh_state_free(&s->rcu_head);

        Above should be:

        call_rcu(&s->rcu_head, ip_vs_mh_state_free);

        As result, the RCU callback will be called after
RCU grace period giving chance to RCU readers to stop
using the svc->sched_data state.

> +     IP_VS_DBG(6, "MH lookup table (memory=%zdbytes) released\n",
> +               sizeof(struct ip_vs_mh_lookup) * IP_VS_MH_TAB_SIZE);

Regards

--
Julian Anastasov <ja@xxxxxx>
<Prev in Thread] Current Thread [Next in Thread>