Hello,
On Wed, 12 Aug 2015, Alex Gartrell wrote:
> The configuration of ipvs at Facebook is relatively straightforward. All
> ipvs instances bgp advertise a set of VIPs and the network prefers the
> nearest one or uses ECMP in the event of a tie. For the uninitiated, ECMP
> deterministically and statelessly load balances by hashing the packet
> (usually a 5-tuple of protocol, saddr, daddr, sport, and dport) and using
> that number as an index (basic hash table type logic).
>
> The problem is that ICMP packets (which contain really important
> information like whether or not an MTU has been exceeded) will get a
> different hash value and may end up at a different ipvs instance. With no
> information about where to route these packets, they are dropped, creating
> ICMP black holes and breaking Path MTU discovery. Suddenly, my mom's
> pictures can't load and I'm fielding midday calls that I want nothing to do
> with.
>
> To address this, this patch set introduces the ability to schedule icmp
> packets which is gated by a sysctl net.ipv4.vs.schedule_icmp. If set to 0,
> the old behavior is maintained -- otherwise ICMP packets are scheduled.
>
> Alex Gartrell (12):
> ipvs: pull out ip_vs_try_to_schedule function
> ipvs: replace ip_vs_fill_ip4hdr with ip_vs_fill_iph_skb_off
> ipvs: Add hdr_flags to iphdr
> ipvs: drop inverse argument to conn_{in,out}_get
> ipvs: Make ip_vs_schedule aware of inverse iph'es
> ipvs: add schedule_icmp sysctl
> ipvs: Use outer header in ip_vs_bypass_xmit_v6
> ipvs: attempt to schedule icmp packets
> ipvs: ensure that ICMP cannot be sent in reply to ICMP
> ipvs: support scheduling inverse and icmp TCP packets
> ipvs: support scheduling inverse and icmp UDP packets
> ipvs: support scheduling inverse and icmp SCTP packets
What is missing in this patchset is a change to ip_vs_sh.c
Patch 1:
- we always provide offset 0 here:
IP_VS_DBG_PKT(7, af, pp, skb, 0, "unhandled fragment");
We may need to add 'offset' field to struct ip_vs_iphdr,
so that we can provide valid offset to the IP_VS_DBG_PKT and
IP_VS_DBG_RL_PKT macros. We do not provide inverse flag to the
pp->debug_packet handler but it is not so fatal.
Patch 2:
- ip_vs_out_icmp_v6: first args of ip_vs_fill_iph_skb_off can be on
same line
- ip_vs_in_icmp: any good reason the ip_vs_fill_iph_skb_off call to
be moved early? Better to keep it at the ip_vs_fill_ip4hdr's place.
It is still before the conn_in_get call.
- ip_vs_in_icmp: this line should not be removed:
"- offset = ciph.len;"
because offset is used later as arg to ip_vs_icmp_xmit.
offset provides offset to header but we later need to
skip this header and to reach the ports in transport header.
- ip_vs_in_icmp_v6: we better to rename offs_ciph to offset and
later to use it where 'writable' is used, then code will look like
ip_vs_in_icmp
Patch 3:
- breaks compilation of net/netfilter/ipvs/ip_vs_pe_sip.c and
net/netfilter/xt_ipvs.c
- do not forget that net/netfilter/xt_ipvs.c has calls to
ip_vs_fill_iph_skb and conn_out_get, include it in your .config
for building:
make net/netfilter/ipvs/ net/netfilter/xt_ipvs.o
- ip_vs_fill_iph_skb_off: if we later add more flags? Should we
provide instead just a hdr_flags arg with IP_VS_HDR_* mask?
- ip_vs_out_icmp_v6: first args for ip_vs_fill_iph_skb_icmp
should be on same line
- ip_vs_in_icmp_v6: first args for ip_vs_fill_iph_skb_icmp
should be on same line
- ip_vs_in_icmp_v6: I now see that this check from
commit 2f74713d1436 ("ipvs: Complete IPv6 fragment handling for IPVS")
'(hooknum == NF_INET_LOCAL_OUT) ? 0 : 1' is simply wrong for
local clients, we should always use 'true' instead of
(hooknum != NF_INET_LOCAL_OUT).
Patch 4:
- breaks compilation of net/netfilter/xt_ipvs.c
- ip_vs_schedule: first hunk shows that we provide inverse=1
to conn_in_get. This is the only odd place that does such
inverse check. May be we can change the bit before and after
conn_in_get?:
if (!skb->dev || skb->dev->flags & IFF_LOOPBACK) {
iph->hdr_flags ^= IP_VS_HDR_INVERSE;
cp = pp->conn_in_get(svc->af, skb, iph);
iph->hdr_flags ^= IP_VS_HDR_INVERSE;
if (cp) {
IP_VS_DBG_PKT(12, svc->af, pp, skb, 0,
"Not scheduling reply for existing "
"connection");
__ip_vs_conn_put(cp);
return NULL;
}
}
Patch 5:
- ip_vs_schedule: we should provide iph->offset as arg to IP_VS_DBG_PKT
Patch 6: OK
Patch 7: OK
Patch 8:
- ip_vs_in_icmp: may be we can avoid 2nd !cp check as follows?:
if (!cp) {
int v;
if (!sysctl_schedule_icmp(net_ipvs(net)))
return NF_ACCEPT;
if (!ip_vs_try_to_schedule(AF_INET, skb, pd, &v, &cp, &ciph))
return v;
new_cp = true;
}
- same for ip_vs_in_icmp_v6
Patch 9: OK
Patch 10:
- use {} after else if the 'if' part has {}
Patch 11:
- use {} after else if the 'if' part has {}
Patch 12:
- use {} after else if the 'if' part has {}
Regards
--
Julian Anastasov <ja@xxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe lvs-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
|