I'm using keepalived to distribute DNS requests (UDP port 53) to a
group of DNS servers. The farm is using source hashing. Environment
is RHEL, with the stock keepalived and IPVS. I've reproduced the
problem with RHEL 7.2, 6.8, and an older 6.x version.
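For reference, the relevant part of the keepalived configuration looks
roughly like this (the addresses and the health check below are placeholders,
not my real values; the important parts are lb_algo sh and lb_kind DR on a
UDP virtual server):

    virtual_server 192.0.2.10 53 {
        delay_loop 5
        lb_algo sh            # source hashing
        lb_kind DR            # direct routing, no NAT / header rewriting
        protocol UDP

        real_server 192.0.2.21 53 {
            weight 1
            MISC_CHECK {
                # placeholder health check script
                misc_path "/usr/local/bin/check_dns 192.0.2.21"
                misc_timeout 3
            }
        }
        # ... plus four more real_server blocks like the one above
    }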
When a health check fails and keepalived takes a real server out of the
farm, tests show that a client using the removed server has its packets
discarded until it is remapped to a new server. I can also provoke the
problem without keepalived, by using ipvsadm to remove a real server from
the farm.
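The by-hand reproduction was along these lines (VIP and real-server
addresses are placeholders):

    # farm as keepalived programs it: UDP 53, source hashing, direct routing
    ipvsadm -A -u 192.0.2.10:53 -s sh
    ipvsadm -a -u 192.0.2.10:53 -r 192.0.2.21:53 -g -w 1
    # ... remaining real servers added the same way

    # remove one real server while client traffic is flowing
    ipvsadm -d -u 192.0.2.10:53 -r 192.0.2.21:53

    # and later add it back
    ipvsadm -a -u 192.0.2.10:53 -r 192.0.2.21:53 -g -w 1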
I ran tcpdump on the load-balancing server during the test. When the
IPVS load balancing is working as expected, I see the packets arrive
on the incoming interface (a 2-interface bond) and then immediately get
forwarded to a real server. We are using direct routing (DR), so there is
no manipulation of the IP headers.
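The capture itself was nothing fancy, just something like this on the
director (bond0 stands in for the real bond interface name):

    tcpdump -ni bond0 udp port 53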
After the real server is removed from the farm, requests from clients
that were hashed to that server still arrive, but they don't get forwarded
out. I haven't worked through all the numbers yet, but on a farm that
receives roughly 7500 requests per second, removing one of the five real
servers results in around 3400 requests not being forwarded. Under various
test scenarios it can take as long as a second before the farm behaves
normally again from the impacted clients' perspective, and the problem
gets worse as the request rate increases.
I didn't see any loss for clients who were not using the removed server
during the transition. I also didn't see any loss when a real server
was added back into the farm.
When I change the farm from source hashing to round-robin, the problem
is reduced by an order of magnitude: instead of hundreds of lost
requests, I see at most a few dozen.
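For that test I changed the scheduler on the virtual service, either with
lb_algo rr in keepalived.conf or directly with something like (placeholder
VIP again):

    ipvsadm -E -u 192.0.2.10:53 -s rr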
I'm kind of stuck at this point, as I don't know much about IPVS internals.
I've looked at the IPVS stats in /proc, but those only cover packets that
were successfully processed; there don't seem to be any counters for errors
or drops.
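The counters I mean are the ones exposed by, for example:

    cat /proc/net/ip_vs_stats
    ipvsadm -Ln --stats
    ipvsadm -Ln --rate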
iptables is in use on the load balancer hosts (a very simple list with 3 or
4 drop rules), but in my test environment I didn't see any difference when
the iptables modules were unloaded ("service iptables stop" and then
confirmed with lsmod). The modules iptables uses when in NAT mode (I think
it's nf_conntrack and a couple of others) are already blacklisted as they
caused havoc one day last year when they were accidentally loaded.
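(The stop/verify step was roughly:

    service iptables stop
    lsmod | egrep 'ip_tables|iptable_filter|nf_conntrack'   # expect no output

with the egrep list adjusted for whatever modules had been loaded.)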
So my questions are:
* Could there be a bug in the connection table code when a real server
is removed and the farm mappings have to be recalculated?
* Is it realistic to expect that no packets will be dropped when a real
server is removed from the farm?
* If not, what can I do to minimize the packet loss?
Thanks,
-- Ed