On Wed, 15 Sep 2010 13:55:54 -0500
John Lash <jlash@xxxxxxxxxxxxx> wrote:
> I'm running two systems with ipvs and keepalived. They are a localnode
> configuration, that is, the director is also a realserver.
> I've found that when I have traffic up and running I can shutdown the
> realserver on the director with only a brief burst of failures. My problem is
> when I shutdown the realserver on the non-director system.
> I have a high http traffic load up and running, driven by one
> single-threaded client. Life is good.
> Then I shutdown the server on the non-director and I get a burst of
> connection failures (Connection refused). That clears up quickly and
> connections start flowing again.
> The problem is that then I see about 10 to 20 seconds of successful
> transactions, followed by a period of about a minute where I'm getting
> connection timeouts every other time (I'm using rr). Then I move into a
> period for the next fifteen minutes where there will be several timeouts
> about every 20 seconds but otherwise normal traffic.
> The initial "Connection refused" failures happen till keepalived turns off
> the downed realserver. The part I don't understand is why after seeing
> traffic come back, I start seeing the timeouts. I've hooked up tcpdump on the
> director and it shows me that every other connection is not getting a
> response. I looked at tcpdump on the "downed" realserver and there are no odd
> packets arriving for the loadbalanced VIP and port and no evidence of
> "connection refused" back at my client.
> keepalived logs don't give any indication that its healthcheckers are
> bouncing around. ipvsadm -l --stats only shows the functioning realserver.
> Does anybody have an idea what's going on here?? This is completely
> reproducible and the timing of the connection errors is also consistent.
I finally figured it out. I needed to set expire_nodest_conn. That fixed the
connection timeouts that were trailing off for 15 minutes.
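For anyone hitting the same thing, this is the sysctl in question (the
net.ipv4.vs.* names assume a kernel with IPVS built in):

```shell
# Drop connection entries as soon as their real server is removed
# or quiesced, instead of letting them ride out the normal IPVS timeout.
sysctl -w net.ipv4.vs.expire_nodest_conn=1

# To make it persistent across reboots, add to /etc/sysctl.conf:
#   net.ipv4.vs.expire_nodest_conn = 1
```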
The first burst of errors at about T-minus 30 seconds was a bit harder to
figure out. On faster machines it happened sooner, on slower machines it
happened later. Turned out that there were a few stuck entries in the
connection table pointing to my downed server in the SYN_RECV state.
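If you want to look for those entries yourself, the IPVS connection table can
be dumped on the director with ipvsadm (needs root; output columns may vary a
bit by version):

```shell
# List the connection table with numeric addresses and pick out
# entries stuck in SYN_RECV, i.e. still pointing at the downed realserver.
ipvsadm -L -n -c | grep SYN_RECV
```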
The test client I was running was just banging connections at the director as
fast as possible, and again, depending on how fast the machines were, after
about 20 to 40 seconds it wrapped the client port number around, so I was
hitting the stuck entries and those connections were apparently dropped. Once
that happened, it took about another 10 to 20 seconds for each stuck entry to
time out (those don't seem to respond to expire_nodest_conn). I don't know if
that's by design or not.
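The wrap-around timing is easy to sanity-check with back-of-the-envelope
arithmetic. The numbers here are assumptions for illustration, not measured
values from my test: the stock Linux ephemeral port range
(net.ipv4.ip_local_port_range = 32768 61000) and a client opening roughly
1000 connections a second:

```shell
# Seconds until a single client cycles through its ephemeral port range
# and starts reusing source ports that can match stale IPVS entries.
low=32768; high=61000; rate=1000
echo "wraparound after ~$(( (high - low) / rate )) seconds"
```

At ~1000 connections/sec that works out to roughly half a minute, which lines
up with the 20 to 40 second window I saw; slower machines push fewer
connections per second and wrap later.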
Please read the documentation before posting - it's available at:
LinuxVirtualServer.org mailing list - lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Send requests to lvs-users-request@xxxxxxxxxxxxxxxxxxxxxx
or go to http://lists.graemef.net/mailman/listinfo/lvs-users