To: Andreas John <lists@xxxxxxxxxxxxxx>
Subject: Re: ipvsadm ..set / tcp timeout was: ipvsadm --set... what does it modify?
Cc: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
From: Roberto Nibali <ratz@xxxxxxxxxxxx>
Date: Tue, 14 Feb 2006 00:55:10 +0100
Dear Andreas,

I'm currently not able to reply in-depth to your email because of special work I'm doing, which requires my full attention. I'll be back in April, if you're still interested.

implemented. Of course I could fly down to Julian's place over the
week-end and we could implement it together; want to sponsor it? ;).
Well, currently I am not in a position to sponsor such development,
but things may change. In that case I would contact you via p-mail.

Fair enough.

(Ryanair does not fly to .bg .... :/ )

It's not easy to get to Julian's home. One has to fly to Sofia and then bargain for a ticket to fly down to Varna :).

As you are from .ch (as your domain tells me ;)) ... will you be at
LinuxTag (Wiesbaden, Germany)?

Unlikely; it depends on how many other kernel developers I know will attend. The conference is too user-space centric :). Also, Wiesbaden is in the middle of nowhere, and during that time I will probably be in Morocco, surfing.

And I looked at Julian's proposal. Ufff. I'm not a fluent C speaker, so if
I tried to do that myself, I would recommend not to use the result :)

To be fair, we should probably just re-enable the proc-fs based timer settings until we have a replacement. Maybe if I find my vim in time, I can cook something up for 2.6.

Eh, I got it from /proc:

# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200

Ahh, but that's not 2h + 9*75s :).

May I quote from the document you mentioned, ip-sysctl.txt, and comment:

It's better to quote from the source, since ip-sysctl.txt can be inaccurate sometimes; not in this case though.

tcp_keepalive_time - INTEGER: How often TCP sends out keepalive messages
when keepalive is enabled. Default: 2hours.

-> The kernel tries every 2 hours to send probes and check if the
connection is still there?

tcp_keepalive_probes - INTEGER: How many keepalive probes TCP sends out,
until it decides that the connection is broken. Default value: 9.

-> The kernel tries 9 times to send such probes. The probes will be sent
75 secs (tcp_keepalive_intvl) apart. After all 9 have failed, the kernel
will drop the connection.

That brings me back to my 7200 + 9 * 75 secs. But it may be only 7200 +
8 * 75 secs, because the value says nothing about the timeout of the
last IP packet.... errrrh </confused> ...

You're still referring to sockets, whereas IPVS has nothing to do with sockets. I'm sorry, but due to time constraints I have to refer you to reading the source, Luke :):

Start with net/ipv4/tcp.c:tcp_setsockopt()

  case TCP_KEEPIDLE:
      if (val < 1 || val > MAX_TCP_KEEPIDLE)
              err = -EINVAL;
      else {
              tp->keepalive_time = val * HZ;
              if (sock_flag(sk, SOCK_KEEPOPEN) &&
                  !((1 << sk->sk_state) &
                    (TCPF_CLOSE | TCPF_LISTEN))) {
                      __u32 elapsed = tcp_time_stamp - tp->rcv_tstamp;
                      if (tp->keepalive_time > elapsed)
                              elapsed = tp->keepalive_time - elapsed;
                      else
                              elapsed = 0;
                      inet_csk_reset_keepalive_timer(sk, elapsed);
              }
      }

The important thing is inet_csk_reset_keepalive_timer(), which ends up calling sk_reset_timer() in net/core/sock.c, if I'm not mistaken.
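
For completeness, this is roughly how that kernel path gets triggered from
user space; a minimal sketch, assuming fd is an already connected TCP
socket, with error handling kept terse (the set_keepalive() helper name is
mine, not something from a library):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on one socket and override the global /proc defaults
 * on a per-socket basis; returns 0 on success, -1 on error. */
static int set_keepalive(int fd, int idle_secs, int intvl_secs, int probes)
{
        int on = 1;

        /* sets SOCK_KEEPOPEN, so the keepalive timer gets armed at all */
        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
                return -1;
        /* idle time before the first probe; lands in tp->keepalive_time
         * via the TCP_KEEPIDLE case shown above */
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) < 0)
                return -1;
        /* interval between probes, and number of probes before giving up */
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_secs, sizeof(intvl_secs)) < 0)
                return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
                return -1;
        return 0;
}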

There's a lot of TCP timers in the Linux kernel and they all have
[...]
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_time_wait:120

                     ^^^^^^^^^
Are you on drugs or am I too dumb :) ?

Depending on your perception and understanding of these, the answer ranges from neither to both ;).

What do the netfilter timeouts
have to do with tcp_keepalive in general?

Nothing; I hope I've never stated this in my previous emails.

Or did you respond to the
resource I mentioned (it was about netfilter, but I was only interested
in the little part about tcp_keepalive in general ...)?

Where IPVS is concerned, tcp_keepalive does not have much say. IPVS maintains its own state timers and is generally pretty unimpressed by socket timers. Of course, if the keepalive expires on the client's end, we get to see a RST, and from the state table you can figure out what happens next; here is an excerpt for your viewing pleasure (../ipvs/ip_vs_proto_tcp.c):

static struct tcp_states_t tcp_states [] = {
/*      INPUT */
/*        sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */
/*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }},
/*fin*/ {{sCL, sCW, sSS, sTW, sTW, sTW, sCL, sCW, sLA, sLI, sTW }},
/*ack*/ {{sCL, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }},
/*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sSR }},

/*      OUTPUT */
/*        sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */
/*syn*/ {{sSS, sES, sSS, sSR, sSS, sSS, sSS, sSS, sSS, sLI, sSR }},
/*fin*/ {{sTW, sFW, sSS, sTW, sFW, sTW, sCL, sTW, sLA, sLI, sTW }},
/*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sLA, sES, sES }},
/*rst*/ {{sCL, sCL, sSS, sCL, sCL, sTW, sCL, sCL, sCL, sCL, sCL }},

/*      INPUT-ONLY */
/*        sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */
/*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }},
/*fin*/ {{sCL, sFW, sSS, sTW, sFW, sTW, sCL, sCW, sLA, sLI, sTW }},
/*ack*/ {{sCL, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }},
/*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sCL }},
};
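
To read the table: the row is picked by packet direction (INPUT, OUTPUT,
INPUT-ONLY) and by which TCP flag is set, the column by the connection's
current state, and the cell is the next state. A rough sketch of the lookup;
the flag priority mirrors tcp_state_idx() in ip_vs_proto_tcp.c, the rest is
simplified and not the verbatim kernel code:

#include <netinet/tcp.h>        /* struct tcphdr with fin/syn/rst/ack bit fields */

enum { DIR_INPUT = 0, DIR_OUTPUT = 4, DIR_INPUT_ONLY = 8 };  /* row group offsets */

/* Which row within a group applies to this segment; rst wins over syn,
 * syn over fin, fin over ack. */
static int tcp_state_idx(const struct tcphdr *th)
{
        if (th->rst) return 3;
        if (th->syn) return 0;
        if (th->fin) return 1;
        if (th->ack) return 2;
        return -1;              /* nothing interesting set: leave the state alone */
}

/* new_state = tcp_states[direction + tcp_state_idx(th)].next_state[cp->state]; */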

And of course we have the IPVS TCP settings, which would look as follows (if
they weren't disabled in the core :)):

               ^^^^^ disabled? why should we?

No one has complained before. I have re-instated them in the 2.4.x kernel because I needed them. I don't need the fine-grained settings Julian is proposing, but I can envision someone having a use for them. However, not having the ability to set state timeouts at all is not a good tradeoff to me.

timeouts. Since there is no socket (as in an endpoint) involved when
doing either netfilter or IPVS, you have to guess what the TCP flow
in between (where your machine is "standing") is doing, so you can
continue to forward, rewrite, mangle, whatever, the flow _without_
disturbing it. The timers are used for table mapping timeouts of TCP
states. If we didn't have them, mappings would stay in the kernel
forever and eventually we'd run out of memory. If we get them wrong, it
might happen that a connection is aborted prematurely by our host, for
example yielding those infamous ssh hangs when connecting through a
packet filter.

Yes, what I am asking myself is: if we see a FIN or RST flying through a TCP
connection, could we lower the timer significantly or even drop the
entry, because we may know what's happening?

In triangulation mode (LVS_DR) this is impossible to find out, unless you use the forward shared approach written by Julian. I'm actually also not very happy about the current way of detecting TCP state changes; however, this is how it is. If you're interested, ipvs/ip_vs_proto_tcp.c contains the related code that decides which timeout a connection state gets; just check out set_tcp_state().
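
In essence (a simplified sketch with approximate names, not the verbatim
kernel code; the real thing is set_tcp_state() plus the tcp_timeouts table
in ip_vs_proto_tcp.c), the transition table gives the new state, and the
new state indexes a per-state timeout table that reloads the connection
entry's timer:

enum ip_vs_tcp_state { sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA, S_LAST };

struct conn {
        enum ip_vs_tcp_state state;     /* current TCP state of this mapping */
        unsigned long timeout;          /* time until the table entry expires */
};

/* per-state timeouts; ipvsadm --set tunes the most important ones
 * (ESTABLISHED, FIN_WAIT, UDP), the disabled proc entries would cover the rest */
static unsigned long tcp_timeouts[S_LAST];

static void set_state(struct conn *cp, enum ip_vs_tcp_state new_state)
{
        cp->state = new_state;
        cp->timeout = tcp_timeouts[new_state];  /* e.g. 15 min for sES, 2 min for sFW */
}

That is also why --set matters for your lunch-break scenario below: the
ESTABLISHED entry of that table decides how long an idle NAT mapping survives.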

My current problem is a client connection with some keepalive foo to a DB,
balanced (via NAT) to two real servers. The client opens a form, types,
then parks his browser and goes to lunch for an hour or two. Then he
comes back and presses "submit".... validation error....

Umm, why do your clients eat? Maybe you should print a warning in your forms that lunch is forbidden, under penalty of cleaning up the database afterwards. Seriously though, there's really nothing IPVS can do for you at this stage, since the client gets a TCP RST. Increase the TCP keepalive settings and set an equally high persistence timeout.

IMVHO the best solution would be a JavaScript on the client that tries
to get 1 byte from the DB every few minutes, but ....

Won't work on most clients' browsers nowadays, since most use HTTP/1.1 with keepalive and pipelining and thus have at least 2 concurrent sockets open, and JS does not know anything about sockets. Try fiddling with the Apache keepalive settings.

The tcp keepalive timer setting you've mentioned, on the other hand, is
per socket, and as such only has an influence on locally created or
terminated sockets. A quick skimming of socket(2) and socket(7) reveals:

[....]

If SO_KEEPALIVE is not enabled, will the session cease to exist after
2h? Is anyone here aware of what the default values/ranges on Windows machines are?

Read again, tcp(7):

       tcp_keepalive_time
              The number of seconds a connection needs to be idle
              before TCP begins sending out keep-alive probes.
              Keep-alives are only sent when the SO_KEEPALIVE
              socket option is enabled. The default value is
              7200 seconds (2 hours). An idle connection is
              terminated after approximately an additional 11
              minutes (9 probes an interval of 75 seconds apart)
              when keep-alive is enabled.

              Note that underlying connection tracking mechanisms
              and application timeouts may be much shorter.

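Doing the arithmetic with the defaults: 9 probes * 75 s = 675 s, which is
the "approximately 11 minutes" from the man page, so a completely idle
keepalive-enabled connection is torn down after about 7200 s + 675 s =
7875 s, i.e. roughly 2 h 11 min; that is the 2h + 9*75s from above.
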
Of course you can also use tcpdump to study the wickedness of tcp_keepalive ...

On top of that, the keepalive timer has two different meanings, depending on whether we are in a minisocket or in a full socket in state ESTABLISHED.

I'm somewhat missing this view in your cited reference; I did not read it
through thoroughly. My apologies for not being more specific, however I
don't have more time right now.

Well, first of all thanks a lot for your expertise!

Sorry, I can't take more time right now to explain those complex issues. If you're interested in more of the TCP/IP stuff being done in Linux, I suggest you get a copy of "The Linux TCP/IP Stack: Networking for Embedded Systems" by Thomas F. Herbert, ISBN 1-58450-284-3; or read the source, which contains all the original swearing of the Linux TCP stack creators in a director's cut version.

Best regards,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
