LVS
lvs-users
Google
 
Web LinuxVirtualServer.org

Re: Very strange behavior

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: Very strange behavior
From: Roberto Nibali <ratz@xxxxxxxxxxxx>
Date: Thu, 28 Feb 2002 11:21:46 +0100
Hi,

Active connections and Inactive connections show all zeros with the
exception of one real server that shows up to 4 connections.  Here's the
ipvsadm output when working correctly after a reboot.

IP Virtual Server version 1.0.8 (size=524288)
                                        ^^^^^^
This is a rather high hash table size, you don't need to set it so high.

Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP  216.151.100.246:443 rr persistent 600
-> 216.151.100.253:443 Route 1 0 0 TCP 216.151.100.246:5555 rr -> 216.151.100.251:5555 Route 1 5 0 TCP 216.151.100.246:80 rr -> 216.151.100.253:80 Route 1 300 7984 -> 216.151.100.252:80 Route 1 324 8218
When the phones start ringing, the output is the same except activeconn
           ^^^^^^^^^^^^^^^^^^^^
      How can I understand/interpret that?

and inactconn report zero with the exception of 216.151.100.252 which
reports 0-3 activeconn and 0 inactconn.

Just to make sure: We only and always talk about the service defined by 216.151.100.246:80?

I don't get it. Do you understand you correctly, if I say:

o everything works after a reboot for a while with service VIP:80
o after a undefined amount of time your picture looks like follows:

 TCP  216.151.100.246:80 rr
   -> 216.151.100.253:80             Route   1      0          0
   -> 216.151.100.252:80             Route   1      2          0

o no request will be forwarded to ~.253 anymore from that point on
o flushing doesn't help (this is clear, actually)
o we have only a trace of a working setup

Nothing but rebooting the director seems to get rid of the problem.

That I doubt, but lets have a clean analysis.

Nothing is a very strong word.  I doubt this as well but it's all I've
seen that fixes the problem.

Ok, how long does it take until this situation occurs normally? Do you send requests during the reboot?

Here's a dump from the director taking a request for the farm on
216.151.100.246:80

With what do you send your requests? They all seem to have the same CIP:CIPPort.

384:384(0) ack 8704 win 23168 <nop,nop,timestamp 379374999 56808235> (DF)
16:30:48.586848 eth0 > 64.42.115.38.42710 > 216.151.100.246.www: .
384:384(0) ack 8704 win 23168 <nop,nop,timestamp 379374999 56808235> (DF)
16:30:48.591695 eth0 < 64.42.115.38.42710 > 216.151.100.246.www: F
384:384(0) ack 9449 win 26064 <nop,nop,timestamp 379374999 56808235> (DF)
16:30:48.591800 eth0 > 64.42.115.38.42710 > 216.151.100.246.www: F
384:384(0) ack 9449 win 26064 <nop,nop,timestamp 379374999 56808235> (DF)

Could you send us following infos, please?

o From the director:
    ip link show dev eth0
    ip neigh show dev eth0

o From the RIPs:
    ip link show dev eth0
    ip neigh show dev eth0

Here's a dump from the real server 216.151.100.253:80 following the above
request.

Thank you. Can you check if ~.252 does something, when the request is being forwarded to ~.253 like in that case? Just check if tcpdump -len on ~.252 jumps, when it shouldn't.

tcpdump: listening on eth0
14:12:27.309782 64.42.115.38.42710 > 216.151.100.246.http: S
1546816788:1546816788(0) win 5840 <mss 1460,sackOK,timestamp 379374974
0,nop,wscale 0> (DF)
14:12:27.309813 216.151.100.246.http > 64.42.115.38.42710: S
889182054:889182054(0) ack 1546816789 win 5792 <mss 1460,sackOK,timestamp
56808216 379374974,nop,wscale 0> (DF)
14:12:27.364463 64.42.115.38.42710 > 216.151.100.246.http: . ack 1 win
5840 <nop,nop,timestamp 379374980 56808216> (DF)

Good.

14:12:27.377807 64.42.115.38.42710 > 216.151.100.246.http: P 1:385(384)
ack 1 win 5840 <nop,nop,timestamp 379374980 56808216> (DF)
14:12:27.377840 216.151.100.246.http > 64.42.115.38.42710: . ack 385 win
6432 <nop,nop,timestamp 56808223 379374980> (DF)
14:12:27.378621 216.151.100.246.http > 64.42.115.38.42710: P 1:16(15) ack
385 win 6432 <nop,nop,timestamp 56808223 379374980> (DF)
14:12:27.382864 216.151.100.246.http > 64.42.115.38.42710: . 16:1464(1448)
ack 385 win 6432 <nop,nop,timestamp 56808224 379374980> (DF)
14:12:27.434084 64.42.115.38.42710 > 216.151.100.246.http: . ack 16 win
5840 <nop,nop,timestamp 379374987 56808223> (DF)
14:12:27.434108 216.151.100.246.http > 64.42.115.38.42710: .
1464:2912(1448) ack 385 win 6432 <nop,nop,timestamp 56808229 379374987>
(DF)
14:12:27.434113 216.151.100.246.http > 64.42.115.38.42710: .
2912:4360(1448) ack 385 win 6432 <nop,nop,timestamp 56808229 379374987>
(DF)
14:12:27.448461 64.42.115.38.42710 > 216.151.100.246.http: . ack 1464 win
8688 <nop,nop,timestamp 379374988 56808224> (DF)
14:12:27.448496 216.151.100.246.http > 64.42.115.38.42710: .
4360:5808(1448) ack 385 win 6432 <nop,nop,timestamp 56808230 379374988>
(DF)
14:12:27.448501 216.151.100.246.http > 64.42.115.38.42710: .
5808:7256(1448) ack 385 win 6432 <nop,nop,timestamp 56808230 379374988>
(DF)
14:12:27.500357 64.42.115.38.42710 > 216.151.100.246.http: . ack 4360 win
14480 <nop,nop,timestamp 379374993 56808229> (DF)
14:12:27.510547 64.42.115.38.42710 > 216.151.100.246.http: . ack 5808 win
17376 <nop,nop,timestamp 379374994 56808230> (DF)
14:12:27.514784 64.42.115.38.42710 > 216.151.100.246.http: . ack 7256 win
20272 <nop,nop,timestamp 379374994 56808230> (DF)
14:12:27.559938 64.42.115.38.42710 > 216.151.100.246.http: . ack 8704 win
23168 <nop,nop,timestamp 379374999 56808235> (DF)

Just out of curiousity, is CIP a Windows client or do you have bandwidth limitation somewhere on a router?

14:12:27.564873 64.42.115.38.42710 > 216.151.100.246.http: F 385:385(0)
ack 9449 win 26064 <nop,nop,timestamp 379374999 56808235> (DF)
14:12:27.564891 216.151.100.246.http > 64.42.115.38.42710: . ack 386 win
6432 <nop,nop,timestamp 56808242 379374999> (DF)

I'm not familiar with the ipvs debug output.  I set debug_level to 7.  A
lot of the output looks truncated for some reason.  I don't know what
corresponds to my connection so here's a chunk of what's currently going
on.

Your ringbuffer is full and the kernel cannot log the stuff anymore. Either you do have a lot of incoming connections/sec or you log something else like hell or you have a 80486 as a director. Try lowering debug_level to 4 or so and see if it's still truncated. Do you run syslogd/klogd or syslog-ng?

Feb 27 16:58:11 lb kernel: TCP 65.58.153.251:1255->216.151.100.246:80 hit Feb 27 16:58:13 lb kernel: .151.100.253:80 activeconns 316 refcnt 9447 weight 1 Feb 27 16:58:18 lb kernel: ip_vs_rr_schedule(): Scheduling... Feb 27 16:58:19 lb kernel: _service fwm 0 TCP 216.151.100.246:80 hit Feb 27 16:58:21 lb kernel: 0.252:80 activeconns 333 refcnt 9371 weight 1 Feb 27 16:58:22 lb kernel: 6.151.100.246:80 hit Feb 27 16:58:23 lb kernel: : look/in TCP 66.99.108.21:1285->216.151.100.246:80 hit

Sorry, this doesn't help too much like that, I see that connections are being forwarded correctly. And the other problem is, that this is a case where everything works. I need this output when your problem shows up.

I hope this helps.  I don't have any output from when the problem is
occuring, but I'm sure I will see it again.  I'll be sure to note the
debugging information and record some tcpdump info.  Let me know if you
need anything else.

The arp's I think to verify your setup. Not that we don't trust you, but more eyes are always better ;)

Best regards,
Roberto Nibali, ratz



<Prev in Thread] Current Thread [Next in Thread>