Re: load balancing trouble at a high load

To: <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: load balancing trouble at a high load
From: "Hideaki Kondo" <toreno4257@xxxxxxxxx>
Date: Sun, 2 Jul 2006 22:57:14 +0900
Hello,

I have mostly found the reason for the limit (about 28230) on ActConn + InActConn.
The default of /proc/sys/net/ipv4/ip_local_port_range is "32768 61000".
In short, 61000 - 32768 = 28232.
Note that our test environment has only one client.
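For reference, the client's ephemeral port range can be checked and, if
needed, widened as below (the replacement values are only an illustration,
not a recommendation):

  # current ephemeral port range on the client (CL1)
  cat /proc/sys/net/ipv4/ip_local_port_range
  32768   61000

  # widen the range (example values)
  sysctl -w net.ipv4.ip_local_port_range="1024 65000"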

The hash key of ip_vs_conn_tab (the connection table) is based on
protocol, s_addr (caddr), s_port (cport), d_addr (vaddr), and d_port (vport).
For a single client talking to the same virtual server, only the source port
varies, so at most 28232 (with the default port range) distinct entries are
possible. Therefore I think a per-client limit on ActConn + InActConn exists
at a high load: the set of connection entries from the same client to the
same virtual server (and on to a realserver) simply fills up.
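If this is right, counting the connection entries for the single client on
LB1 should show the table saturating near 28232. For example (the client
address below is hypothetical, since I haven't listed CL1's IP):

  # count connection entries from one client (hypothetical address)
  ipvsadm -Lcn | grep 192.168.0.200 | wc -l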

So I think the strange behavior at a high load was caused by the above.
In short, the load balancing trouble at a high load mainly comes from the
ip_vs_conn table, whose entries are keyed by the elements above, combined
with the limited port range of a single client. But I think this aspect of
ip_vs is no problem in a real environment.
# I'm very sorry for my poor English.

Thanks a lot.
Best regards,

----- Original Message -----
From: "Hideaki Kondo" <kondo.hideaki@xxxxxxxxxxxxx>
To: <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Thursday, May 25, 2006 7:22 PM
Subject: Re: load balancing trouble at a high load



Hello,

Let me give a supplementary explanation of my report,
since my earlier mail lacked some information.

<<Trouble Process>>
(1)Put a high load on LB1 with while_wget & while_ab from CL1.
   (while_wget and while_ab are simple shell scripts: while_wget repeats
   "wget -O index.html http://192.168.0.101:80/index.html" without sleeping,
   and while_ab repeats "ab -n 10000 -c 10 http://192.168.0.101:80/index.html"
   without sleeping; a sketch of both is given below.)
   LB1 is correctly load-balancing to RS1 and RS2 by round-robin at this time.
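A minimal sketch of the two scripts, matching the commands described above:

  #!/bin/sh
  # while_wget: fetch the test page in a tight loop, no sleep
  while true; do
      wget -O index.html http://192.168.0.101:80/index.html
  done

  #!/bin/sh
  # while_ab: run ab in a tight loop, no sleep
  while true; do
      ab -n 10000 -c 10 http://192.168.0.101:80/index.html
  done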
(2)After a few minutes, the maximum limit(?) of ActiveConn + InActConn
   seems to be reached.

It seems that the trouble always occurs when ActiveConn + InActConn gets
close to a total of 28231, as follows.

------------------------------------------------------------------------
IP Virtual Server version 1.2.0 (size=4096)
Prot LocalAddress:Port Scheduler Flags
 -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.0.101:http rr
 -> rs02:http                    Masq    1      0          1
 -> rs01:http                    Masq    1      1          28229
------------------------------------------------------------------------

   Then crash the NIC (eth0) of RS2 intentionally by manually executing
   "ifconfig eth0 down".
   (while_wget & while_ab freeze at this point; that state is no problem.)
   Then change the weight from 1 to 0 by manually executing
   "ipvsadm -e -t 192.168.0.101:80 -r 192.168.1.2:80 -w 0 -m".
   (I tested this process without ldirectord so that I could check the
   behavior of LVS in detail.)

(3)Put a new high load on LB1 with while_wget & while_ab from CL1,
   replacing the old high load from (1).
   LB1 is correctly sending http packets only to RS1 at this time.

(4)Then recover the NIC (eth0) of RS2 intentionally by manually executing
   "/etc/init.d/network restart".
   After a while, LB1 starts sending http packets to both RS1 and RS2, even
   though the weight of RS2 is still 0. Moreover, LB1 sends far fewer
   packets to RS2 than to RS1.
   (This strange behavior continues indefinitely, so I don't think its cause
   lies only in the retransmit processing of the TCP layer.
   In fact, the strange behavior stops when I stop the high load from CL1.)

I enabled "IP virtual server debugging" in "make menuconfig" and rebuilt
kernel-2.6.9-22.EL, then set "net.ipv4.vs.debug_level=15" in sysctl.conf
and added "kern.*  /var/log/kernel.log" to syslog.conf.
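Concretely, the two configuration lines were:

  # /etc/sysctl.conf
  net.ipv4.vs.debug_level = 15

  # /etc/syslog.conf
  kern.*          /var/log/kernel.log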
As far as I can tell from kernel.log, LB1 does not seem to be
load-balancing to RS2 at stage (4). ???
So I cannot deny that the strange behavior is related to the retransmit
processing of the TCP layer and the like.

Checking with "ipvsadm -Lc", there are many entries in TIME_WAIT state;
the InActConn number seems to reflect them.
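For example, the TIME_WAIT entries can be counted like this (TIME_WAIT is
the state name as printed in the ipvsadm connection listing):

  # count connection entries currently in TIME_WAIT on LB1
  ipvsadm -Lcn | grep TIME_WAIT | wc -l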
By the way, referring to the ip_vs source code (ip_vs_proto_tcp.c),
IP_VS_TCP_S_TIME_WAIT is 2*60*HZ.
When I changed IP_VS_TCP_S_TIME_WAIT from 2*60*HZ to something much smaller
such as 10*HZ, the strange behavior seemed to improve.
Is IP_VS_TCP_S_TIME_WAIT related to the cause of the trouble?
I suspect some timers in LVS are related to the behavior...??


(5)After observing this strange behavior for a while, change the weight
   from 0 back to 1 by manually executing
   "ipvsadm -e -t 192.168.0.101:80 -r 192.168.1.2:80 -w 1 -m".
   But the strange behavior still continues indefinitely: LB1 sends far
   fewer packets to RS2 than to RS1 in spite of round-robin.

To be precise, there are two cases:
in one, LB1 sends far fewer packets to RS2 than to RS1;
in the other, LB1 sends no packets to RS2 at all.
The same two cases also occur in (4).


(6)Then stop all the high load (while_wget & while_ab) from CL1 and wait a
   few minutes until ActiveConn + InActConn drops close to 0.
   Then start a new high load from CL1 with while_wget & while_ab;
   LB1 correctly and evenly load-balances to RS1 and RS2, just as in (1).

Is this trouble related to some timers, u_threshold, dest->flags
(IP_VS_DEST_F_OVERLOAD/IP_VS_DEST_F_AVAILABLE), etc. in LVS?
Is this strange behavior correct according to the specification of LVS?
(I don't understand the specification of LVS in detail.)
Is there any way to cope with this trouble?

I'm sorry for the many questions.
If you have any information or hints about this trouble,
would you please share them with me?


Thanks in advance.
Best regards,

--
Hideaki Kondo


