Re: RedHat ES3 LVS-Nat - Arp issues

To: Michael Sztachanski <michael.sztachanski@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: RedHat ES3 LVS-Nat - Arp issues
Cc: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
From: Roberto Nibali <ratz@xxxxxxxxxxxx>
Date: Tue, 28 Sep 2004 11:37:53 +0200
Dear Michael,

Do you use Hubs or Switches?

Cisco Switches, not sure of their config as they are handled by our network 
services department.

Ok, so you definitely have a secluded collision domain. I'm only asking because if you were still using hubs, your L2 collision domain would span further than anticipated and could thus cause the massive number of ARP entries you're seeing in the routing cache.

I'm getting copious amounts of ARP traffic and caching at eth0 on both LVS
routers. I'm expecting 4000 users to go through this LVS; will that much
ARP traffic on the eth0 side kill connections? I have already increased
the ARP cache size to 4096, but I'm still getting overflows.

Which settings did you perform exactly?

Adjusted the gc_thresh from 1024 in  
/proc/sys/net/ipv4/neigh/default/gc_thresh3 to 4096.

Just to make sure, how do the other values in this directory look? Also note that setting ../default/<key>=<value> does not help the current situation: the {default} entries in proc-fs are only used when a new device is created, which is not the case in your setup. So I reckon your /proc/sys/net/ipv4/neigh/{eth0,eth1}/gc_thresh3 values are still as low as they were at boot time. You would need to adjust those as well.
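
A quick sketch of what I mean (the 1024/2048/4096 values are only an example, adjust them to whatever headroom you need):

   # what do the per-device values currently look like?
   grep . /proc/sys/net/ipv4/neigh/eth0/gc_thresh* \
          /proc/sys/net/ipv4/neigh/eth1/gc_thresh*

   # raise the limits on the live interfaces, not only on "default"
   for dev in eth0 eth1; do
       echo 1024 > /proc/sys/net/ipv4/neigh/$dev/gc_thresh1
       echo 2048 > /proc/sys/net/ipv4/neigh/$dev/gc_thresh2
       echo 4096 > /proc/sys/net/ipv4/neigh/$dev/gc_thresh3
   done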

What are your gc_thresh* settings? How big is your neighbour table?

there are over 1900 entries

Either it's the thing I mentioned in the last paragraph, or the dst cache GC doesn't kick in for some reason, which I would be very interested in debugging :).
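
To see how close you actually get to the limit, something like this should do (both views ought to roughly agree):

   # count the current IPv4 neighbour/ARP entries
   ip -4 neigh show | wc -l
   awk 'NR > 1' /proc/net/arp | wc -l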

nat_router = 172.24.24.1 eth1:1
nat_nmask = 255.255.255.0
debug_level = NONE
virtual gnetest {
    active = 1
    address = 10.0.1.99 eth0:1
    vip_nmask = 255.255.248.0

why not 255.255.255.255?
Are you asking about the vip_netmask or the nat_netmask?

The vip_nmask.

The netmasks shown are our internal masks.

Yes, and this is also correct.

The values are as per the RH documentation.
There was no mention of the value you suggested.

Strange. It's not an absolute must, but it's an advantage because the box then only answers the ARP probe for the VIP itself instead of an overlapping probe range.
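
In lvs.cf that is only a one-line change in the virtual block you posted (the rest of the block stays as it is):

   virtual gnetest {
       active = 1
       address = 10.0.1.99 eth0:1
       vip_nmask = 255.255.255.255
       ...
   }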

    port = 80
    persistent = 3600

do you need such a high persistency?

The users talk to a web app on IIS web servers that talk to a database
that requires a minimum of 1 hr persistence.

Ok, I also see from your GUI output that either your application or your webservers are extremely busy serving a request. Must be a complex site.

The VIP should have 255.255.255.255 as a mask. The RH Doco had 255.255.255.0. Sorry for my ignorance. What is the reason for this?

The reason is that the netmask for the VIP overlaps with the primary IP on your physical interface, which then sends ARP replies for both IPs. It would be wise to have only a VIP/32, which would not make the stack reply for the whole {eth0}/21 range.
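
Just to illustrate what the alias should end up looking like (pulse/piranha normally brings it up for you once vip_nmask is set to /32, so this is not something you would need to type by hand in production):

   ifconfig eth0:1 10.0.1.99 netmask 255.255.255.255 broadcast 10.0.1.99 up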

eth0:88   Link encap:Ethernet  HWaddr 00:0D:60:9C:08:86
          inet addr:10.0.1.88  Bcast:10.0.7.255  Mask:255.255.248.0

eth0:89   Link encap:Ethernet  HWaddr 00:0D:60:9C:08:86
          inet addr:10.0.1.89  Bcast:10.0.7.255  Mask:255.255.248.0

What are eth0:88 and eth0:89 for?
These are the external addresses 10.0.1.88 and .89 that I've NATed in
the iptables rules to 172.24.24.2 and .3.
This is so the developers can RDP to each box individually.

Why? It's your internal network. Why can't they be reached over the 10.0.0.0/21 net? Your network setup is rather confusing to me ;). And why do you need to NAT at all to reach a local address?
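
Just so I'm sure I understand your setup, I assume the rules look roughly like this (the RDP port and the exact mapping are guesses on my part, since you didn't post the rules themselves):

   # DNAT the per-server entry points to the real servers for RDP
   iptables -t nat -A PREROUTING -d 10.0.1.88 -p tcp --dport 3389 \
            -j DNAT --to-destination 172.24.24.2
   iptables -t nat -A PREROUTING -d 10.0.1.89 -p tcp --dport 3389 \
            -j DNAT --to-destination 172.24.24.3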

Taken at 16:25 from the RH Web GUI.
IP Virtual Server version 1.0.8 (size=65536)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.0.1.99:80 rr persistent 360000 FFFFFFFF
                                 ^^^^^^
                      just to make sure: this is what you want?

-> 172.24.24.21:80 Masq 1 540 8
-> 172.24.24.22:80 Masq 1 639 1

Those are really big numbers of active connections; I wonder what kind of application takes so long to serve a request.
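
If the 360000 s persistence timeout is not what you intended, you can fix the live table with ipvsadm as a stopgap (sketch only; pulse will reapply whatever is in lvs.cf on the next restart, so correct the config as well):

   # set persistence to 3600 s with per-client (/32) granularity
   ipvsadm -E -t 10.0.1.99:80 -s rr -p 3600 -M 255.255.255.255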

Take care,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc