Re: DR Load balancing active/inactive connections

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: DR Load balancing active/inactive connections
From: RU Admin <lvs-user@xxxxxxxxxxxxxxxxxx>
Date: Tue, 28 Nov 2006 08:37:28 -0500 (EST)
On Tue, 28 Nov 2006, Horms wrote:

On Tue, Nov 21, 2006 at 08:57:59AM -0500, RU Admin wrote:

I've been using IPVS for almost two years now. I started out with 6
machines (1 director, 5 real servers) using LVS-NAT. During the first
year that I was running that email server, everything worked perfectly
with LVS-NAT. About a year ago, I decided to set up another email
server, this time with 5 machines (1 director, 4 real servers), and
decided it was time to get LVS-DR working, which I successfully did. I
then switched my first email server (the one with 6 machines) over to
LVS-DR as well, since the other LVS-DR server was working great. Both
of my email servers have been working great with LVS-DR for the past
year, with one major exception, which has recently been getting worse
because of the large volume of connections coming into the servers.

The problem I am having is that my active/inactive connections are not
being counted properly. What I mean is that the counters for my
active/inactive connections just keep going up and up, and are
constantly skewed. I read through a good number of archived messages
on this mailing list, and I keep seeing everyone say "Those numbers
ipvsadm is showing are just for reference, they don't really mean
anything, don't worry about them." Well, I can tell you first hand
that when you use wlc (weighted least connections), those numbers
obviously DO mean something. My machines are no longer being balanced
equally because my connection counts are off, and this is really
affecting the performance of my email servers. When running
"ipvsadm -lcn", I can see connections in the CLOSE state going from
00:59 to 00:01, and then magically going back to 00:59 again for no
reason. The same holds true for ESTABLISHED connections: I see them go
from 29:59 to 00:01 and then back to 29:59, when I know for a fact
that the connection from the client has ended.
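
If anyone wants to see this happen for themselves, watching a single
client's entries makes the timer jump obvious (the client IP below is
just an example):

  watch -n 1 "ipvsadm -Lcn | grep 192.0.2.10"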

I seem to recall a bug relating to connection entries having
the behaviour you describe above due to a race in reference counting.
Which version of the kernel do you have? Is there any chance of updating
it to something like 2.6.18?

I'm using a stock Debian Sarge kernel (2.6.8-2-686-smp). I can
definitely build the latest kernel, and if you feel that it will help,
then I'll do that. It's always risky making a major kernel change on a
production machine, which is why I wanted to hold off on that change
until someone else familiar with IPVS felt that it might help.
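
In the meantime, here's a quick sanity check of what's running right
now (the exact paths and flags are from memory, so treat this as a
sketch):

  uname -r                    # running kernel: 2.6.8-2-686-smp
  head -1 /proc/net/ip_vs     # IPVS version compiled into the kernel
  /sbin/ipvsadm --version     # userspace tool version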


I'm currently using "IP Virtual Server version 1.2.0", and I know that
there is a 1.2.1 version available, but my problem is that my email
servers are in a production environment, and I really don't want to
recompile a new kernel with the latest IPVS if that isn't going to
solve the problem.  I'd hate to cause other problems with my system
because of a major kernel upgrade.

I can only hope that someone has some suggestions. I am a firm
supporter of IPVS, and as I said, I've been using it for 2 years now,
and one of my email servers handles over 30,000,000 emails in one
month (almost 1 million emails a day), so we rely heavily on IPVS.
There is another department in our organization that spent thousands
of dollars on FoundryNet load balancing products, and I've been able
to accomplish the same tasks (and handle a higher load) by using IPVS,
so clearly IPVS is a solid product. Unfortunately, I just really need
to figure out what is going on with the connection count problems.

I'm not sure what information you guys need, but here's some info about
my setup.  If you need any more details, feel free to ask.

6 Dell PowerEdge SC1425
Dual Xeon 3.06GHz processors
2GB DDR
160GB SATA
Running Debian Sarge

1 machine is the director, the other 5 are the real servers.  All 6
machines are on the same subnet (with public IPs), and the director is
using LVS-DR for load balancing. Just to give you an idea of the
connection numbers I'm getting:
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
  TCP  vip.address.here:smtp wlc
    -> realserver1.ip.here:smtp     Route   50     648        2357
    -> realserver2.ip.here:smtp     Route   50     650        2231
    -> realserver3.ip.here:smtp     Route   50     648        2209
Whereas when using LVS-NAT (which was 100% perfect), my numbers would be
something like:
    -> realserver1.ip.here:smtp     Masq    50     16         56
    -> realserver2.ip.here:smtp     Masq    50     14         50
    -> realserver3.ip.here:smtp     Masq    50     15         48

I assume that the dumps above are for similar traffic rates.

Yes, almost identical traffic rates, judging by the mail logs for incoming email on the real servers.


I am wondering if the problem is that for some reason the
linux-directors are not seeing the part of the close sequence
that is sent by the end-user (it won't see the portion sent by
the real-servers). Supposing for a minute that this is the case,
it would explain the strange numbers, and those strange numbers
will be affecting how wlc allocates connections.
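
One way to check that on the linux-director is to watch for the
client-originated FIN/RST segments arriving at the VIP (the address
below is a placeholder). If those segments show up here and the
entries still hang around, the problem is in the state tracking rather
than the traffic path:

  tcpdump -ni eth0 'host 192.0.2.1 and port smtp and tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'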

But shouldn't IPVS time the entry out? I thought that was the purpose of the timeouts: when the director doesn't see a close event within the specified period, the entry simply times out.


I use keepalived to manage the director and to monitor the real
servers. The only "tweaking" that I've done to IPVS is that I run
this:
  /sbin/ipvsadm --set 1800 0 0
before starting keepalived, so that active connections will stay
active for 30 minutes. In other words, we allow our users to idle
their connections for 30 minutes, and after that the connection should
be terminated. I put "0 0" there because, from what I've read, that
tells ipvsadm not to change the other two values (in other words, to
leave the defaults as is).
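
To confirm what the kernel actually ended up with, the timeouts can be
read back afterwards; the 120 and 300 below assume the stock tcpfin
and udp defaults were left untouched:

  /sbin/ipvsadm -L --timeout
  Timeout (tcp tcpfin udp): 1800 120 300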

That's about all I can think of. The only other weird thing that I had
to do was tweak some networking settings on the real servers to fix
the pain-in-the-@$$ ARP issues that come with DR. But I doubt those
changes would have anything to do with the director's load balancing
problems. Those tweaks were only done on the real servers, and they
just stop the real servers from announcing the MAC address for the VIP
on their (dummy0) interfaces.

How exactly did you deal with ARP? There are several methods.

On the real servers, I'm first bringing up the dummy0 interface with the VIP, then I use "sysctl" and set the following:
  net.ipv4.conf.dummy0.rp_filter=0
  net.ipv4.conf.dummy0.arp_ignore=1
  net.ipv4.conf.dummy0.arp_announce=2
Then I bring up eth0 with the real server's regular IP address, and with "sysctl", I set the following (includes a repeat of the above options):
  net.ipv4.conf.default.rp_filter=0
  net.ipv4.conf.all.rp_filter=0
  net.ipv4.conf.lo.rp_filter=0
  net.ipv4.conf.dummy0.rp_filter=0
  net.ipv4.conf.eth0.rp_filter=0

  net.ipv4.conf.default.arp_ignore=1
  net.ipv4.conf.all.arp_ignore=1
  net.ipv4.conf.lo.arp_ignore=1
  net.ipv4.conf.dummy0.arp_ignore=1
  net.ipv4.conf.eth0.arp_ignore=1

  net.ipv4.conf.default.arp_announce=2
  net.ipv4.conf.all.arp_announce=2
  net.ipv4.conf.lo.arp_announce=2
  net.ipv4.conf.dummy0.arp_announce=2
  net.ipv4.conf.eth0.arp_announce=2

The ARP problem was the one thing that kept me from moving to LVS-DR for a long time. I kept playing with all of the net.ipv4.conf options and with bringing up the interfaces in a specific order, and eventually stumbled across a method that actually worked. I'm sure some of the above options don't need to be set, but it works now, and I'm a little afraid to touch it.
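
Condensed into a script, the bring-up order on each real server looks
roughly like this (the addresses are placeholders, and as I said, some
of the sysctls are probably redundant):

  #!/bin/sh
  VIP=192.0.2.1     # virtual IP (placeholder)
  RIP=192.0.2.11    # this real server's own IP (placeholder)

  # Bring the VIP up on dummy0 first, and make sure it won't answer
  # ARP before eth0 ever comes up.
  ifconfig dummy0 $VIP netmask 255.255.255.255 up
  sysctl -w net.ipv4.conf.dummy0.rp_filter=0
  sysctl -w net.ipv4.conf.dummy0.arp_ignore=1
  sysctl -w net.ipv4.conf.dummy0.arp_announce=2

  # Only then bring up the real interface, with the same ARP settings
  # applied everywhere.
  ifconfig eth0 $RIP netmask 255.255.255.0 up
  sysctl -w net.ipv4.conf.all.rp_filter=0
  sysctl -w net.ipv4.conf.all.arp_ignore=1
  sysctl -w net.ipv4.conf.all.arp_announce=2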

I'm going to try to build the latest 2.6.18 now, and hopefully sometime later this week I can install the new kernel and reboot our director. Unfortunately, I've never been able to get keepalived to handle a MASTER/SLAVE director pair properly, so I only have one director in front of the real servers; if I make a mistake, our main university email server will be down.

Thanks for your help!

Craig




--
Horms
 H: http://www.vergenet.net/~horms/
 W: http://www.valinux.co.jp/en/

_______________________________________________
LinuxVirtualServer.org mailing list - lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Send requests to lvs-users-request@xxxxxxxxxxxxxxxxxxxxxx
or go to http://www.in-addr.de/mailman/listinfo/lvs-users



     +---------------------------+-------------------------------------+
     | Craig Hynes               | Systems Programmer/Administrator    |
     | master@xxxxxxxxxxxxxxxxxx | Rutgers Camden Computing Services   |
     | (856) 225-2668            | http://computing.camden.rutgers.edu |
     +---------------------------+-------------------------------------+
