To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: memory use on long persistent connection (eg for e-commerce sites, squids)
From: Roberto Nibali <ratz@xxxxxxxxxxxx>
Date: Fri, 20 Sep 2002 14:59:29 +0200
Hi,

> 2 on-topic questions at the bottom, honest!

;)

> It's not too bad: I use the NT2k built-in service checker, which restarts IIS on failure (normally takes about 20 seconds). Server reboots are very rare.

But from what I can see of your CISCO LD timing, this is not the effective downtime of the server. Even if IIS restarts within 20s, the LD will not forward requests to the server for another 40s, right?

BTW, are you using your CLD in NAT or triangulation mode? In NAT mode there is a possibility of up to 10% packet loss under certain circumstances (maybe they've fixed it since I tested it back in 2000).

> The current live site has a CISCO 416 LocalDirector in front of it that detects 8 failures (ACKs?) in a row, then takes the real server out of action for 1 minute+ (until it comes back online).

Ugh, what about network congestion?

> Sometimes, however, IIS crashes but still manages to respond to HTTP GET requests (CISCO say this is a bug in IIS), and then the CISCO can't detect the real server failure...

Cisco has to give you that answer because they can't fix it, and the reason is very simple: the CLD is a hardware load balancer and thus can only be equipped with very simplistic healthchecks such as ping, TCP connect, HTTP GET, RADIUS, POP/IMAP and a few other protocols/services. But every 'bigger' site has content-specific data that is most of the time created dynamically via DB calls and whatnot. Now it can happen that IIS crashes but HTTP GET still works, either because the page is in IIS's cache handler (which didn't crash) or because that page is somehow not affected by the crash.

For such cases you need sophisticated healthchecks, which can be quite lengthy and complex in nature. But they assure that everything you want to run does run, by doing the appropriate test. Look at it as an automated QA test that verifies that no one (not even the process itself) has changed the specifications.
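A minimal sketch of such a content-specific healthcheck, assuming a hypothetical /health.asp page that renders a known marker token only when the application logic and its DB backend actually work (URL, token and addresses here are illustrative, not from any real setup):

```python
#!/usr/bin/env python3
"""Content-specific healthcheck sketch (hypothetical page and token).

A server whose web process still answers plain HTTP GETs but whose
application has crashed will fail this check, because the expected
marker string is only rendered when the dynamic code path works.
"""
import urllib.request


def check_real_server(url, expected_token, timeout=5):
    """Return True only if the page loads AND contains the token."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return False                  # connect/read failure -> down
    return expected_token in body     # page up but wrong content -> down
```

A monitoring script (or an ldirectord-style wrapper) would call something like check_real_server("http://192.168.1.10/health.asp", "DB-OK") for each RS and remove servers for which it returns False.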

As you can imagine, the space and complexity of what you can implement in ASICs/FPGAs is limited, and of course you can't implement a sophisticated healthcheck script in hardware. CISCO can't do it for you, and you can't do it either. You're stuck with the existing healthchecks (which are accurate enough most of the time).

A typical equivalent situation is Oracle. Oracle manages to crash internally but still somehow delivers SNMP or other relevant data. Now if you only have a healthcheck that checks for the correct SNMP values, and maybe a SQL manager port connect and a successful user/passwd login, you might not recognize that Oracle has crashed because of a memory segment violation somewhere inside its own wicked world.

That is why the LVS project is such a nice approach from the design point of view. You're not restricted by any hardware issues (unless you need Gigabit Ethernet and actually have that amount of traffic), and you can write your healthcheck in whatever language pleases you.

> That's why I have LVS / ldirectord under test at the moment, so that I can force it to check for a specific page and test the result.

Exactly.

> 1) The only problem I have so far is that all the solutions I've tried for a non-ARP interface on the real servers seem to knock out Windows file sharing (SMB/CIFS), which I use for ROBOCOPY replication of files...

So much for Microsoft's own way of defining how things should work, starting from layer 4. No, let's get serious and keep the rant out: I can very well imagine that this happens. Maybe you have to spend a few bucks, put an additional NIC into your RS, and do the ROBOCOPY over those dedicated interfaces. How does that sound?

> 2) Possible feature request?
> The CISCO has a slow-start option, i.e. it brings the real server back online slowly in order not to overload it...

Ok.

> Without this option our real servers will sometimes continuously crash as soon as they are brought online, because the load is too high...

I've seen this on HP boxes running Netscape servers as well.

> I think it's because IIS caches the script the first time it runs, and this takes about 8 seconds; if a rush of people ask for a page while the script is compiling, then IIS dies.

Yes.

> Could this be made an option in LVS?

Well, there are two ways to achieve it, both of which have basically been implemented already or are available by design default:

o threshold limitation (I've made a patch for the 2.2.x kernel series,
  and for the 2.5.x kernels it is already in):
  - You would need to dynamically raise the RS's upper threshold in
    fixed timeslices. This will do the job. Once the server seems to
    be stable, you can remove the RS threshold limitation for it. You
    have to write a script that does this, but it is not very
    difficult. And actually this is a great idea. I've only done this
    for a shutdown procedure, but I think I will do it for a startup
    sequence too, using something like a TCP slow-start algorithm,
    only without window notification feedback :)

o Use QoS as an egress policy on the outgoing interface of the load
  balancer to rate-limit the incoming (actually outgoing) requests to
  the RS. This limit also has to be raised in fixed timeslices; a
  script similar to the one above would result.
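The first variant above can be sketched as a small script that ramps the real server up in fixed timeslices, doubling each step like TCP slow start (without any feedback). This sketch only prints the ipvsadm invocations it would run; the VIP, RIP and port are hypothetical, and a real script would execute each command and then sleep for one timeslice:

```python
#!/usr/bin/env python3
"""Slow-start sketch for bringing a real server back online.

Computes a weight schedule 1, 2, 4, ... up to the final weight and
prints the corresponding ipvsadm edit command for each timeslice,
instead of executing anything.
"""


def slow_start_weights(final_weight, start=1):
    """Yield the weight for each timeslice: start, 2*start, ..., final."""
    w = start
    while w < final_weight:
        yield w
        w *= 2
    yield final_weight


def ramp_commands(vip, rip, final_weight):
    """Return one 'ipvsadm -e' (edit real server) command per timeslice."""
    return [
        f"ipvsadm -e -t {vip}:80 -r {rip}:80 -m -w {w}"
        for w in slow_start_weights(final_weight)
    ]


# Hypothetical VIP/RIP; print the schedule instead of running it.
for cmd in ramp_commands("10.0.0.1", "192.168.1.10", 100):
    print(cmd)
```

The same loop structure would work for raising a per-RS connection threshold instead of the weight; only the printed command changes.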

> I know you could probably fake it using WLC and a mod to ldirectord, but that doesn't sound like the right way to do it.

Exactly. The application level is not the right layer to do this.

> BTW, did I say thanks for a great product?

Which product are you referring to?

Best regards and I hope this helps,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq'|dc


