Hi,
Two on-topic questions at the bottom, honest!
;)
It's not too bad. I use the NT2k built-in service checker that restarts
IIS on failure (normally takes about 20 seconds). Server reboots are
very rare.
But from what I can see about your CISCO LD timing this is not the
effective downtime of the server. Even if IIS restarts within 20s, the
LD will not forward requests to the server for another 40s, right?
BTW, are you using your CLD in NAT mode or in triangulation mode? In
NAT mode there is a possibility of up to 10% packet loss under certain
circumstances (maybe they've fixed it since I tested it back in 2000).
The current live site has a CISCO 416 local director in front of it that
detects 8 failures (ACKs?) in a row, then takes the real server out of
action for at least 1 minute (until it comes back online).
Ugh, what about network congestion?
Sometimes however IIS crashes but still manages to respond to HTTP GET
requests (CISCO say this is a bug with IIS) and then the CISCO can't
detect the real server failure...
Cisco has to give you that answer because they can't fix it. And the
reason is very simple. The CLD is a hardware load balancer and thus can
only be equipped with very simplistic healthchecks such as ping, TCP
connect, HTTP GET, RADIUS, POP/IMAP and some other protocols/services.
But every 'bigger' site has content specific data that is most of the
time created dynamically via a DB calls and whatnot. Now it can happen
that IIS crashes but HTTP GET still works either because it is in the
cache handler of the IIS (which didn't crash) or because this page is
not affected by the crash somehow.
For such cases you need sophisticated healthchecks, which can be quite
lengthy and complex in their nature. But they assure that everything you
want to run does run, by doing the appropriate test. Look at it as an
automated QA test that verifies that no one (not even the process itself)
has changed the specifications.
As you can imagine, the space and complexity of what you can implement
in ASIC/FPGAs is limited and of course you can't implement a
sophisticated healthcheck script in hardware. CISCO can't make it for
you and you can't make it either. You're stuck with the healthchecks
that are delivered (which are accurate enough most of the time).
A typical equivalent situation is Oracle. Oracle manages to crash inside
but still somehow deliver SNMP or other relevant data. Now if you only
have a healthcheck that checks for the correct SNMP values and maybe a
SQL manager port connect and successful user/passwd login, you might not
recognize that Oracle has crashed because of a memory segment violation
somewhere inside its own wicked world.
That is why the LVS project is such a nice approach from the design
point of view. You're not restricted by any hardware issues (unless you
need Gigabit Ethernet and do have this amount of traffic) and you
can write your healthcheck in whatever language pleases you.
That's why I have LVS / Ldirectord under test at the moment, so that I
can force it to check for a specific page and test the result.
Exactly.
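To make the "check for a specific page and test the result" idea concrete, here is a minimal sketch in Python. It is not what ldirectord does internally; the function names, the URL and the marker string are all made up for illustration. The point is only that the check looks at the page content instead of trusting a bare HTTP GET.

```python
import http.client
import urllib.error
import urllib.request


def response_is_healthy(status, body, expected):
    """Healthy only if the status is 200 AND the body contains the marker.

    `expected` is a hypothetical marker string that the test page prints
    only after it has really exercised the application (DB call etc.).
    """
    return status == 200 and expected in body


def page_is_healthy(url, expected, timeout=5.0):
    """Fetch `url` and apply response_is_healthy(); any error means unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Connection refused, timeout, non-2xx answer, garbage reply,
        # malformed URL: all of these count as a failed check.
        return False
    return response_is_healthy(status, body, expected)
```

A monitor could call something like page_is_healthy("http://192.168.1.10/lvs-test.asp", "DB-OK") for each real server (address and page name are placeholders) and take the server out of the pool when it returns False.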
1) The only problem I have so far is that all the solutions I've tried
for a non-ARP interface on the real servers seem to knock out Windows
file sharing (SMB/CIFS), which I use for ROBOCOPY replication of files...
So much for Microsoft's own way of defining how things should work
starting from L4. No, let's get serious and keep the rant out: I
can imagine very well that this happens. Maybe you have to spend a few
bucks, put an additional NIC into your RS and do the ROBOCOPY over
those dedicated interfaces. How does that sound?
2) Possible feature request ?
The CISCO has a slow start option i.e. bring the real server back
online slowly in order to not overload it..
Ok.
Without this option our real servers will sometimes continuously
crash as soon as they are brought online 'cause the load is too high...
I've seen this on HP boxes running Netscape servers as well.
I think it's 'cause IIS caches the script the first time it runs and this
takes about 8 seconds, and if a rush of people ask for a page while the
script is compiling then IIS dies.
Yes.
Could this be made an option in LVS ?
Well, there are two ways to achieve it, both of which have basically
been implemented already or are available by design:
o threshold limitation (I've made a patch for the 2.2.x kernel series,
and for the 2.5.x kernels it is already in):
- You would need to raise the RS upper threshold dynamically in
fixed timeslices. This will do the job. Once the server seems to be
stable, you can remove the RS threshold limitation for it. You have
to write a script that does this, but that is not very
difficult. And actually this is a great idea. I've only done this
for a shutdown procedure, but I think I will do it for a startup
sequence too, using something like a TCP slow-start algorithm, only
without window notification feedback :)
o Use QoS with an egress policy on the outgoing interface of the load
balancer to rate-limit the incoming (actually outgoing) requests to
the RS. This limit also has to be adjusted upwards in fixed
timeslices. A script similar to the one for the threshold approach
would result.
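A rough sketch of such a ramp-up script for the threshold variant, in Python. Assumptions, all mine: the threshold is set with the -x (upper threshold) option of current ipvsadm, where 0 means "no limit" (the 2.2.x patch may use a different interface); the limit doubles per timeslice like TCP slow start; and the limit is dropped after the last slice without any real stability check. By default the commands are only printed, not executed.

```python
import time


def ramp_thresholds(start, ceiling, factor=2):
    """Yield one upper threshold per timeslice: start, start*factor, ...
    capped at `ceiling`, then a final 0 (assumed to mean 'no limit')."""
    if start < 1 or ceiling < start or factor < 2:
        raise ValueError("need start >= 1, ceiling >= start, factor >= 2")
    limit = start
    while limit < ceiling:
        yield limit
        limit = min(limit * factor, ceiling)
    yield ceiling
    yield 0


def ipvsadm_edit_cmd(vip, rip, upper, extra=()):
    """Build (not run) an ipvsadm call that edits one real server.

    `extra` should carry the forwarding method and weight already in
    use (e.g. ("-g", "-w", "1")), since editing a real server without
    them may fall back to ipvsadm's defaults.
    """
    return ["ipvsadm", "-e", "-t", vip, "-r", rip, "-x", str(upper), *extra]


def slow_start(vip, rip, start, ceiling, slice_seconds,
               run=print, sleep=time.sleep):
    """Apply each threshold in turn, waiting one timeslice in between.

    `run` receives the command as a list; the default just prints it.
    Pass e.g. subprocess.check_call to execute it for real.
    """
    for i, limit in enumerate(ramp_thresholds(start, ceiling)):
        if i:
            sleep(slice_seconds)
        run(ipvsadm_edit_cmd(vip, rip, limit))
```

For example, slow_start("10.0.0.1:80", "192.168.1.10:80", 10, 100, 5) (addresses are placeholders) would step the limit through 10, 20, 40, 80, 100 and then remove it, one step every 5 seconds. The QoS variant would have the same loop with a rate-limit command in place of ipvsadm.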
I know you could probably fake it using WLC and a mod on ldirectord but
that doesn't sound the right way to do it.
Exactly. Application level is not the right layer to do this.
BTW did I say thanks for a great product ?
What product are you referring to?
Best regards and I hope this helps,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq'|dc