
Re: [PATCH][RFC]: add threshhold per RS (dirty hospital version)

To: Julian Anastasov <ja@xxxxxx>
Subject: Re: [PATCH][RFC]: add threshhold per RS (dirty hospital version)
Cc: "lvs-users@xxxxxxxxxxxxxxxxxxxxxx" <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
From: Roberto Nibali <ratz@xxxxxx>
Date: Thu, 15 Feb 2001 10:13:37 +0100
Hi Julian,

[Back from hospital]

> > What is it good for? Yeah well, I don't know exactly, imagine yourself,
> > but first of all this is proposal and I wanted to ask for a discussion
> > about a possible inclusion of such a feature or even a derived one into
> > the main code (of course after fixing the race conditions and bugs and
> > cleaning up the code) and second, I found out with tons of talks with
> > customers that such a feature is needed, because also commercial lb
> > have this and managers always like to have a nice comparison of all
> > features to decide which product they take. Doing all this in user-
> > space is unfortunately just not atomic enough.
> 
>         Yes, grepping for the connection counters from the /proc fs
> is not good :)

Don't laugh, I've been doing this lately. On a heavily loaded server
the counters you read are off by +-150 connections. :)
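
To illustrate why a userspace poller drifts like this, here is a minimal
toy simulation, not a measurement. It assumes a constant arrival rate, no
closing connections, and a poller that only re-evaluates its decision
every few ticks; all names and numbers are hypothetical.

```python
def peak_with_polling(arrivals_per_tick: int, poll_every: int,
                      threshold: int, ticks: int) -> int:
    """Peak connection count when the limit is enforced by a periodic poller."""
    conns = 0
    accepting = True
    peak = 0
    for t in range(ticks):
        if t % poll_every == 0:
            # The decision is based on a snapshot and stays in force
            # until the next poll, however many connections arrive.
            accepting = conns < threshold
        if accepting:
            conns += arrivals_per_tick
        peak = max(peak, conns)
    return peak


def peak_with_inline_check(arrivals_per_tick: int, threshold: int,
                           ticks: int) -> int:
    """Peak connection count when every new connection is checked inline."""
    conns = 0
    for _ in range(ticks):
        # Only as many connections as still fit under the threshold get in.
        conns += min(arrivals_per_tick, max(0, threshold - conns))
    return conns
```

With 50 arrivals per tick, a poll every 3 ticks and a threshold of 100,
the polled variant overshoots the threshold while the inline check does
not.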
 
>         How we will defend against DDoS? Using the number of active

I'm using a packetfilter and in special zones a firewall after the
packetfilter ;) No seriously, I personally don't think the LVS should
take too much part in securing the realservers. It's just another part
of the firewall setup.

> or inactive connections to assign a new weight is _very_ dangerous.

I know, but OTOH, if you set a threshold and my code takes the
server out because of a well-formed DDoS attack, I think that
is still better than allowing the DDoS and maybe killing the
realserver's HTTP listener. BTW, what if you enable the defense
strategies of the load balancer? I've done some tests and I was
able to flood the realservers by sending forged SYNs and time-shifted
SYN-ACKs with the expected seq-nr. It was impossible to work on
the realservers unless, of course, I enabled TCP_SYNCOOKIES. I
then enabled my patch, and after the connections exceeded the
threshold, the kernel took the server out temporarily by setting
the weight to 0. That way the server stayed usable and I could
work on it.
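
The behaviour described above can be sketched roughly as follows. This is
a minimal userspace model, not the actual kernel patch: the RealServer
class, its field names, and the rule for restoring the weight (as soon as
the count is back at or below the threshold) are assumptions made for
illustration.

```python
from dataclasses import dataclass


@dataclass
class RealServer:
    """Hypothetical stand-in for one realserver entry."""
    name: str
    weight: int            # current scheduling weight
    threshold: int         # max concurrent connections; 0 means "not set"
    conns: int = 0         # current concurrent connections
    saved_weight: int = 0  # weight to restore after the server was taken out


def apply_threshold(rs: RealServer) -> None:
    """Set the weight to 0 above the threshold, restore it below."""
    if rs.threshold == 0:
        return  # no threshold configured: nothing to do
    if rs.conns > rs.threshold and rs.weight > 0:
        rs.saved_weight = rs.weight
        rs.weight = 0  # scheduler stops sending new connections here
    elif rs.conns <= rs.threshold and rs.weight == 0 and rs.saved_weight > 0:
        # Assumed restore rule; the original mail only says "temporarily".
        rs.weight = rs.saved_weight
        rs.saved_weight = 0


def on_connect(rs: RealServer) -> None:
    rs.conns += 1
    apply_threshold(rs)


def on_close(rs: RealServer) -> None:
    rs.conns = max(0, rs.conns - 1)
    apply_threshold(rs)
```

A server with no threshold set is never touched, which matches the point
that the feature is optional.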

> In theory, the number of connections is related to the load but
> this is true when the world is ideal. The inactive counter can
> be set with very high values when we are under attack. Even the WLC
> method loads the real servers proportionally but they are never
> excluded from operation.

True, but as I already said, I think LVS shouldn't replace a fw.
I normally have a router configured properly, then a packetfilter,
then a firewall or even another, but stateful, packetfilter. See,
the patch itself is not even mandatory. In a normal setup, my code
is not even touched (except the ``if'':).
 
>         I have some thoughts about limiting the traffic per
> connection but this idea must be analyzed. The other alternatives

Hmm, I just want to limit the amount of concurrent connections
per realserver and in the future maybe per service. This saved
me quite some lines of code in my userspace healthchecking 
daemon.

> are to use the Netfilter's "limit" target or QoS to limit the
> traffic to the real servers.

But then you have to add quite some code. The limit target has
no idea about LVS tables. How should this work, e.g. if you
would like to rate limit the number of connections to a realserver?
 
> > I already implemented a dozen such setups and they all work
> > pretty well.
> 
>         Let's analyze the problem. If we move new connections from
> "overloaded" real server and redirect them to the other real servers we
> will overload them too. IMO, the problem is that there are

No, unless you use an old machine. This may be a requirement of
an e-commerce application. They have some servers, and if the servers
are overloaded (taken out by my user-space healthchecking daemon
because the response time is too high or the application daemon is
not listening on the port anymore) they will be taken out. Now I
have found out that by setting thresholds I could reduce the
downtime of a flooded server significantly. In case all servers were
taken out or their weights were set to 0, the userspace application
sets up a temporary new realserver (either a local route or another
server) that has nothing else to do than push a static webpage
saying that the service is currently unavailable due to high
server load, a DDoS attack or whatever. Put this page behind
TUX 2.0 and try to overflow it. If you can, apply the zero-copy
patches of DaveM. No way you will find a link fast enough (88MBit/s
requests!!) to saturate the server.
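
The fallback logic can be sketched as below. This is only a minimal
illustration of the decision, assuming a hypothetical mapping from
realserver names to weights and a hypothetical fallback name; it does not
show how the temporary realserver is actually added to the LVS table.

```python
from typing import Dict, List


def pick_targets(weights: Dict[str, int], fallback: str) -> List[str]:
    """Return the realservers that may receive new connections.

    If every realserver has been taken out (weight 0), return only the
    fallback server that serves the static "service unavailable" page.
    """
    live = [name for name, weight in weights.items() if weight > 0]
    return live if live else [fallback]
```

For example, pick_targets({"rs1": 0, "rs2": 0}, "sorry") yields only the
fallback, while a single server with a non-zero weight keeps the fallback
out of rotation.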

> more connection requests than the cluster can handle. The solutions
> to try to move the traffic between the real servers can only cause
> more problems. If at the same time we set the weights to 0 this
> leads to more delay in the processing. Maybe it is more useful to
> start to reduce the weights first but this again returns us to
> the theory for the smart cluster software.

Mhh, I was thinking of this first. But do you think a cluster
software is fast (atomic) enough to react to a huge amount of
requests?
 
>         So, we can't exit from this situation without dropping
> requests. There is more traffic that can't be served from the cluster.

Obviously yes, but if you also include the practical problem of
SLAs with customers and guaranteed downtime per month, I still have
to say that for my deploition (is this the correct noun?) I fare
better with my patch in case of a DDoS and enabled LVS defense
strategies than without.

> If there is no cluster software to keep the real servers equally
> loaded, some of them can go offline too early.

The scheduler should keep them equally loaded IMO even in case
of let's say 70% forged packets. Again, if you don't like to 
set a threshold, leave it. The patch is open enough. If you like
to set it, set it, maybe set it very high. It's up to you.
[See, I'm still fighting, although your arguments are better :)]
 
>         The cluster software can take the role to monitor the load
> instead of relying on the connection counters. I agree, changing the

I think this has to be done additionally.

> weights and deciding how much traffic to drop can be explained
> with a complex formula. But I can see it only as a complete solution:

Agreed. Well, I come over to your place and we discuss this whole
thing :)

> to balance the load and to drop the exceeding requests, serve as many
> requests as possible. Even the drop_packet strategy can help here,

See, not too many tests have been made yet. Most of it by Joe and
not all of them reflect the real world.

> we can explicitly enable it specifying the proper drop rate. We don't

Again, additionally.

> need to use it only to defend the LVS box but to drop the exceeding
> traffic. But someone has to control the drop rate :) If there is no

A packetfilter, the router (most of us do have a Cisco, don't we?)

> exceeding traffic what problems we can expect? Only from the bad load
> balancing :)

Null pointer dereferences :)

Thank you Julian for having had a look at it and for the interesting
points. I'm sure in future we will find a way to make all of us happy.

Best regards,
Roberto Nibali, ratz

-- 
mailto: `echo NrOatSz@xxxxxxxxx | sed 's/[NOSPAM]//g'`

