Hello Ratz,
On Mon, 19 Feb 2001, Roberto Nibali wrote:
> Hi Julian,
>
> > I agree, some firewalling can be done before the balancer,
> > but when normal-looking traffic arrives only the balancer knows
> > about open/closed ports, related ICMP, etc. The main things
>
> Unless you put a proxying firewall ;)
>
> > you can do before the balancer are to avoid source address spoofing,
> > some bad packets, maybe some ICMP types? But the balancer can be
> > attacked even with normal traffic. The request rate can be limited
>
> Ok, Julian, let's make a real example. I'll set up an LVS
> cluster with a webserver and a normally configured firewall and you
> try to flood it in a way that the service cannot be delivered
> normally anymore :). Any ISP that wants to give me temporary access
> to its backbone?
:)
> > Yes, we need a NETLINK_LVS kernel socket or similar. I don't
> > think it will be easy for netfilter, but for LVS it can be easier. If
>
> The architecture he proposed to me was rather simple; actually he
> had the same idea. He's doing it as a module that hooks into
> conntrack. There you have quite the same template structures for
> incoming connections, just more of them ;)
Hm, not sure how the details will look.
> > we use full state (yes, Netfilter has "Real stateful connection
> > tracking") replication we can flood the internal links. There
>
> There we'd have to split the LVS code into two source trees,
> because doing connection tracking and replication is too much
> to implement in kernel space for 2.2.x.
We already have two source trees (for 2.2 and 2.4). I don't
see a very big difference in the replication requirements between 2.2
and 2.4. For Netfilter the picture can look different, and the other
thing is that I don't know how the replication is going to be
implemented there.
> > are ideas to implement the state replication only for long-living
> > connections. And yes, we can use this universal transport
> > for many things, not only for connection state replication.
>
> [OT] I proposed to him to make a general framework so that we don't
> have to reinvent the wheel. I thought that it should be possible
> to register via a device all the template tables you want to
> have synced, and the module itself would be responsible for
> creating the appropriate NETLINK packets and for starting/resetting
> the timers in kernel space. [/OT]
When LVS and Netfilter have different connection tables
we must first implement the replication separately and then see what
the common code is. Or at least keep them in sync.
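To make that [OT] idea a bit more concrete, a replication record and a
registration call could look roughly like the sketch below. This is purely
hypothetical - none of these structures or function names exist in LVS or
Netfilter today, and whether the transport is NETLINK or something else is
exactly the open question:

/* Hypothetical sketch only: a minimal connection record that a sync
 * module could send over the "universal transport", plus a
 * registration call in the spirit of "register the template tables
 * you want synced".  Names and layout are illustrative. */
#include <linux/types.h>

struct lvs_sync_conn {
	__u8	protocol;		/* IPPROTO_TCP, IPPROTO_UDP */
	__u8	state;			/* established, fin-wait, ... */
	__u16	flags;			/* fwd method, template flag */
	__u32	caddr, vaddr, daddr;	/* client, virtual, real server */
	__u16	cport, vport, dport;
	__u32	timeout;		/* remaining timeout, in seconds */
};

/* A module registers its table once; the framework builds the packets
 * and restarts the entry timers on the receiving side. */
int lvs_sync_register_table(const char *name,
		int (*pack)(void *entry, struct lvs_sync_conn *rec),
		int (*unpack)(const struct lvs_sync_conn *rec));
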
> > Yes, the user must select a backlog size value according to the
> > connection rate; we don't want dropped requests even while not under
>
> Oh, this sounds very reasonable. How and where do you think this can
> be implemented?
This can be automated (controlled from user space), but I
talked about the simple case where the user looks in /proc/net/netstat
or in the log for generated SYN cookies.
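For completeness, checking that simple case from user space is trivial -
something like the small program below, which just prints the TcpExt lines
from /proc/net/netstat where the SyncookiesSent/SyncookiesRecv/
SyncookiesFailed counters live (assuming a kernel with SYN cookies
compiled in):

/* Tiny helper: dump the TcpExt lines from /proc/net/netstat so the
 * admin can watch the Syncookies* counters while tuning the backlog. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[1024];
	FILE *f = fopen("/proc/net/netstat", "r");

	if (!f) {
		perror("/proc/net/netstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "TcpExt:", 7))
			fputs(line, stdout);
	fclose(f);
	return 0;
}
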
> > attack. Of course, the SYN cookies help, for the OSes that support
> > them. Not very much if our link is full of invalid requests, because
> > we can flood our own output pipe too. But I don't know how often DDoS
> > SYN attacks happen these days.
>
> It's an O(N^3) proportion to the popularity :) I'd love to see
> the snort logfiles of nasa.gov or nsa.com or some *.mil? Over here
> we have this stupid "big brother" stuff broadcast through some
> ACEdirector3 load balancers. Two hours after launch the RS were not
> reachable anymore.
:)
> > Agreed. drop_packet and RS limits are different things.
> > The question is how efficient the RS limits will be, but if they
> > are an option the users can select, I don't see a problem. That can
>
> Good. That's what I did, see my example when announcing it. ;)
>
> > Yes, these RS limits are a simple control we can add.
> > And of course it will be used by many users. My doubts are related
> > to the moment when all real servers will disappear and will not
> > accept new connections anymore. How fast will we increase these
>
> I will investigate this. Could you just give me some proposals
> on how to make different test setups, please? With enough time
> I'll prepare some kernels with different options enabled and will
> do some penetration tests.
Maybe testlvs is enough to hit the upper connection
limits for all real servers. And it seems that deleting and then
re-adding the real servers (some LVS users do that with user space
tools) can lead to higher active/inactive numbers, for example in
LVS-DR.
> > limits or start scheduling connections to these real servers again.
> > It again appears to be a user space problem :)
>
> Yes, this is definitely a user space problem, if you want to
> do it dynamically. I proposed the static approach. If I
> do it dynamically, we have to introduce some more setsockopts,
> don't we?
Not sure, isn't the SET_EDITDEST sockopt enough for these two
limits?
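Just to sketch what the static variant could look like from a user space
tool, assuming the two thresholds travel in the same structure as the
weight. The struct layout and the option number below are made up for
illustration; only the general setsockopt mechanism is real:

/* Illustrative only, not the real ipvsadm ABI: edit a destination and
 * hand the kernel an upper/lower connection threshold together with
 * the weight.  Struct layout and option number are placeholders. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define LVS_SO_SET_EDITDEST	0	/* placeholder option number */

struct lvs_dest_rule {			/* hypothetical layout */
	struct in_addr	vaddr, daddr;	/* virtual and real server */
	unsigned short	vport, dport;
	unsigned short	protocol;
	int		weight;
	unsigned int	u_threshold;	/* stop scheduling above this */
	unsigned int	l_threshold;	/* resume scheduling below this */
};

int main(void)
{
	struct lvs_dest_rule r;
	int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);	/* needs root */

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	memset(&r, 0, sizeof(r));
	inet_aton("192.168.1.1", &r.vaddr);
	inet_aton("10.0.0.2", &r.daddr);
	r.vport = htons(80);
	r.dport = htons(80);
	r.protocol = IPPROTO_TCP;
	r.weight = 100;
	r.u_threshold = 2000;
	r.l_threshold = 1500;
	if (setsockopt(fd, IPPROTO_IP, LVS_SO_SET_EDITDEST, &r, sizeof(r)) < 0)
		perror("setsockopt");
	return 0;
}
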
> > Yes, but drop_packet can be activated when we see a very
> > big connection rate that will occupy all the memory for connections
> > in the director. If we don't run other user space software we
> > can simply ignore the defense strategies and leave the packets
> > to be dropped after a memory allocation error.
>
> I have no experience with this approach. Do I understand you
> correctly if I say: the defense level is set by the amount
> of kmalloc'able pages in the kernel per skb?
Yes, currently LVS uses the free memory value as the key for
manipulating the defense strategies. No skbs involved, or I don't
understand the question.
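In other words, the logic is roughly the sketch below: the free page count
is compared against a threshold and the result selects how aggressive the
defense gets. This is a simplified illustration, not the literal code from
ip_vs_ctl.c:

/* Simplified sketch of memory-driven defense levels: map the amount of
 * free memory (in pages, as reported by the kernel's free page
 * counter) to a level; the strategies (drop_entry, drop_packet,
 * secure_tcp) are then switched on according to that level.  The
 * threshold and the mapping are illustrative. */
static int lvs_amemthresh = 1024;	/* "available memory" threshold */

static int update_defense_level(long free_pages)
{
	if (free_pages >= lvs_amemthresh)
		return 0;		/* plenty of memory: no defense */
	if (free_pages >= lvs_amemthresh / 2)
		return 1;		/* mild pressure */
	if (free_pages >= lvs_amemthresh / 4)
		return 2;		/* serious pressure */
	return 3;			/* almost out of memory */
}
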
> > Yes, maybe we can implement a better mechanism that will
> > allow the different options to be supported without hurting all
> > users. Who knows, maybe we can create more sockopts? But the
>
> Isn't that the case right now? The functionality provided by ipvsadm
> is very sparse.
Yes, in 2.4 there are many but in 2.2 there is only one. It
seems easier to add more sockopts in 2.4.
> > > So the distributions can handle it. It can't be our task to
> > > adjust the binary tool to every distro; it's our task to keep
> > > it clean and independent of any distro.
> >
> > This is true, but it means they have to put all features in?
>
> Not exactly: if there is a framework proposed by some distributor
> that can be of use for everyone and that doesn't affect the rest
> of the flow of LVS, it should be possible to include it.
>
> > Currently, for LVS we have the following methods in hand:
> >
> > - create new scheduler
>
> I could think of a method for "defense strategies". Do you know about
> the OOM-killer framework for kernel 2.4.x? There we have a general
> hook, like the one for creating a new scheduler, and everybody who
> thinks he has a great idea to improve the functionality can add his
> code (like e.g. Thomas Proell did with the hashing scheduler).
> A lot of people have already proposed patches for the OOM-killer,
> so I could imagine a hook in LVS where you can register your own
> defense strategy, so we can test them under different penetration
> tests.
What can I answer: we have to analyze every case separately
because it can touch many parts of the code. Not sure whether
the current structure allows such hooks.
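Still, to pin down what is being proposed: such a hook would probably
mirror the scheduler registration style, something like the hypothetical
interface below (nothing named *_defense exists in LVS today, this is only
to show the shape of the idea):

/* Hypothetical defense-strategy hook, modelled on the way schedulers
 * are registered.  A module would register itself and LVS would call
 * update() from the point where the defense level is recomputed. */
#include <linux/list.h>

struct ip_vs_defense {
	struct list_head	d_list;		/* chained in a global list */
	char			*name;		/* e.g. "drop_entry" */
	int			(*init)(void);
	void			(*cleanup)(void);
	void			(*update)(int defense_level);
};

int register_ip_vs_defense(struct ip_vs_defense *d);
int unregister_ip_vs_defense(struct ip_vs_defense *d);
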
> > In total, one method to add new separate features (maybe I'm missing
> > something). Things can become very complex if one new feature wants
> > to touch some parts of the functions in the fast path or in the user
> > space structures. What can the solution be? Putting hooks inside LVS?
>
> Yes, but I don't think Wensong likes that idea :)
Because this idea is not clear :)
> > IMO, we already must think about such needs.
>
> Yes, the project got larger and gained more reputation than some of us
> initially thought. The code is very clear and stable; it's time
> to enhance it. The only very big problem that I see is that it
> looks like we're going to have two separate code paths, one patch
> for 2.2.x kernels and one for 2.4.x.
Yes, this is the reality. We can try to keep things from
looking different to the user space.
> > No doubt, there will be some nice features that can't be
> > done in user space. And exactly these features are not used by
> > other users. The example is the cp->fwmark support proposed by
> > Henrik Nordstrom: we have a feature that is hard to call a user
> > space feature, but it touches two parts: the internal functions,
> > and it adds another hook that can delay the processing for some
>
> The problem with his patch is:
> static struct nf_hook_ops ip_vs_in_ops = {
> { NULL, NULL },
> - ip_vs_in, PF_INET, NF_IP_LOCAL_IN, 100
> + ip_vs_in, PF_INET, NF_IP_LOCAL_IN, -10
> +};
Yes, we discussed it. But this is another issue, not
related to the cp->fwmark support. Of course, the users have to
choose how they will use fwmark: for routing, for QoS, or for
fwmark-based services. What happens when you want to do QoS and use
fwmark to classify the input traffic with different fwmarks, but to
return the traffic using source routing? Currently, the LVS and the
MASQ code can't do source routing for the outgoing traffic (after
NAT in the in->out direction) - ip_forward.c:ip_forward() is not
ready for such games. LVS is ready because we know what the saddr
will be after the packet is masqueraded. This is for 2.2.
In 2.4 the cp->fwmark can help, but this is not a complete and
universal solution. Maybe there is another solution and LVS can
use the route functions to select the right outdev and gateway
after the packet is masqueraded. Maybe we can simply change
skb->dst (in 2.2) and have ip_forward call ip_send with the
right (new) device, to forward the packet to the right gw?
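A very rough 2.2-style sketch of that last idea - redo the route lookup
with the post-masquerading saddr, swap skb->dst and let ip_send() transmit
- ignoring locking, error paths and the question of where exactly it would
be called from:

/* Sketch only (2.2-era interfaces): after masquerading we know the new
 * source address, so select the output route again, replace skb->dst
 * and transmit through the newly chosen device/gateway. */
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <net/route.h>
#include <net/ip.h>

static int masq_reroute(struct sk_buff *skb, __u32 new_saddr)
{
	struct iphdr *iph = skb->nh.iph;
	struct rtable *rt;

	/* route keyed on the masqueraded saddr; no forced oif */
	if (ip_route_output(&rt, iph->daddr, new_saddr,
			    RT_TOS(iph->tos), 0))
		return -1;		/* no route: caller drops the packet */

	dst_release(skb->dst);		/* drop the route ip_forward chose */
	skb->dst = &rt->u.dst;		/* attach the new one */
	ip_send(skb);			/* fragments if needed and sends */
	return 0;
}
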
> > users. I'm not sure what will happen if we start to think in
> > "hooks" just like netfilter. If that looks good in user space,
> > I'm not sure we can tell the same for the kernel space. Any
> > ideas here, maybe as a new topic?
>
> See above about hooks for defense strategies. But you're right,
> IMHO; there is not a lot you can put into kernel space since
> most of the stuff has to be done in userspace.
>
> > No, a counter which is reset on state change. But this is
> > another issue and I haven't started to think more about such things.
> > Maybe I won't :)
>
> Isn't that the case for 2.4.x and conntrack already?
Maybe yes. But LVS has its own separate connection tracking.
> > Yes, that defense can be connection state related; LVS is a
> > connection scheduler, though, not a packet scheduler.
>
> Not yet ;)
>
> > Yes, it's a job for the agents to represent the real server load
> > in the weights.
>
> The biggest problem I see here is that maybe the user space daemons
> don't get enough scheduling time to be accurate.
That is definitely true. When the CPU(s) are busy
transferring packets the processes can be delayed. So, the director
had better not spend many cycles in user space. This is the reason I
prefer all these health checks to run on the real servers, but this
is not always good/possible.
> > Yes, wlc is not my preferred scheduler when it comes to
> > connections dealing with databases :)
>
> Tell me, which scheduler should I take? None of the existing ones
> currently gives me good enough results with persistence. We have
> to accept the fact that 3-tier application programmers don't
> know about load balancing or clustering and mostly use Java, and
> that is just about the end of trying to load balance the application
> smoothly.
WRR + load-informed cluster software. But I'm not sure
that persistence can do very bad things. Maybe yes, but only for a
small number of clients or when wlc is used (we are not talking about
the other dumb schedulers).
> > I don't think we need an intelligent scheduler if we
> > are talking about the current set of information used by the LVS
> > schedulers. Only the users know what kind of connections are
> > scheduled, and they can instruct a user space tool how to set the
> > WRR weights according to the load.
>
> See, the time period between setting the weights and the resulting
> load rebalance is just a ratio of 1:100. If you try to adjust the
> weights dynamically, you will see (for an average e-biz application
> framework with webserver and database) that you can never balance
> it right in time. The good thing is that even commercial load
> balancers can't do it.
Of course, there can be some peaks, but we are not going to
react only to the load generated by the client requests. There can
be load not related to the clients, for example a program allocating
memory or spending some CPU cycles. Such load is not visible to wlc
and dramatic things can happen. Very often some weird things happen,
for example with CGI bins that work with databases. The simple fact
of allocated memory is a bad symptom. Of course, everything is
application specific.
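As a toy illustration of the kind of agent I mean: something on the real
server that folds the load average and the free memory into a single WRR
weight for the user space tool to set on the director. The formula is
arbitrary; a real agent would use application-specific metrics:

/* Toy load agent for a real server: combine the 1-minute load average
 * and the free/total memory ratio into a weight between 1 and 100.
 * The mapping is arbitrary - the point is only that this intelligence
 * lives in user space, not in the kernel scheduler. */
#include <stdio.h>

int main(void)
{
	double load1 = 0.0, memratio;
	unsigned long memtotal = 1, memfree = 0;
	char line[128];
	int weight;
	FILE *f;

	if ((f = fopen("/proc/loadavg", "r"))) {
		fscanf(f, "%lf", &load1);
		fclose(f);
	}
	if ((f = fopen("/proc/meminfo", "r"))) {
		while (fgets(line, sizeof(line), f)) {
			sscanf(line, "MemTotal: %lu", &memtotal);
			sscanf(line, "MemFree: %lu", &memfree);
		}
		fclose(f);
	}
	memratio = (double)memfree / (double)memtotal;
	weight = (int)(100.0 * memratio / (1.0 + load1));
	if (weight < 1)
		weight = 1;
	printf("%d\n", weight);	/* fed to something like ipvsadm -e ... -w */
	return 0;
}
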
> > Yes, there are packets with sources from the private networks
> > too :)
>
> They are masqueraded and their net entity belongs to an interface
> which of course will not drop the packets :)
The problem is that they are not masqueraded :) And they can
reach us. Not every ISP drops spoofed packets at the place where they
are generated. But in most cases this is not fatal.
> > I hope other people will express their ideas about this
> > topic. Maybe I'm too pedantic in some cases :) And now I'm talking
> > without "showing the code" :) I hope things will change soon :)
>
> No, no, I also hope some other people will join the discussion, since
> we both could be completely wrong (well, in your case I doubt it ...)
Yep, maybe we are just reinventing the wheel :))
> Best regards,
> Roberto Nibali, ratz
Regards
--
Julian Anastasov <ja@xxxxxx>