Hi,
[...]
> > There we'd have to split the LVS-code into two sourcetrees.
> > Because doing connection tracking and replication is too much
> > to implement in kernel space for 2.2.x.
>
> We already have two source trees (for 2.2 and 2.4). I don't
> see a very big difference in the replication requirements between
> 2.2 and 2.4. For Netfilter the picture can look different, and the
> other thing is that I don't know how the replication is going to be
> implemented there.
Yes, but until now we had more or less the same functionality for
the 2.2.x and 2.4.x kernel series. Now I see them drifting apart.
This is not a problem, however. I actually also don't know
how Laforge is going to do it but he promised me to send me his
patches as soon as he's got something working.
> When LVS and Netfilter have different connection tables
> we must first implement the replication separately and then see
> what the common code is. Or at least sync.
Correct.
> > > Yes, the user must select a backlog size value according to the
> > > connection rate, we don't want dropped requests even while not under
> >
> > Oh, this sounds very reasonable. How and where do you think this can
> > be implemented?
>
> This can be automated (controlled from user space) but I
> talked about the simple case where the user looks in /proc/net/netstat
> or in the log for generated SYN cookies.
This may change extremely fast!
> > I will investigate this. Could you just give me some proposals
> > on how to make different test setups, please? Given enough time
> > I will prepare some kernels with different options enabled and
> > run some penetration tests.
>
> Maybe testlvs is enough to hit the upper connection
> limits for all real servers. And it seems deleting and then adding
> the real servers (some LVS users do that with user space tools) can
> lead to higher active/inactive numbers, for example, in LVS-DR.
Yep, maybe I'll get a test setup ready this weekend.
> > > limits or will start scheduling connections to these real servers.
> > > It again appears to be a user space problem :)
> >
> > Yes, this is definitely a user space problem, if you want to
> > make it dynamic. I proposed the static approach. If I
> > do it dynamically, we have to introduce some more setsockopts,
> > don't we?
>
> Not sure, isn't the SET_EDITDEST sockopt enough for these two
> limits?
Of course.
> > I have no experience with this approach. Do I understand you
> > correctly when I say: The defense level is set by the amount
> > of kmalloc'able pages in the kernel per skb?
>
> Yes, currently LVS uses the free memory value as the key for
> manipulating the defense strategies. No skbs involved or I don't
> understand the question.
Ok, I saw it in the code now, and yes, no skbs ;)
> > I could think of a method for "defense strategies". Do you know about
> > the OOM-killer framework for kernel 2.4.x? There we have a general
> > hook, like the one for registering a new scheduler, and everybody
> > who thinks he has a great idea to improve the functionality can
> > add his code (as e.g. Thomas Proell did with the hashing scheduler).
> > A lot of people already proposed some patches for the OOM-killer and
> > so I could imagine a hook into LVS where you can register your own
> > defense strategy, so we can test them under different penetration
> > tests.
>
> What can I answer? We have to analyze every case separately
> because it can touch many parts of the code. Not sure whether
> the current structure allows such hooks.
Agreed. Talking about stuff and generating ideas is simple but only
code will show if it's feasible.
> > > In total, one method to add new separate features (maybe I'm
> > > missing something). Things can get very complex if one new feature wants
> > > to touch some parts of the functions in the fast path or in the user
> > > space structures. What can be the solution? Putting hooks inside LVS?
> >
> > Yes, but I don't think Wensong likes that idea :)
>
> Because this idea is not clear :)
Maybe. But I see that the defense_level is triggered via a sysctl
and invoked in the sltimer_handler as well as the *_dropentry. If
we push those functions one level higher and introduce a metalayer
that registers the defense_strategy, which would be selectable via
sysctl and would currently contain update_defense_level, we would
have the possibility to register other defense strategies, e.g.
threshold limiting. Is this feasible? I mean, instead of calling
update_defense_level() and ip_vs_random_dropentry() in the
sltimer_handler we just call the registered
defense_strategy[sysctl_read] function. In the existing case,
defense_strategy[0]=update_defense_level(), which also merges the
ip_vs_dropentry. Do I make myself sound stupid? ;)
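Something like this, maybe (a user space sketch only, just to show
the shape; all the names here are invented, not existing LVS
symbols):

```c
/* Sketch of the proposed metalayer: sltimer_handler no longer calls
 * update_defense_level()/ip_vs_random_dropentry() directly but invokes
 * whatever strategy the sysctl currently selects. */

typedef void (*defense_strategy_t)(void);

static int last_strategy = -1;  /* just for demonstration */

/* strategy 0: the existing behaviour -- would call
 * update_defense_level() and, depending on the resulting level,
 * ip_vs_random_dropentry() */
static void classic_defense(void) { last_strategy = 0; }

/* strategy 1: e.g. threshold limiting -- would compare per-server
 * connection counters against configured limits */
static void threshold_defense(void) { last_strategy = 1; }

static defense_strategy_t defense_strategy[] = {
    classic_defense,
    threshold_defense,
};

/* written via sysctl, read in the timer handler */
static int sysctl_defense_id = 0;

static void sltimer_handler_tick(void)
{
    int n = sizeof(defense_strategy) / sizeof(defense_strategy[0]);
    int id = sysctl_defense_id;

    if (id >= 0 && id < n)      /* ignore bogus sysctl values */
        defense_strategy[id]();
}
```

Registering a new strategy would then just mean appending to the
table and picking its index via the sysctl.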
> > Yes, the project got larger and gained more reputation than some
> > of us initially thought. The code is very clear and stable; it's
> > time to enhance it. The only very big problem that I see is that
> > it looks like we're going to have two separate code paths: one
> > patch for 2.2.x kernels and one for 2.4.x.
>
> Yes, this is the reality. We can try to keep things from
> looking different to user space.
This would be a pain in the ass if we had two ipvsadm. IMHO the
userspace tools should recognize at compile time which kernel they
are working with and enable the feature set accordingly. This will
of course bloat them up in the future, the more feature differences
we get between the 2.2.x and 2.4.x series.
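E.g. something along these lines in the ipvsadm sources (sketch
only; in reality you would include <linux/version.h> instead of the
fallback defines, which are only here to make the snippet
self-contained and pretend we build against a 2.4.0 tree):

```c
/* One ipvsadm source tree that adapts at compile time to the kernel
 * headers it is built against. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))
#endif
#ifndef LINUX_VERSION_CODE
#define LINUX_VERSION_CODE KERNEL_VERSION(2,4,0)  /* pretend 2.4.0 */
#endif

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,4,0)
#define IPVSADM_HAVE_NETFILTER 1   /* 2.4: netfilter-based code paths */
#else
#define IPVSADM_HAVE_NETFILTER 0   /* 2.2: ipchains-era masq code */
#endif

static int have_netfilter(void) { return IPVSADM_HAVE_NETFILTER; }
```

The 2.2-only and 2.4-only sockopts could then be fenced off with
the same macro.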
> Yes, we discussed it. But this is another issue, not
> related to the cp->fwmark support. Of course, the users have to
> choose how they will use fwmark: for routing, for QoS, for fwmark
> based services. What happens when you want to do QoS and use fwmark
> to classify the input traffic with different fwmarks but to
> return the traffic using source routing. Currently, the LVS and the
> MASQ code can't do source routing for the outgoing traffic (after
> NAT in the in->out direction) - ip_forward.c:ip_forward() is not
> ready for such games. LVS is ready because we know what will be
> the saddr after the packet is masqueraded. This is for 2.2.
I believe you, although I don't understand it :)
> In 2.4 the cp->fwmark can help but this is not a complete and
> universal solution. Maybe there is another solution and LVS can
> use the route functions to select the right outdev and gateway
> after the packet is masqueraded. Maybe we can simply change
> skb->dst (in 2.2) and ip_forward to call ip_send with the
> right (new) device and to forward the packet to the right gw?
Could you point me to a sketch where I could see what the control
path for a packet looks like in kernel 2.4? I mean something like
what I would draw for 2.2.x kernels:
----------------------------------------------------------------
| ACCEPT/ lo interface |
v REDIRECT _______ |
--> C --> S --> ______ --> D --> ~~~~~~~~ -->|forward|----> _______ -->
h a |input | e {Routing } |Chain | |output |ACCEPT
e n |Chain | m {Decision} |_______| --->|Chain |
c i |______| a ~~~~~~~~ | | ->|_______|
k t | s | | | | |
s y | q | v | | |
u | v e v DENY/ | | v
m | DENY/ r Local Process REJECT | | DENY/
| v REJECT a | | | REJECT
| DENY d --------------------- |
v e -----------------------------
DENY
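From the netfilter docs, the 2.4 picture should look roughly like
this (the five netfilter hooks; a rough sketch from memory, details
may be off):

```text
 --->[PRE_ROUTING]--->[route]--->[FORWARD]--->[POST_ROUTING]--->
                         |                         ^
                         v                         |
                    [LOCAL_IN]                  [route]
                         |                         ^
                         v                         |
                   Local Process--->[LOCAL_OUT]----+
```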
> > The biggest problem I see here is that maybe the user space daemons
> > don't get enough scheduling time to be accurate.
>
> That is definitely true. When the CPU(s) are busy
> transferring packets the processes can be delayed. So, the director
> better not spend many cycles in user space. This is the reason I
> prefer all these health checks to run in the real servers but this
> is not always good/possible.
No, considering the fact that not all RSs run Linux. We would
need to port the health checks to every possible RS platform.
> > Tell me, which scheduler should I take? None of the existing ones
> > currently gives me good enough results with persistence. We have
> > to accept the fact that 3-tier application programmers don't
> > know about load balancing or clustering, mostly using Java, and
> > this is just about the end of trying to load balance the
> > application smoothly.
>
> WRR + load-informed cluster software. But I'm not sure
> that persistence can do very bad things. Maybe yes, but
> for a small number of clients or when wlc is used (we don't talk
> about the other dumb schedulers).
I currently get some values via a daemon coded in Perl on the RS,
started via xinetd. The LB connects to the healthcheck port and
gets some prepared results. It then puts this stuff into a db and
starts calculating the next steps to reconfigure the LVS cluster to
smooth out the imbalance. The longer you let it run, the more data
you get and the fewer adjustments you have to make. I reckon some
guy who showed up on this list once had an idea in the direction of
fuzzy logic. Hey Julian, maybe we should accept the fact that the
wlc scheduler isn't a very advanced one either:
loh = atomic_read(&least->activeconns)*50+atomic_read(&least->inactconns);
What do you think would change if we made this 50 dynamic?
> Of course, there can be some peaks but we are not going to
> react only on the load generated from the client requests. There can
Of course not; this is just an additional factor in calculating the
next steps when choosing the RS for delivery.
> be a load not related to the clients, for example, program allocating
> memory or spending some CPU cycles. Such load is not visible to wlc
> and dramatic things can happen. Very often some weird things can
> happen, for example, with cgi bins that work with databases. The
> simple fact of allocated memory is a bad symptom. Of course,
> everything is application specific.
That's the challenge.
> > > Yes, there are packets with sources from the private networks
> > > too :)
> >
> > They are masqueraded and their netentity belongs to an interface
> > which of course will not drop the packets :)
>
> The problem is that they are not masqueraded :) And they can
> reach us. Not every ISP drops spoofed packets at the place they are
> generated. But in most of the cases this is not fatal.
Broken network design, IMO.
> > > I hope other people will express their ideas about this
> > > topic. May be I'm too pedantic in some cases :) And now I'm talking
> > > without "showing the code" :) I hope the things will change soon :)
> >
> > No, no, I also hope some other people join the discussion since
> > we both could be completely wrong (well, in your case I doubt ...)
>
> Yep, maybe we are just reinventing the wheel :))
We'll see ...
Later,
Roberto Nibali, ratz
--
mailto: `echo NrOatSz@xxxxxxxxx | sed 's/[NOSPAM]//g'`