Hello Ratz,
On Thu, 22 Feb 2001, Roberto Nibali wrote:
> > > > In total, one method to add new, separate features (maybe I'm
> > > > missing something). Things can get very complex if a new
> > > > feature wants to touch some parts of the functions in the fast
> > > > path or in the user-space structures. What can the solution be?
> > > > Putting hooks inside LVS?
> > >
> > > Yes, but I don't think Wensong likes that idea :)
> >
> > Because this idea is not clear :)
>
> Maybe. But I see that the defense_level is triggered via a sysctl
> and invoked in the sltimer_handler as well as in the *_dropentry
> functions. If we pushed those functions one level higher and
> introduced a metalayer that registers the defense strategy, which
> would be selectable via sysctl and would currently contain
> update_defense_level, we would have the possibility to register
> other defense strategies, e.g. a limiting threshold. Is this
> feasible? I mean, instead of calling update_defense_level() and
> ip_vs_random_dropentry() in the sltimer_handler we just call the
> registered defense_strategy[sysctl_read] function. In the existing
> case defense_strategy[0] = update_defense_level(), which also merges
> the ip_vs_dropentry. Do I make myself sound stupid? ;)
The different strategies work in different places and it is
difficult to use one hook. The current implementation allows them to
work together. But maybe there is another solution, considering how
LVS is called: to drop packets or to drop entries. There are not many
places for such hooks, so maybe something can be done. But first
let's see what other kinds of defense strategies will come.
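Just to make sure I understand, you mean something like this? (Only a
rough sketch, not real code; defense_classic, defense_limit_threshold
and the sysctl variable are invented names, and the real handlers of
course take their usual arguments.)

typedef void (*defense_strategy_t)(void);

/* what sltimer_handler effectively does today */
extern void update_defense_level(void);
extern void ip_vs_random_dropentry(void);

static void defense_classic(void)
{
	update_defense_level();
	ip_vs_random_dropentry();
}

/* a hypothetical new strategy, e.g. a limiting threshold */
static void defense_limit_threshold(void)
{
	/* ... */
}

/* table indexed by a sysctl-controlled selector */
static defense_strategy_t defense_strategy[] = {
	defense_classic,
	defense_limit_threshold,
};

static int sysctl_defense_strategy = 0;	/* would live in /proc */

/* stands in for the real sltimer_handler */
static void sltimer_handler_sketch(void)
{
	defense_strategy[sysctl_defense_strategy]();
}

But as said above, the strategies hook in different places, so one
table entry may not cover everything.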
> > > Yes, the project has grown larger and gained more reputation
> > > than some of us initially thought. The code is very clear and
> > > stable; it's time to enhance it. The only very big problem that
> > > I see is that it looks like we're going to have two separate
> > > code paths: one patch for 2.2.x kernels and one for 2.4.x.
> >
> > Yes, this is the reality. We can try to keep things from
> > looking different for the user space.
>
> This would be a pain in the ass if we had two ipvsadm binaries.
> IMHO the userspace tools should recognize (at compile time) which
> kernel they are working with and enable the feature set accordingly.
> This will of course bloat them up in the future, the more feature
> differences we get between the 2.2.x and 2.4.x series.
Not possible, the sockopts are different in 2.4.
> Could you point me to a sketch where I could see what the control
> path for a packet looks like in kernel 2.4? I mean something like
> what I would draw for 2.2.x kernels:
>
> ----------------------------------------------------------------
> | ACCEPT/ lo interface |
> v REDIRECT _______ |
> --> C --> S --> ______ --> D --> ~~~~~~~~ -->|forward|----> _______ -->
> h a |input | e {Routing } |Chain | |output |ACCEPT
> e n |Chain | m {Decision} |_______| --->|Chain |
> c i |______| a ~~~~~~~~ | | ->|_______|
> k t | s | | | | |
> s y | q | v | | |
> u | v e v DENY/ | | v
> m | DENY/ r Local Process REJECT | | DENY/
> | v REJECT a | | | REJECT
> | DENY d --------------------- |
> v e -----------------------------
> DENY
Here is some info I maintain (it may not be up to date; the new ICMP
hooks are missing). Look for "LVS" to see where LVS is placed.
The Netfilter hooks:
Priorities:
NF_IP_PRI_FIRST = INT_MIN,
NF_IP_PRI_CONNTRACK = -200,
NF_IP_PRI_MANGLE = -150,
NF_IP_PRI_NAT_DST = -100,
NF_IP_PRI_FILTER = 0,
NF_IP_PRI_NAT_SRC = 100,
NF_IP_PRI_LAST = INT_MAX,
PRE_ROUTING (ip_input.c:ip_rcv):
CONNTRACK=-200, ip_conntrack_core.c:ip_conntrack_in
MANGLE=-150, iptable_mangle.c:ipt_hook
NAT_DST=-100, ip_nat_standalone.c:ip_nat_fn
FILTER=0, ip_fw_compat.c:fw_in, defrag, firewall, demasq, redirect
FILTER+1=1, net/sched/sch_ingress.c:ing_hook
LOCAL_IN (ip_input.c:ip_local_deliver):
FILTER=0, iptable_filter.c:ipt_hook
LVS=100, ip_vs_in
LAST-1, ip_fw_compat.c:fw_confirm
CONNTRACK=LAST-1, ip_conntrack_standalone.c:ip_confirm
FORWARD (ip_forward.c:ip_forward):
FILTER=0, iptable_filter.c:ipt_hook
FILTER=0, ip_fw_compat.c:fw_in, firewall, LVS:check_for_ip_vs_out,
masquerade
LVS=100, ip_vs_out
LOCAL_OUT (ip_output.c):
CONNTRACK=-200, ip_conntrack_standalone.c:ip_conntrack_local
MANGLE=-150, iptable_mangle.c:ipt_local_out_hook
NAT_DST=-100, ip_nat_standalone.c:ip_nat_local_fn
FILTER=0, iptable_filter.c:ipt_local_out_hook
POST_ROUTING (ip_output.c:ip_finish_output):
FILTER=0, ip_fw_compat.c:fw_in, firewall, unredirect,
mangle ICMP replies
LVS=NAT_SRC-1, ip_vs_post_routing
NAT_SRC=100, ip_nat_standalone.c:ip_nat_out
CONNTRACK=LAST, ip_conntrack_standalone.c:ip_refrag
CONNTRACK:
PRE_ROUTING, LOCAL_IN, LOCAL_OUT, POST_ROUTING
FILTER:
LOCAL_IN, FORWARD, LOCAL_OUT
MANGLE:
PRE_ROUTING, LOCAL_OUT
NAT:
PRE_ROUTING, LOCAL_OUT, POST_ROUTING
Running variants:
1. Only lvs - the fastest
2. lvs + ipfw NAT
3. lvs + iptables NAT
Where is LVS placed:
LOCAL_IN:100 ip_vs_in
FORWARD:100 ip_vs_out
POST_ROUTING:NF_IP_PRI_NAT_SRC-1 ip_vs_post_routing
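For reference, hooking at one of these points in 2.4 looks roughly
like this (a sketch in the style of the LVS hook registration, not a
copy of ip_vs_core.c; my_local_in_hook and my_local_in_ops are
made-up names):

#include <linux/module.h>
#include <linux/socket.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

static unsigned int my_local_in_hook(unsigned int hooknum,
				     struct sk_buff **skb_p,
				     const struct net_device *in,
				     const struct net_device *out,
				     int (*okfn)(struct sk_buff *))
{
	/* look at *skb_p here; this is the point where ip_vs_in
	   handles the out->in packets */
	return NF_ACCEPT;
}

/* fields: list, hook, pf, hooknum, priority -
   LVS hooks ip_vs_in at LOCAL_IN with priority 100 */
static struct nf_hook_ops my_local_in_ops = {
	{ NULL, NULL }, my_local_in_hook, PF_INET, NF_IP_LOCAL_IN, 100
};

static int __init my_init(void)
{
	return nf_register_hook(&my_local_in_ops);
}

static void __exit my_exit(void)
{
	nf_unregister_hook(&my_local_in_ops);
}

module_init(my_init);
module_exit(my_exit);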
The chains:
The out->in LVS packets (for any forwarding method) walk:
pre_routing -> LOCAL_IN -> ip_route_output or dst cache -> POST_ROUTING
LOCAL_IN
ip_vs_in -> ip_route_output/dst cache
-> set skb->nfmark with special value
-> ip_send -> POST_ROUTING
POST_ROUTING
ip_vs_post_routing
- check skb->nfmark and exit from the
chain
The in->out LVS packets (for LVS/NAT) walk:
pre_routing -> FORWARD -> POST_ROUTING
FORWARD
ip_vs_out -> NAT -> NF_ACCEPT
POST_ROUTING
ip_vs_post_routing
- check skb->nfmark and exit from the
chain
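The nfmark check in ip_vs_post_routing works roughly like this (again
only a sketch; IPVS_MARK and my_post_routing are made-up names, and
the real code may tag the skb differently):

#include <linux/skbuff.h>
#include <linux/netfilter.h>

#define IPVS_MARK 0x12345678	/* made-up "special value" for nfmark */

static unsigned int my_post_routing(unsigned int hooknum,
				    struct sk_buff **skb_p,
				    const struct net_device *in,
				    const struct net_device *out,
				    int (*okfn)(struct sk_buff *))
{
	struct sk_buff *skb = *skb_p;

	if (skb->nfmark != IPVS_MARK)
		return NF_ACCEPT;	/* not an LVS packet, normal path */

	/* Already handled by LVS: pass it straight to the output
	   routine and report it as stolen, so the later hooks in
	   POST_ROUTING (ip_nat_out, ip_refrag) never see it. */
	(*okfn)(skb);
	return NF_STOLEN;
}

The hook itself is registered at POST_ROUTING with priority
NAT_SRC-1, as in the table above, so it runs just before ip_nat_out.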
I hope there is a nice ASCII diagram in the netfilter docs. But the
above info is probably more useful if you already know what each hook
means.
> > > The biggest problem I see here is that maybe the user space daemons
> > > don't get enough scheduling time to be accurate enough.
> >
> > That is definitely true. When the CPU(s) are busy
> > transferring packets, the processes can be delayed. So the
> > director had better not spend many cycles in user space. This is
> > the reason I prefer all these health checks to run on the real
> > servers, but this is not always good/possible.
>
> No, considering the fact that not all RS are running Linux. We would
> need to port the healthchecks to every possible RS architecture.
Yes, this is a drawback.
> > > Tell me, which scheduler should I take? None of the existing
> > > ones currently gives me good enough results with persistence.
> > > We have to accept the fact that 3-tier application programmers
> > > don't know about load balancing or clustering and mostly use
> > > Java, and that is just about the end of trying to load balance
> > > the application smoothly.
> >
> > WRR + load-informed cluster software. But I'm not sure that
> > persistence can do very bad things. Maybe yes, but only for a
> > small number of clients or when wlc is used (we don't talk about
> > the other dumb schedulers).
>
> I currently get some values via a daemon coded in Perl on the RS,
> started via xinetd. The LB connects to the healthcheck port and
> gets some prepared results. It then puts this stuff into a db and
> starts calculating the next steps to reconfigure the LVS cluster to
> smooth out the imbalance. The longer you let it run, the more data
> you get and the fewer adjustments you have to make. I reckon some
> guy who showed up on this list once had this idea in the direction
> of fuzzy logic. Hey Julian, maybe we should accept the fact that
> the wlc scheduler also isn't a very advanced one:
> loh = atomic_read(&least->activeconns)*50+atomic_read(&least->inactconns);
> What would you think would change if we made this 50 dynamic?
Not sure :) I don't have results from experiments with wlc :)
You can put it in /proc and make different experiments, for example :)
But be warned, ip_vs_wlc can be a module; check how the lblc*
schedulers register their /proc vars.
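A sketch of what that could look like, following the lblc-style
sysctl registration (all names and ctl_name numbers below are
invented; only the mechanism is the point):

#include <linux/module.h>
#include <linux/sysctl.h>

/* the hard-coded 50, now a variable visible under /proc/sys */
static int sysctl_ip_vs_wlc_active_weight = 50;

static struct ctl_table_header *wlc_sysctl_header;

/* ctl_name values 90/91 are placeholders, not allocated numbers */
static ctl_table wlc_vars_table[] = {
	{90, "wlc_active_weight", &sysctl_ip_vs_wlc_active_weight,
	 sizeof(int), 0644, NULL, &proc_dointvec},
	{0}
};
static ctl_table wlc_vs_table[] = {
	{91, "vs", NULL, 0, 0555, wlc_vars_table},
	{0}
};
static ctl_table wlc_ipv4_table[] = {
	{NET_IPV4, "ipv4", NULL, 0, 0555, wlc_vs_table},
	{0}
};
static ctl_table wlc_root_table[] = {
	{CTL_NET, "net", NULL, 0, 0555, wlc_ipv4_table},
	{0}
};

/* In the scheduler, instead of the fixed 50:
 *
 *	loh = atomic_read(&least->activeconns) *
 *		sysctl_ip_vs_wlc_active_weight +
 *		atomic_read(&least->inactconns);
 */

static int __init wlc_tune_init(void)
{
	wlc_sysctl_header = register_sysctl_table(wlc_root_table, 0);
	return 0;
}

static void __exit wlc_tune_exit(void)
{
	if (wlc_sysctl_header)
		unregister_sysctl_table(wlc_sysctl_header);
}

module_init(wlc_tune_init);
module_exit(wlc_tune_exit);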
> Later,
> Roberto Nibali, ratz
Regards
--
Julian Anastasov <ja@xxxxxx>