Hi Henrik,
I'm not sure if I can, but if we're lucky, our management will agree that
we need some resources from our software developers - but I can't
promise.
No problem, Julian is a monster coder, he can handle it :)
Timeframe (backwards):
We want to have a system in production by the end of the year, and that
includes testing.
Regression tests or just functionality tests?
In May we want to know in detail what we will do.
Feasible.
By the end of this month (maybe first week in March) we have to make the
decision: LVS or not, cluster or not, Linux or not ... for that we must
have some kind of plan, which looks like it could work :-)
I hope so.
I'm not sure about the shell account, but I'd rather say no - sorry,
this is not in my hands... maybe we can find another solution like a
defined interface.
No problem, Julian has a test setup. This should be enough. Most of the
time he only needs to compile the code (and it works!) as a test.
> Oh, I'm not so sure if we can do state table synchronisation with
> ESP/AH hashed table entries, unless we find out how the timeout and
> such work. Maybe it's enough to synchronise the SPD pool.
Sorry - what is the SPD pool?
It's the Security Policy Database (SPD) on each side of the IPsec
endpoint. It is part of KLIPS and holds the information needed to build
the SAs.
Let's say, many customers will use it :-)
Ok.
Ok, it's a little bit more than that, maybe the term Traffic Controller
would be more accurate - we know what tunnels are possible
and we know what tunnels are active. So we *have* a reflection of the
active state, and also have an idea of how much traffic will come from
which IPs.
I still wonder how this is possible. How can you know how much traffic
you get from which IP in advance? Or is this yet another English
understanding problem of mine?
The important point is that when a tunnel from IP 192.168.123.234 comes
up, we already know it will tunnel e.g. 143.193.40.32/28 AND, what is more
important, if a packet for IP 143.193.40.36 comes from the 'target
subnets', we already know it will go through an IPSec tunnel to
192.168.123.234.
And more than that, 192.168.123.234 has to tell us before it can send
anything, so we know 'the tunnel will be built in a few seconds'.
That needs some damn intelligent software which needs to be more HA than
the rest of the framework, because if your system fails, no customer will
be able to connect (even though the tunnels might be working), right?
How many connections can persistency handle? 1.000 'IPSec terms' is a
starting number :-) Assume an average number of hosts behind an 'IPSec
term' to be about 32. So we have 32.000 hosts. This is to be multiplied
by the average number of connections/host (for the Director at the
'target subnets' side) ...
This would need, let's calculate ...
32*1000*128 Bytes ~ 4 MBytes. Hmm, do you think you can afford that much
RAM for your boxes? :)
That's what you would need on the director to sustain the template
entries in the hash table for LVS_DR. Since those are not per second,
this is pretty much nothing. You can take an old P166 to do that.
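The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the 128 bytes per entry, the 1000 IPsec endpoints and the 32 hosts per endpoint are the figures quoted in this thread, not measured values, and the function name is made up for illustration.

```python
# Rough memory estimate for persistence template entries on the director.
# All three inputs are the figures quoted in this thread, not measurements.
ENTRY_BYTES = 128       # assumed size of one template entry
IPSEC_TERMS = 1000      # number of IPsec endpoints ("IPSec terms")
HOSTS_PER_TERM = 32     # average number of hosts behind each endpoint


def template_memory_bytes(terms, hosts_per_term, entry_bytes=ENTRY_BYTES):
    """Return the bytes needed if every host occupies one template entry."""
    return terms * hosts_per_term * entry_bytes


total = template_memory_bytes(IPSEC_TERMS, HOSTS_PER_TERM)
print(f"{total} bytes = {total / 1024**2:.2f} MiB")
```

Multiplying the result by an assumed average number of connections per host gives the figure for the director on the 'target subnets' side.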
It's the same problem, but from the other direction:

target-host1---|               |--IPSec----- many-host1,2,3...
               |               |             decrypt.
target-host2---|     IPSec     |--IPSec----- many-host46,47,48...
               |-- encryption--|             decrypt.
...                  ...                     ...
               |               |
target-hostm---|               |--IPSec----- many-hostn,n+1,n+2...
                                             decrypt
The first problem is right to left, the second one is left to right.
Honestly I don't see a problem. It's just a routing issue. host1
connects via IPsecGW_a1 to VIP which distributes it to one IPsecGW_bX
which will get the routing information in the ESP packet after
decrypting it and send it there. Of course IPsecGW_bX knows where to
route the return packet after encryption. Maybe I still don't get it. So
let's find out. Could you tell me if your setup would work without a
load balancer just by doing 1:1 IPsecGW_aX <-> IPsecGW_bX assignment?
Because if this works, then it works with LVS_DR.
> Every IPsec term. needs to be able to address all subnets or it defeats
> the purpose of load balancing!
I assume IPSec term is IPSec in the RS.
Yes :) [IPsec term. == IPsec endpoint == IPsecGW_bX == RS]
Then the above statement is true, as long as no tunnel is set up. When
the tunnel for a subnet is up, we have no choice. Tunnels will stay for
hours, days, weeks ...
Julian, do you copy? WEEKS!! This is going to use more RAM eventually.
What is your rekeying time policy? Do you provide an SLA-like agreement
where you say that after 2 weeks of no byte stream we will close the tunnel?
> Well, the problem with this is, that in IPsec tunneling mode, you simply
> don't have this information about the to-be-routed subnet in the non
> decrypted part of the IP packet. Read my email exchange with Julian. In
> tunneling mode you only have the daddr which is equal to the VIP and
> only the IPsec term. after deciphering knows where the packet needs to
> be routed to.
WE know it (see NMS section above).
That's sweet, but how does it help you unless you can influence the RS?
Unfortunately, in a cluster you never know which RS will get which
IPsec connection. Or did you want to statically assign a dedicated RS to
every possible IPsec connection pool from your CIP db? This is possible
if you really want to do this. I mean you could set up the whole beast
as follows:
o Your NMS has a DB which "knows" about incoming CIP1->dst_net1 and
  thus sets up the director with a fwmark1 (CIP1/32->VIP/32), and if it
  notices that CIP2 is requesting an IPsec connection, you set up a new
  entry in LVS using fwmark2 (CIP2/32->VIP/32).
o Your RS (IPsec endpoints) still need to be able to route to all
  possible net entities addressed by your customers.
o This topology of mine tends to produce horrible load imbalance.
o The probability that I still haven't understood the influence of your
  NMS is still not negligible.
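To make the fwmark idea from the list above a bit more concrete, here is a small Python sketch that only builds the command lines for such a setup, without running anything. It assumes a director with iptables and ipvsadm available, one fwmark per CIP, and a static CIP-to-RS assignment coming from the NMS DB. The function name, the rr scheduler choice and all addresses are hypothetical, picked for illustration only.

```python
def fwmark_commands(cip_to_rs, vip):
    """Build (not run) one fwmark rule set per customer IP (CIP).

    cip_to_rs maps each CIP to the RS it is statically pinned to.
    Marks are handed out starting at 1, in sorted CIP order.
    """
    commands = []
    for mark, (cip, rs) in enumerate(sorted(cip_to_rs.items()), start=1):
        # Mark packets from this CIP to the VIP on the director.
        commands.append(
            f"iptables -t mangle -A PREROUTING -s {cip}/32 -d {vip}/32 "
            f"-j MARK --set-mark {mark}"
        )
        # One fwmark-based virtual service per mark ...
        commands.append(f"ipvsadm -A -f {mark} -s rr")
        # ... with a single real server, reached via direct routing (LVS_DR).
        commands.append(f"ipvsadm -a -f {mark} -r {rs} -g")
    return commands


# Hypothetical addresses, for illustration only.
example = fwmark_commands(
    {"192.0.2.10": "10.0.0.11", "192.0.2.20": "10.0.0.12"},
    "198.51.100.1",
)
for line in example:
    print(line)
```

With one RS per fwmark the scheduler has nothing to choose from, which is exactly why this layout tends to balance load badly.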
Sounds interesting - where can I read more about it without
understanding the code itself - did I miss some piece of documentation?
If not, where should I start reading the code?
Oh, you can read the documentation about persistency. This should keep you
busy a while:
http://www.linux-vs.org/Joseph.Mack/HOWTO/LVS-HOWTO-8.html
http://www.linuxvirtualserver.org/~julian/LVS.txt
I could not follow your conversation in detail. Maybe I'll find the time
to read it again this week, after looking into the background a bit more.
Good.
>>Back to these mysterious hashes: if we make them static, we waste the
>>opportunity for the 'node scheduling' - so we must get to something
>>semi-dynamic :-)
>>
>
> You lost me here.
Oops, maybe it's too far away from LVS. If it's still important, I'll
try to describe it again...
Ok.
>>We think of a big hash, that stores the IP addresses and the node for
>>that IP-address. So if you have an IP, a quick search for the
>>corresponding node is possible.
>>
>
> What do you refer to as node?
Sorry! Node = RealServer
I should use your terminology!
Hmm, it really sounds like you want to use fwmark and persistency like
hell :) Tell me, what do you refer to with IP addresses? The CIP or the
destination networks?
>>A look at the sh and dh algorithms was quite helpful for me. But there
>>the hash is too small, and it's static.
>>
>
> What do you mean with static hash?
I have the impression that I did not understand sh / dh very well.
But the only description was the following:
"The destination hashing scheduling algorithm assigns network
connections to the servers through looking up a statically assigned hash
table by their destination IP addresses"
Maybe I should have used statically assigned.
Ahh, ok. But if I get your idea the dh scheduler would not help either
because you don't get to choose which hash corresponds to which RS. And
besides, I think the table would be big enough. It is a hash table which
has a doubly-linked list for multiple equal hash matches. Maybe you want
something like the following:
H1 ----> RS1 (members: CIP1 <--> CIP2 <--> CIP6)
H2 ----> RS2 (members: CIP4 <--> CIP3)
H3 ----> RS3 (members: CIP5)
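A minimal Python sketch of such a table, assuming each hash bucket is bound to an RS the first time a CIP lands in it and that the CIPs are chained per bucket. This is not the kernel's dh/sh code: the class name, the table size, the multiplicative hash and the "fewest members" rule are all assumptions made for illustration.

```python
import ipaddress
from collections import defaultdict


class BucketTable:
    """Hash buckets bound to real servers, each with a chain of member CIPs."""

    def __init__(self, real_servers, size=256):
        self.size = size
        self.real_servers = list(real_servers)
        self.bucket_rs = {}                 # bucket index -> RS
        self.members = defaultdict(list)    # bucket index -> chained CIPs

    def _index(self, cip):
        # Generic multiplicative hash on the 32-bit address (an assumption).
        addr = int(ipaddress.IPv4Address(cip))
        return ((addr * 2654435761) & 0xFFFFFFFF) % self.size

    def assign(self, cip):
        """Add a CIP and return the RS its bucket is bound to."""
        idx = self._index(cip)
        if idx not in self.bucket_rs:
            # Bind a new bucket to the RS that currently holds the fewest CIPs.
            load = {rs: 0 for rs in self.real_servers}
            for bucket, rs in self.bucket_rs.items():
                load[rs] += len(self.members[bucket])
            self.bucket_rs[idx] = min(self.real_servers, key=lambda rs: load[rs])
        if cip not in self.members[idx]:
            self.members[idx].append(cip)
        return self.bucket_rs[idx]

    def lookup(self, cip):
        """Return the RS for a known CIP, or None if it was never assigned."""
        idx = self._index(cip)
        if cip in self.members.get(idx, []):
            return self.bucket_rs[idx]
        return None


table = BucketTable(["RS1", "RS2", "RS3"])
print(table.assign("192.0.2.1"), table.assign("192.0.2.2"))
```

The point of binding buckets at assignment time is that the CIP-to-RS mapping is then chosen, rather than being a fixed function of the hash as in a statically assigned table.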
hashing: a method for directly referencing records in a table by doing
arithmetic transformations on keys into table addresses. [Robert
Sedgewick, Algorithms]
:) Yep, always got my copy handy.
But I don't think, that answers your question :-)
Success! We 100% agree on something.
You might have wondered about the 50MB. I intended to have a hash to
search for IP Addresses to get back a node number.
In the near future we will have 10000 (10K) IPSec tunnels. Behind each
tunnel we'll have a subnet with 32 as an average number of IP addresses.
So we have 320K IP addresses. Somewhere (don't ask me where, I think it
was for the persistency table) I read that an entry in a hash needs
128 bytes:
320000 x 128 = 40960000 (bytes)
40960000 / 1024 = 40000 (kilobytes)
40000 / 1024 = 39.0625 (megabytes)
This would be the memory for the entries, without the hash itself, which
should have twice the number of entries as size for a fast search.
The dh or sh schedulers only have 128 entries in the hash table which
are unique.
Where is my mistake?
The hash is included in the 128 indirectly over a pointer to the
ip_vs_scheduler struct which calls the appropriate scheduler. I think I
now understand the table you drew in the other email. You can do such a
setup with fwmarks and persistency, but this will not help, because the
director will never find out, based on the encrypted packet it has to
load balance, where the final packet will be routed to, and thus it will
never match one of your rules. So no, this doesn't work, I'm afraid. You
need to let the director choose a RS for you and then make sure every RS
can route to every network, unless!!! you really know the CIP for every
single customer and can assure it will stay the same or at least
consistent with your NMS DB. Then it will work again, however. It would
be a pleasure for me to give you the idea for that if I one day figure
out the connection between your NMS, your customers, the RS and the load
balancer.
So scheduling in userspace would be an alternative?
Could be but I still think it's just me that hasn't completely
understood what you want.
WOW! that's encouraging!
The routing will definitely be possible and thus the load balancing with
LVS_DR. For NAT we're not so sure yet.
Maybe I don't understand it myself, but some things correspond to our
product/business case/whatever you call it. I hope I will be able to
tell more about it, but at the moment I can't - at least not here in the
list ...
I understand.
Thanks again to everyone involved...
No problem, not a lot has been done yet, except that Julian and I have
read some RFCs ...
Best regards,
Roberto Nibali, ratz