Re: FreeS/WAN Cluster - our approach

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: FreeS/WAN Cluster - our approach
From: Roberto Nibali <ratz@xxxxxxxxxxxx>
Date: Tue, 19 Feb 2002 00:55:51 +0100
Hi Henrik,

I'm not sure if I can, but if we're lucky our management will agree that
it's necessary to get some resources from our software developers - but
I can't promise anything.

No problem, Julian is a monster coder, he can handle it :)

Timeframe (backwards):
We want to have a system in production by the end of the year, and that
includes testing.

Regression tests or just functionality tests?

In May we want to know in detail what we will do.

Feasible.

By the end of this month (maybe the first week of March) we have to make
the decision: LVS or not, cluster or not, Linux or not ... for that we
must have some kind of plan that looks like it could work :-)

I hope so.

I'm not sure about the shell account, but I'd rather say no - sorry,
this is not in my hands... maybe we can find another solution, like a
defined interface.

No problem, Julian has a test setup. This should be enough. Most of the time he only needs to compile the code (and it works!) as a test.

  > Oh, I'm not so sure if we can do state table synchronisation with
  > ESP/AH hashed table entries, unless we find out how the timeout and
  > such work. Maybe it's enough to synchronise the SPD pool.

sorry - what is the SPD pool?

It's the security policy database on each side of the IPsec tunnel. It is part of KLIPS and holds the information needed to build the SAs.
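
If you want to poke at that state on a FreeS/WAN box, KLIPS exports it through /proc and the ipsec(8) wrapper (just a sketch, from my memory of FreeS/WAN, so double-check on your version):

    ipsec look                   # summary: eroutes, SAs and the routing table
    cat /proc/net/ipsec_eroute   # the policy side: which subnets map to which SA
    cat /proc/net/ipsec_spi      # the SAs themselves (algorithms, lifetimes)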

Let's say many customers will use it :-)

Ok.

ok, it's a little bit more than that, maybe the term Traffic Controller
would be more accurate - we know what tunnels are possible
and we know what tunnels are active. So we *have* a reflection of the
active state, and also have an idea of how much traffic will come from
which IPs.

I still wonder how this is possible. How can you know how much traffic you get from which IP in advance? Or is this yet another English understanding problem of mine?

The important thing is that when a tunnel from IP 192.168.123.234 comes
up, we already know it will tunnel e.g. 143.193.40.32/28 AND, what is
more important, if a packet for IP 143.193.40.36 comes from the 'target
subnets', we already know it will go through an IPsec tunnel to
192.168.123.234.
And more than that, 192.168.123.234 has to tell us before it can send
anything, so we know 'the tunnel will be built in a few seconds'.

That needs some damn intelligent software, which needs to be more HA than the rest of the framework, because if your system fails, no customer will be able to connect (even though the tunnels might still be working), right?

how many connections can persistency handle? 1,000 'IPsec terms' is a
starting number :-) Assume the average number of hosts behind an 'IPsec
term' to be about 32. So we have 32,000 hosts. This is to be multiplied
by the average number of connections/host (for the Director at the
'target subnets' side) ...

This would need, let's calculate ...

32*1000*128 Bytes ~ 4 MBytes. Hmm, do you think you can afford that much RAM for your boxes? :)

That's what you would need on the director to sustain the template entries in the hash table for LVS_DR. Since those are not per second, this is pretty much nothing. You can take an old P166 to do that.
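
Just so the number doesn't look like magic, the arithmetic in shell (the 128 bytes per entry is the usually quoted figure, not something I measured):

    # 1000 IPsec terms * 32 hosts each * 128 bytes per entry
    echo $((1000 * 32 * 128))    # -> 4096000 bytes, roughly 4 MB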

It's the same problem, but from the other direction:

target-host1---|                |--IPSec----- many-host1,2,3...
               |                |  decrypt.
target-host2---|   IPSec        |--IPSec----- many-host46,47,48...
               |-- encryption --|  decrypt.
...           ...              ...
               |                |
target-hostm---|                |--IPSec----- many-hostn,n+1,n+2...
                                    decrypt.

The first problem is right to left, the second one is left to right.

Honestly, I don't see a problem. It's just a routing issue: host1 connects via IPsecGW_a1 to the VIP, which distributes it to one IPsecGW_bX, which gets the routing information from the ESP packet after decrypting it and sends it on. Of course IPsecGW_bX knows where to route the return packet after encryption. Maybe I still don't get it, so let's find out: could you tell me if your setup would work without a load balancer, just by doing a 1:1 IPsecGW_aX <-> IPsecGW_bX assignment? Because if that works, then it works with LVS_DR.
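
To make it concrete, here is the kind of setup I have in mind (only a sketch, untested with ESP; the VIP 10.0.0.1, the RS addresses and the mark value are invented): mark IKE, ESP and AH for the VIP with one fwmark on the director and balance that mark with a persistent LVS_DR service, so the negotiation and the tunnel traffic of one client always hit the same RS:

    # on the director: mark IKE (udp 500), ESP (proto 50) and AH (proto 51)
    iptables -t mangle -A PREROUTING -d 10.0.0.1 -p udp --dport 500 -j MARK --set-mark 1
    iptables -t mangle -A PREROUTING -d 10.0.0.1 -p 50 -j MARK --set-mark 1
    iptables -t mangle -A PREROUTING -d 10.0.0.1 -p 51 -j MARK --set-mark 1

    # one persistent fwmark service, direct routing to the IPsecGW_bX boxes
    ipvsadm -A -f 1 -s wlc -p 3600
    ipvsadm -a -f 1 -r 10.0.0.11 -g     # IPsecGW_b1
    ipvsadm -a -f 1 -r 10.0.0.12 -g     # IPsecGW_b2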

  > Every IPsec term. needs to be able to address all subnets or it defeats
  > the purpose of load balancing!

I assume IPSec term is IPSec in the RS.

Yes :) [IPsec term. == IPsec endpoint == IPsecGW_bX == RS]

Then the above statement is true, as long as no tunnel is set up. When
the tunnel for a subnet is up, we have no choice. Tunnels will stay up
for hours, days, weeks ...

Julian, do you copy? WEEKS!! This is going to use more RAM eventually. What is your rekeying time policy? Do you provide an SLA-like agreement where you say that after 2 weeks of no byte stream we will close the tunnel?
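
By the way, you can watch the persistence templates and their remaining timeouts on the director with ipvsadm's connection listing (syntax from memory, check the man page):

    ipvsadm -L -c -n    # the template entries show up here with their expiry;
                        # each one is what costs you the ~128 bytes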

  > Well, the problem with this is that in IPsec tunneling mode, you simply
  > don't have this information about the to-be-routed subnet in the non
  > decrypted part of the IP packet. Read my email exchange with Julian. In
  > tunneling mode you only have the daddr which is equal to the VIP and
  > only the IPsec term. after deciphering knows where the packet needs to
  > be routed to.

WE know it (see NMS section above).

That's sweet, but how does it help you unless you can influence the RS selection? Unfortunately, in a cluster you never know which RS will get which IPsec connection. Or did you want to statically assign every possible IPsec connection pool from your CIP db to a dedicated RS? That is possible if you really want to do it. I mean, you could set up the whole beast as follows:

o your NMS has a DB which "knows" about incoming CIP1->dst_net1 and
  thus sets up the director with a fwmark1 (CIP1/32->VIP/32), and if it
  smells that CIP2 is requesting an IPsec connection, you set up a new
  entry in LVS using fwmark2 (CIP2/32->VIP/32) - roughly like the
  sketch after this list.
o your RS (IPsec endpoints) still need to be able to route to all
  possible net entities addressed by your customers.
o This topology of mine tends to load-imbalance horribly.
o The probability that I still didn't understand the influence of your
  NMS is not negligible.
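
Roughly, per customer your NMS would then fire something like this at the director (again only a sketch; the VIP and the RS addresses are invented, CIP1 is your example address):

    # customer 1: everything from CIP1 to the VIP gets fwmark 1 -> RS1
    iptables -t mangle -A PREROUTING -s 192.168.123.234 -d 10.0.0.1 -j MARK --set-mark 1
    ipvsadm -A -f 1 -s wlc -p 3600
    ipvsadm -a -f 1 -r 10.0.0.11 -g

    # customer 2 shows up: new mark, new service, possibly another RS
    iptables -t mangle -A PREROUTING -s 192.168.99.1 -d 10.0.0.1 -j MARK --set-mark 2
    ipvsadm -A -f 2 -s wlc -p 3600
    ipvsadm -a -f 2 -r 10.0.0.12 -g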

sounds interesting - where can I read more about it without
understanding the code itself - did I miss some piece of documentation?
If not, where should I start reading the code?

Oh, read the documentation about persistency. This should keep you busy for a while:

http://www.linux-vs.org/Joseph.Mack/HOWTO/LVS-HOWTO-8.html
http://www.linuxvirtualserver.org/~julian/LVS.txt

I could not follow your conversation in detail. Maybe I'll find the time
to read it again this week, with a bit more investigation of the background.

Good.

  >>Back to these mysterious hashes: if we make them static, we waste the
  >>opportunity for the 'node scheduling' - so we must get to something
  >>semi-dynamic :-)
  >>
  >
  > You lost me here.

Oops, maybe it's too far away from LVS. If it's still important, I'll
try to describe it again...

Ok.

  >>We think of a big hash that stores the IP addresses and the node for
  >>that IP-address. So if you have an IP, a quick search for the
  >>corresponding node is possible.
  >>
  >
  > What do you refer to as node?

Sorry! Node = RealServer
I should use your terminology!

Hmm, it really sounds like you want to use fwmark and persistency like hell :) Tell me, what do you mean by IP addresses? The CIP or the destination networks?

  >>A look at the sh and dh algorithms was quite helpful for me. But there
  >>the hash is too small, and it's static.
  >>
  >
  > What do you mean with static hash?

I have the impression that I did not understand sh / dh very well.
But the only description was the following:
"The destination hashing scheduling algorithm assigns network
connections to the servers through looking up a statically assigned hash
table by their destination IP addresses"
Maybe I should have said 'statically assigned'.

Ahh, ok. But if I get your idea, the dh scheduler would not help either, because you don't get to choose which hash bucket corresponds to which RS. And besides, I think the table would be big enough. It is a hash table with a doubly-linked list for multiple equal hash matches. Maybe you want something like the following:

H1 ----> RS1 (members: CIP1 <--> CIP2 <--> CIP6)
H2 ----> RS2 (members: CIP4 <--> CIP3)
H3 ----> RS3 (members: CIP5)
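
To give you a feeling for the 'arithmetic transformation' part: the dh scheduler does (if I remember the source right, so treat the constant as an assumption) a multiplicative hash of the destination address and masks it down to the table size, something like:

    # hash 143.193.40.36 (0x8FC12824) into a 128-slot table
    echo $(( (0x8FC12824 * 2654435761) & 0x7F ))    # -> slot 100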

hashing: a method for directly referencing records in a table by doing
arithmetic transformations on keys into table addresses. [Robert
Sedgewick, Algorithms]

:) Yep, always got my copy handy.

But I don't think, that answers your question :-)

Success! We 100% agree on something.

You might have wondered about the 50MB. I intended to have a hash to
search for IP addresses to get back a node number.
In the near future we will have 10000 (10K) IPsec tunnels. Behind each
tunnel we'll have a subnet with 32 as an average number of IP addresses.
So we have 320K IP addresses. Somewhere (don't ask me where, I think it
was for the persistence table) I read that for an entry in a hash you
need 128 bytes. 320000 x 128 = 40960000 (bytes)
40960000 / 1024 = 40000 (kilobytes)
40000 / 1024 = 39.0625 (megabytes)
This would be the memory for the entries, without the hash itself, which
should have twice the number of entries as its size for a fast search.

The dh and sh schedulers only have 128 (unique) entries in their hash table.

where is my mistake?

The hash is included in the 128 bytes indirectly, via a pointer to the ip_vs_scheduler struct which calls the appropriate scheduler. I think I now understand the table you have drawn in the other email. You can do such a setup with fwmarks and persistency, but it will not help, because the director will never find out from the encrypted packet it has to load balance where the final packet will be routed to, and thus it will never match one of your rules. So no, this doesn't work, I'm afraid. You need to let the director choose an RS for you and then make sure every RS can route to every network, unless!!! you really know the CIP for every single customer and can assure it will stay the same, or at least consistent with your NMS DB. Then it will work again, however. It would be a pleasure for me to give you the idea for that if I one day figure out the connection between your NMS, your customers, the RS and the load balancer.

so scheduling in userspace would be an alternative?

Could be, but I still think it's just me who hasn't completely understood what you want.

WOW! that's encouraging!

The routing will definitely be possible and thus the load balancing with LVS_DR. For NAT we're not so sure yet.

Maybe I don't understand it myself, but some things correspond to our
product/business case/whatever you call it. I hope I will be able to
tell more about it, but at the moment I can't - at least not here in the
list ...

I understand.

Thanks again to everyone involved...

No problem, not a lot has been done yet, except that Julian and I read some RFCs ...

Best regards,
Roberto Nibali, ratz


