
Re: FreeS/WAN Cluster - our approach

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: FreeS/WAN Cluster - our approach
From: Henrik Rossner <lvs@xxxxxxxxxxxxxxxxx>
Date: Mon, 18 Feb 2002 16:39:09 +0100
Hello,

  > No worries, you're invited to help coding if you can ;)

I'm not sure if I can, but if we're lucky, our management will agree that
it's necessary to get some resources from our software developers - but
I can't promise.

  > In what timeframe does this stuff need to work? And would you maybe
  > consider giving Julian and me a shell account on your setup, because
  > I doubt we can set this up at home?

Timeframe (backwards):
We want to have a system in production by the end of the year, which
includes testing.
In May we want to know in detail what we will do.
By the end of this month (maybe first week in March) we have to make the
decision: LVS or not, cluster or not, Linux or not ... for that we must
have some kind of plan, which looks like it could work :-)

I'm not sure about the shell account, but I'd rather say no - sorry,
this is not in my hands... maybe we can find another solution like a
defined interface.

  >>Target:
  >>-------
  >>Cluster of IPSec Nodes for Load balancing (redundancy will be added
  >>later).
  >>
  >
  > Oh, I'm not so sure if we can do state table synchronisation with
  > ESP/AH hashed table entries, unless we find out how the timeout and
  > such work. Maybe it's enough to synchronise the SDP pool.

Sorry - what is the SDP pool?

  >>Assumptions:
  >>------------
  >>- high traffic rates (200MBit+ in total)
  >>- many tunnels (1000+)


  > You must have a lot of customers ... or a broken application.

Let's say, many customers will use it :-)

  >>- we somehow know* about all the existing tunnels and the subnets behind
  >>them
  > What do you mean by that?
  >
  >>* in fact we have this Information in our 'Network Management System'
  >>BEFORE a tunnel is set up.
  > Why before and what good does a NMS do if it doesn't reflect the current
  > real picture?

OK, it's a little bit more than that; maybe the term Traffic Controller
would be more accurate - we know which tunnels are possible
and we know which tunnels are active. So we *have* a reflection of the
active state, and also have an idea of how much traffic will come from
which IPs.
The important point is that when a tunnel from IP 192.168.123.234 comes
up, we already know it will tunnel e.g. 143.193.40.32/28 AND, what is
more important, if a packet for IP 143.193.40.36 comes from the 'target
subnets', we already know it will go through an IPSec tunnel to
192.168.123.234.
And more than that, 192.168.123.234 has to tell us before it can send
anything, so we know 'the tunnel will be built in a few seconds'.
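
The knowledge described above can be pictured as a small lookup table. What follows is only a minimal sketch in Python, assuming the management system can hand out the planned tunnels as endpoint/subnet pairs. The names PLANNED_TUNNELS and endpoint_for are made up for illustration and belong to neither LVS nor FreeS/WAN.

```python
import ipaddress

# Hypothetical data: tunnel endpoint -> subnet behind it
# (the values are the example addresses from the paragraph above).
PLANNED_TUNNELS = {
    "192.168.123.234": "143.193.40.32/28",
}

def endpoint_for(dst_ip, tunnels=PLANNED_TUNNELS):
    """Return the tunnel endpoint whose subnet contains dst_ip, or None."""
    addr = ipaddress.ip_address(dst_ip)
    for endpoint, subnet in tunnels.items():
        if addr in ipaddress.ip_network(subnet):
            return endpoint
    return None

# A packet for 143.193.40.36 falls into 143.193.40.32/28.
print(endpoint_for("143.193.40.36"))
```

A linear scan like this is only meant to show the idea; with thousands of tunnels a per-host table or a prefix tree would be the more likely choice.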

  >
  >>       +--------------+   +--------------+
  >>       |  IPSec term. |   |  IPSec term. |
  >>       +--------------+   +--------------+
  >>               |                |
  >>       +--------------+   +--------------+
  >>       | many Subnets |   | many Subnets |
  >>       +--------------+   +--------------+
  >>
  >>We want to make it possible to have a secure connection from the 'many
  >>Subnets' to the 'target subnets'.

  > Do you intend to run IPsec in tunneling mode? I would assume so for
  > security and compatibility reasons.

Yes, you're right.

  >
  >>Main Problems:
  >>--------------
  >>- packets of the same IPSec tunnel MUST terminate in the same box

  > Once we get the routing of ESP/AH packets in LVS this is a piece of
  > cake with persistency.

How many connections can persistency handle? 1,000 'IPSec terms' is a
starting number :-) Assume an average number of hosts behind an 'IPSec
term' to be about 32. So we have 32,000 hosts. This is to be multiplied
by the average number of connections/host (for the Director at the
'target subnets' side) ...

  >>- packets from 'target subnets' destined for 'many Subnets' through an
  >>IPSec tunnel MUST go to the correct node (the one that terminates/begins
  >>the tunnel).
  >>
  >
  > My understanding of IPsec says this is an implication of the problem
  > above. Well, since we do LVS-DR, we don't rewrite the IP header and
  > thus still have the saddr information. If your routing is ok on the
  > RS this is a solved issue.

It's the same problem, but from the other direction.

target-host1---|                |--IPSec----- many-host1,2,3...
               |                |  decrypt.
target-host2---|   IPSec        |--IPSec----- many-host46,47,48...
               |-- encryption --|  decrypt.
...           ...              ...
               |                |
target-hostm---|                |--IPSec----- many-hostn,n+1,n+2...
                                   decrypt.

The first problem is right to left, the second one is left to right.


  >>Our Approach for the solution:
  >>------------------------------
  >>(at first I look at the problem from bottom to top of the drawing)
  >>We want to distribute the IPSec tunnels to the nodes. Each 'IPSec term'
  >>has:
  >>- an IP-Address (=tunnel starting address)
  >>- a subnet behind it (in the future there may be more of them, but for
  >>now one will do the job)
  >>
  >
  > Every IPsec term. needs to be able to address all subnets or it defeats
  > the purpose of load balancing!

I assume that by 'IPsec term.' you mean the IPSec in the RS.
Then the above statement is true, as long as no tunnel is set up. When
the tunnel for a subnet is up, we have no choice. Tunnels will stay for
hours, days, weeks ...

  >
  >>Director 1:
  >>send packets for subnet of ipsec_term_003 to node_1
  >>send packets for subnet of ipsec_term_009 to node_1
  >>send packets for subnet of ipsec_term_002 to node_1
  >>
  >>send packets for subnet of ipsec_term_005 to node_2
  >>send packets for subnet of ipsec_term_001 to node_2
  >>send packets for subnet of ipsec_term_010 to node_2
  >>
  >>send packets for subnet of ipsec_term_008 to node_3
  >>send packets for subnet of ipsec_term_004 to node_3
  >>send packets for subnet of ipsec_term_007 to node_3
  >>
  >>Director 2:
  >>send packets for tunnel starting address of ipsec_term_003 to node_1
  >>send packets for tunnel starting address of ipsec_term_009 to node_1
  >>send packets for tunnel starting address of ipsec_term_002 to node_1
  >>
  >>send packets for tunnel starting address of ipsec_term_005 to node_2
  >>send packets for tunnel starting address of ipsec_term_001 to node_2
  >>send packets for tunnel starting address of ipsec_term_010 to node_2
  >>
  >>send packets for tunnel starting address of ipsec_term_008 to node_3
  >>send packets for tunnel starting address of ipsec_term_004 to node_3
  >>send packets for tunnel starting address of ipsec_term_007 to node_3
  >>
  >
  > Well, the problem with this is, that in IPsec tunneling mode, you simply
  > don't have this information about the to-be-routed subnet in the non
  > decrypted part of the IP packet. Read my email exchange with Julian. In
  > tunneling mode you only have the daddr which is equal to the VIP and
  > only the IPsec term. after deciphering knows where the packet needs to
  > be routed to.

WE know it (see NMS section above).
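
To make the two rule sets a bit more concrete: both could be derived from a single assignment of IPSec terms to nodes. This is a rough sketch only. The addresses in TERMS are invented for illustration, the assignment is copied from the listing above for three of the terms, and build_director_tables is a hypothetical helper, not an LVS interface.

```python
# Hypothetical term data: name -> (tunnel starting address, subnet behind it).
# All addresses are invented for this example.
TERMS = {
    "ipsec_term_003": ("10.0.0.3", "172.16.3.0/27"),
    "ipsec_term_005": ("10.0.0.5", "172.16.5.0/27"),
    "ipsec_term_008": ("10.0.0.8", "172.16.8.0/27"),
}

# Term -> node, as in the listing above.
ASSIGNMENT = {
    "ipsec_term_003": "node_1",
    "ipsec_term_005": "node_2",
    "ipsec_term_008": "node_3",
}

def build_director_tables(terms, assignment):
    """Derive both director tables from one term -> node assignment."""
    director1 = {}  # subnet behind the term -> node
    director2 = {}  # tunnel starting address -> node
    for name, (start_addr, subnet) in terms.items():
        node = assignment[name]
        director1[subnet] = node
        director2[start_addr] = node
    return director1, director2

d1, d2 = build_director_tables(TERMS, ASSIGNMENT)
print(d1)
print(d2)
```

Because both tables come from the same assignment, a tunnel and the subnet behind it always end up on the same node, which is the requirement stated under 'Main Problems'.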

  >
  >>The next Idea is to store this Information in a hash in the directors.
  >>
  >
  > This is called the connection affinity template, which is used for the
  > schedulers to effectively load balance traffic.

Sounds interesting - where can I read more about it without having to
understand the code itself? Did I miss some piece of documentation?
If not, where should I start reading the code?

  >>Whether the assignment of the 'ipsec terms' to the nodes is done in
  >>Director 1, Director 2 or another machine isn't clear to me, but it has
  >>to be done in one point (either DR1 or DR2 or somewhere else). Round
  >>robin should work for the beginning, maybe we can do some tuning here,
  >>when the system works.
  >>
  >
  > Yes, as mentioned in my last email to Julian, I have a new algorithm
  > in mind that doesn't need artificial nor TCP header information to be
  > able to load balance.

I could not follow your conversation in detail. Maybe I will find the
time this week to read it again and look into the background a bit more.

  >>Back to these mysterious hashes: if we make them static, we waste the
  >>opportunity for the 'node scheduling' - so we must get to something
  >>semi-dynamic :-)
  >>
  >
  > You lost me here.

Oops, maybe it's too far away from LVS. If it's still important, I'll
try to describe it again...

  >
  >>We think of a big hash, that stores the IP Addresses and the node for
  >>that IP-address. So if you have an IP, a quick search for the
  >>corresponding node is possible.
  >>
  >
  > What do you refer to as node?

Sorry! Node = RealServer
I should use your terminology!
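
As an illustration of the 'big hash' idea (IP address in, RealServer out) that only changes when a tunnel comes up or goes down, here is a small userspace sketch in Python. The class TunnelTable and its method names are hypothetical and say nothing about how this would be done inside LVS.

```python
import ipaddress

class TunnelTable:
    """Maps host IP -> RealServer; entries change only when tunnels come or go."""

    def __init__(self):
        self._by_host = {}

    def tunnel_up(self, subnet, real_server):
        # Add one entry per address in the subnet behind the new tunnel.
        for host in ipaddress.ip_network(subnet):
            self._by_host[str(host)] = real_server

    def tunnel_down(self, subnet):
        # Remove the entries again when the tunnel goes away.
        for host in ipaddress.ip_network(subnet):
            self._by_host.pop(str(host), None)

    def lookup(self, ip):
        return self._by_host.get(ip)

    def __len__(self):
        return len(self._by_host)

table = TunnelTable()
table.tunnel_up("143.193.40.32/28", "rs1")
print(table.lookup("143.193.40.36"), len(table))
```

Storing one entry per address trades memory for a constant-time lookup, which is where the size estimate further down comes from.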

  >
  >>A look at the sh and dh algorithms was quite helpful for me. But there
  >>the hash is too small, and it's static.
  >>
  >
  > What do you mean with static hash?

I have the impression that I did not understand sh / dh very well.
But the only description I found was the following:
"The destination hashing scheduling algorithm assigns network
connections to the servers through looking up a statically assigned hash
table by their destination IP addresses"
Maybe I should have said 'statically assigned'.
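
One way to read 'statically assigned hash table' is sketched below in Python. This is not the kernel code: the table size, the hash function and the way buckets are filled are all assumptions chosen for illustration, and the function names are made up.

```python
import ipaddress

TABLE_BITS = 8                  # assumed size: 2**8 = 256 buckets
TABLE_SIZE = 1 << TABLE_BITS

def bucket_for(ip):
    """Hash an IPv4 address to a bucket index (illustrative multiplicative hash)."""
    return (int(ipaddress.IPv4Address(ip)) * 2654435761) & (TABLE_SIZE - 1)

def assign_buckets(servers):
    """Fill every bucket once, cycling through the configured servers."""
    return [servers[i % len(servers)] for i in range(TABLE_SIZE)]

def schedule(table, dst_ip):
    """Pick the server for a destination address by table lookup only."""
    return table[bucket_for(dst_ip)]

table = assign_buckets(["rs1", "rs2", "rs3"])
print(schedule(table, "143.193.40.36"))
```

In such a scheme the buckets are filled when the servers are configured, not per connection, and many addresses share one bucket. A single address therefore cannot be pinned to a chosen server, which is what our approach would need.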

  >
  >>In fact I have no clue if it's possible to have such a hash which will
  >>probably be >50MB and I have no clue how to modify such a hash from
  >>'outside', i.e.  another server.
  >>
  >
  > :) Might I ask you for your definition of a hash?

hashing: a method for directly referencing records in a table by doing
arithmetic transformations on keys into table addresses. [Robert
Sedgewick, Algorithms]
But I don't think, that answers your question :-)
You might have wondered about the 50MB. I intended to have a hash to
search for IP addresses to get back a node number.
In the near future we will have 10000 (10K) IPSec tunnels. Behind each
tunnel we'll have a subnet with 32 as an average number of IP addresses.
So we have 320K IP addresses. Somewhere (don't ask me where, I think it
was for the persistence table) I read that for an entry in a hash you need
128 bytes. 320000 x 128 = 40960000 (bytes)
40960000 / 1024 = 40000 (kilobytes)
40000 / 1024 = 39.0625 (megabytes)
This would be the memory for the entries, without the hash table itself,
which should have twice as many slots as entries for a fast search.

Where is my mistake?
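
For anyone who wants to replay the estimate with other numbers, here is the same calculation as a few lines of Python. The 128 bytes per entry is the figure I quoted from memory above and is unverified; the variable names are just for this example.

```python
TUNNELS = 10_000          # expected number of IPSec tunnels
HOSTS_PER_TUNNEL = 32     # average number of addresses behind each tunnel
BYTES_PER_ENTRY = 128     # quoted from memory above, unverified

entries = TUNNELS * HOSTS_PER_TUNNEL
total_bytes = entries * BYTES_PER_ENTRY
total_mb = total_bytes / 1024 / 1024

print(entries, total_bytes, total_mb)   # 320000 40960000 39.0625
```

The memory for the table slots themselves is not included, since the size of one slot is not known here.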

  > You cannot store
  > 50 Mbyte in the kernel or I must have missed some OS concepts.

So scheduling in userspace would be an alternative?

  >
  >>That's our 'concept' by now.
  >>Does it sound stupid? or realistic?
  >
  > Some of it sounds great and feasible,

WOW! that's encouraging!

  > the rest I might not understand
  > and as such sounds strange.

Maybe I don't understand it myself, but some things correspond to our
product/business case/whatever you call it. I hope I will be able to
tell more about it, but at the moment I can't - at least not here in the
list ...

  > Keep on thinking and proposing, we will find a solution.

Sounds great, I will do so.

Thanks again to everyone involved...

Henrik.







