Re: Performance evaluation of LVS

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: Performance evaluation of LVS
Cc: orourke@xxxxxxxxxxx, keefe@xxxxxxxxxxx, Joseph Mack <mack.joseph@xxxxxxx>
From: Roberto Nibali <ratz@xxxxxx>
Date: Mon, 16 Apr 2001 10:47:57 +0200
Mike Keefe wrote:
> All,
>     For those desiring to look at our recent performance evaluation of LVS,
>     is provided.  We welcome any and all comments.  I'll be answering any
>     related questions for the next two weeks during the absence of
>     Pat O'Rourke, who is on vacation.
>     Mike Keefe

Hi Mike,

Sorry for the late reply, but as people on this list know, I'm
very busy, as basically everybody is :)

Reading your excellent performance evaluation of the LVS
project, I first thought: oh no, yet again people without
in-depth knowledge of Linux tuning, networking, TCP/IP and SMP.
But as I proceeded through the text I was amazed by the high
level of professionalism with which this evaluation was done.
The final impression: you did a very nice job, guys, and not
because you wrote that in your opinion LVS is a replacement
worth considering for commercial load balancers, but because
you touched nearly every tuning parameter I would have expected
to have an important impact on the final test results. It was
funny: I was reading through the text with a blue pen, and in
the beginning my notes were bigger than your actual text. But
by the time I finished the article, most of my open questions
and concerns had been covered. However, there are some comments
I want to share (it took a long time, I know, but I currently
have hardly any time to contribute useful knowledge and ideas
to the project itself). Please excuse my bad English; I'm not a
native English speaker and I've written this in quite a hurry.

o In the abstract you write about the detrimental impact of
  multiple (PCI) devices sharing interrupts. And I know that
  you also made some tests with APIC enabled for SMP. I
  vaguely recall that Andrea Arcangeli or Andrew Morton once
  published a nice APIC patch for UP systems for PCI 1.1
  compliance which addresses this problem for 2.2.x kernels.
  I was wondering if you found that patch.

o In ``Table 1: Comparison of LVS Features'' you list as an
  advantage of the Direct Routing method that "Director only
  handles client-to-server half of connection ...". To me it's
  clear that this statement refers to the obviously higher
  throughput you can achieve, but people could also misinterpret
  it as a disadvantage because of the malicious attacks that are
  possible on this non-trivial TCP routing path. I suggest you
  write "higher throughput", as Wensong did in his first paper.
  It's simpler and people don't get confused.

o In ``Table 1: Comparison of LVS Features'' again, for the DR
  method you list under Disadvantages the two statements
  "Servers need public IP addresses" and "Server OS requires
  non-ARPing network interface". The first statement is wrong.
  It's a matter of your network setup, and DR doesn't imply
  that you have to use public IP addresses. This is a common
  error. We use only direct routing setups for our customers,
  and the NAT is done by a powerful Cisco router and/or a
  proxy firewall. The second statement has to be taken with
  care. You're right that the server OS must provide a
  non-ARPing interface, but you forgot that in most cases you
  also need alias support, and there lies the big problem:
  some Unices have no support for aliased interfaces, or only
  limited support, such as QNX, Aegis or Amoeba. Others have
  interface flag inheritance problems, like HP-UX, where it is
  impossible to give an aliased interface a flag vector
  different from that of the underlying physical interface. So
  for HP-UX you need a special setup, because with the standard
  depicted setup for DR it will NOT work :) I've set up most
  Unices as realservers and was negatively astonished by all
  the implementation variations among the different Unix
  flavours. This may have resulted from unclear statements in
  the RFCs.

o Seeing your very intelligent and nice test setup, I have
  some input for you to consider; more on the setup follows in
  subsequent paragraphs. The problem I see is that you used a
  kernel (2.2.17) with massive TCP problems. One of them is the
  wrong, or at least not optimal, TCP slow start; another is
  the congestion window problem, which shows up at a certain
  connection rate and gave me a hard time figuring out why my
  rsync over ssh would stall all the time. I suggest that
  running the tests with 2.2.19 should show some performance
  differences. I might be wrong though, looking at your
  Figure 2 ``Netpipe Signature Graph''. I reckon the curve
  might get even steeper using a 2.2.19 kernel.

o I like your tuning methods, although there would have been
  more things to check. Would it be possible, for example, for
  you to post the .config file of the kernel you compiled? It
  may reveal some performance issues too. Did you test
  tweaking the TCP/IP settings to lower the memory pool that is
  held persistently? For example, for your test, where you
  know that you only fetch ~900 bytes per request in one single
  non-fragmented packet, you might set the TCP settings in the
  proc-fs very low in order to allow the kernel and the
  mm-structure for the skb to release the buffer very fast and
  to mark it dirty as rarely as possible. Cache pollution may
  be reduced and the control path optimized. This is how
  commercial vendors have their hardware tested.
  Another problem I see in your results seems to be the very
  early saturation of the NIC. Yes, I do believe it is a
  problem of the driver not being capable of zero-copy skb
  passes. Could you share the number of frames and overruns
  you had during the test? After seeing the numbers for the
  NIC throughput in Figure 2, I'm sure there is a lot of
  optimization possible :) I have yet to figure out whether
  this is the overall limitation I've seen in all of your
  figures. There seems to be a 4000/8000/12000 connections/s
  barrier for 2.2 single CPU, DR/NAT/2.4 kernel. This 4000
  limitation might also be a signal queue delivery problem, as
  we see in our packet filter tests. Filtering more than
  40 Mbit/s, which would be the rate you are experiencing with
  4000 conns/s, under the 2.2 masquerading code becomes a very
  difficult task. Since LVS hooks into the masquerading code,
  we have the same problem, whether filtering or load
  balancing.
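
A quick back-of-the-envelope check of the 4000 conns/s versus
40 Mbit/s relation above, as a small Python sketch. The function
names are made up for illustration, and 1 Mbit is assumed to be
10^6 bits. The bytes-per-connection figure is only what the two
quoted numbers imply together, not a measurement.

```python
def bytes_per_connection(mbit_per_s, conns_per_s):
    """Average bytes on the wire per connection implied by a bit rate
    and a connection rate (1 Mbit assumed to be 10**6 bits)."""
    return mbit_per_s * 1_000_000 / 8 / conns_per_s


def throughput_mbit(conns_per_s, bytes_per_conn):
    """Bit rate in Mbit/s produced by a connection rate and an average
    number of bytes per connection."""
    return conns_per_s * bytes_per_conn * 8 / 1_000_000


implied = bytes_per_connection(40, 4000)   # 1250.0 bytes per connection
rate = throughput_mbit(4000, implied)      # 40.0 Mbit/s again
```

So the two figures are consistent with roughly 1250 bytes per
connection, which is plausible for a ~900 byte reply plus headers
and the packets for connection setup and teardown.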

o What I miss a bit in your results are the memory consumption
  during and after the tests, the CPU loads, and the 2.4.x
  memory balance information from the proc-fs. I am aware of
  the figures you give in Table 5 and Table 6. I can only make
  some suggestions based on Figure 4 ``Single CPU Linux 2.2.17,
  LVS-DR Throughput'', which is indeed a very interesting
  sketch, at least for marketing purposes :). Interesting for
  me is the comparison between "Direct" and "DR, 1 RS" after
  the 4000 peak ( --|-- vs. --x-- ). If we follow the lines up
  to 14000 requests/s, we see a steeper curve for DR than for
  the direct connection. My assumption: our heavy use of
  kmalloc for the multiple tables (128 bytes per entry) and
  the advanced state transition checks. (I hope you disabled
  secure_tcp for your tests.) These are the additional kernel
  control paths we have to walk for the LVS code, plus some
  additional if's in the masq code and, I think, one in the
  routing decision :) According to this observation of mine,
  your statement on page 10, second paragraph, "LVS-DR with a
  single real server appears to be on par ..." is only true up
  to the magic 4000 limit. After that we have an exponential
  drift between the scalability of the direct connection and
  that of DR over LVS. I assume that by taking two key/value
  pairs and solving the equation y=a*x^b for the parameters a
  and b, we might get the O(f(x)) relation for the overhead
  introduced by the LVS code. Your mileage may vary. Wensong,
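
Solving y=a*x^b from two points, as suggested above, can be
sketched in a few lines of Python. Dividing the two equations
gives b = ln(y2/y1) / ln(x2/x1), and then a = y1 / x1^b. The
function name and the sample points are made up for illustration;
they are not values read off Figure 4.

```python
import math


def fit_power_law(p1, p2):
    """Return (a, b) such that y = a * x**b passes through both points.

    Each point is an (x, y) pair with x > 0 and y > 0, and the two
    x values must differ.
    """
    (x1, y1), (x2, y2) = p1, p2
    b = math.log(y2 / y1) / math.log(x2 / x1)
    a = y1 / x1 ** b
    return a, b


# Hypothetical points lying on y = 3 * x**2, only to exercise the code.
a, b = fit_power_law((2.0, 12.0), (4.0, 48.0))
```

With real data one would feed in two (request rate, overhead)
pairs taken from the measured curves. Two points determine the
fit exactly, so a third point is needed to judge how well the
power-law model holds.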

o For Figure 5 ``SMP scaling on Linux 2.2.17'' I was wondering
  if we would see the same linear improvement over UP if we ran
  the following test: 1 CPU @ 1 RS versus 4 CPU @ 1 RS.

o On page 14 you talk about the theoretical limit of 16000 conns/s
  but you give no context for what should cause this limitation.
  Can you enlighten me, please?

o Table 4 "Aggregate Network I/O": I'm afraid I think the last
  two values are not correct, since in the LVS-DR setup the
  replies don't go through the load balancer. Only the SYN, the
  ACK, the FIN or FIN ACK pass through it, but no payload, and
  I doubt you get 12.5 Mbit/s of I/O from simple TCP state
  flags alone. Please check the numbers again and report back
  to me; in case I'm wrong, I stand corrected.

o Now a really astonishing fact is that you obviously found out
  that the do_aic7xxx_isr() function is being called for every
  packet received. Well, I don't yet have full in-depth knowledge
  of the 2.4 kernel internals, but I know that the aic7xxx.c driver
  is currently being changed frequently by Gibbs, and if you read
  the lkml you'll see a new big patch for this super-broken driver
  nearly every week. In your case, I would have put in cheap IDE
  disks, because disk I/O was not an important factor for this
  test.

o Ok, I'm slowly coming to an end :). Page 19, Chapter 4.5
  "LVS versus a hardware load balancer" is, at least for me, a
  very interesting and important chapter. I've performed
  hundreds of tests and different setups with load balancers,
  from Cisco through BigIP and ServerIron to ACEdirector[2|3|180].
  I've collected various statements from their labs.
  Unfortunately I've got a very limited budget and therefore
  only a small 100Mbit test lab, but nevertheless I could see
  the same trend in cost effectiveness that your article
  states. I have some additional input from my experience with
  commercial load balancers. Now, where should I start? Well,
  let's start with a completely unrelated, stupid and
  non-rational (but sometimes true) statement of mine:
  "Commercial load balancers suck! They are expensive, not
  flexible, limited, support is not technical enough, buggy and
  insecure." Choose one of the listed topics, compare LVS with
  a commercial product on it, and try to weigh the importance
  of your deployment against a possible failure of a commercial
  load balancer in that area. Yes I know, it is true that they
  may load balance faster and consume less memory and CPU, but
  - how come I use my 16000$ ACEdirector2 as a stupid switch for
    my LVS net? This is simple: lack of thorough technical support
    for my setup. Yes, I wanted to have full redundancy with
    a real audio sticky stream setup for and yes,
    I wanted to load balance more than 16 realservers without
    buying a stackable ACE unit for +7000$ and yes, I wanted
    security and not a buggy packet filtering implementation
    based on an even buggier TCP/IP stack implementation.
  - why for heaven's sake does the Cisco LocalDirector start
    passing 9% of the NAT'd packets through directly, without
    NATing them, when the load rises under a SYN-flood attack?
    Yeah, it's state transition table corruption, and people
    neither care nor realize it, because there are still too few
    people looking at a tcpdump or firewall/IDS logfile to
    verify the correct packet handling of the closed black box
    "commercial load balancer". But the manager sees the nice
    graphical chart where it says "... up to 1'000'000
    concurrent connections" or "can loadbalance firewalls :)",
    and this is enough to convince him to buy a Gigabit load
    balancer for the E10k Sun cluster of his very important
    e-commerce project, which has a total throughput at peak
    times of about 15Mbit/s. Your article helps open people's
    eyes to true cost effectiveness and minimal operating time.
    So, now let's get back to technical stuff and fewer OT
    threads ... (BTW, this is solely my opinion and not my
    company's.)
  - could anybody tell me how to perform an end-2-end healthcheck
    for cluster availability for a 3-tier Java architecture with a
    commercial load balancer? You can do every defined check up to
    OSI layer 5, but what if I have to check my database with
    special input to be sure the service is still available? To add
    this feature, which is very important for datacenters and SLAs,
    you have to multiply the cost of a commercial load balancer by
    2-3.
  - how about technical help? I intend to optimize database load
    balancing, which I do on the same load balancer box as the
    frontend webserver load balancing. Now, I change the TCP
    state transitions to address the different behavior of databases
    in TCP/IP networks. I tell this to my local ServerIron vendor and
    ask him to give me the equivalent setup, because I see that with
    his setup the mSQL db dies, and rather than fixing the db I adjust
    some TCP parameters. No response, just yet another IOS patch ...
  - SECURITY! Every commercial load balancer I tested some time
    ago had serious problems with forged TCP or UDP packets. I recall
    the time when you could reboot a load balancer with nmap. With
    LVS you have completely working packet filtering and more
    or less effective DoS defense strategies built in.
  - redundancy: Now this is one of the still weak points of the LVS
    load balancer, which you haven't explicitly addressed and haven't
    included in your total cost of a load balancer setup. If
    redundancy is a factor in deciding which load balancer to deploy,
    one has to consider that the cost effectiveness of the LVS
    approach rises even more, because for every commercial HA setup
    you have to buy either a second load balancer for the same price
    or the additional software. For an HA LVS cluster you only buy
    the hardware for the second load balancer node, download the HA
    package, and off you go.
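
The end-to-end healthcheck asked for in the list above (checking
the database with special input, not just the TCP port) could look
roughly like the following Python sketch. Everything here is an
assumption for illustration: the function names, the probe query
and the in-memory SQLite database standing in for the real backend.
It is not how any LVS monitoring tool actually does it.

```python
import sqlite3


def database_is_healthy(connect, probe_sql, expected_row):
    """Return True only if the probe query yields the expected row.

    `connect` is any callable returning a DB-API connection, so the
    same check can be pointed at a real backend.
    """
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute(probe_sql)
            row = cur.fetchone()
        finally:
            conn.close()
    except Exception:
        # Connection refused, missing table, SQL error: all count as down.
        return False
    return row == expected_row


def make_demo_db():
    """Stand-in for the real database tier (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE health (token TEXT)")
    conn.execute("INSERT INTO health VALUES ('alive')")
    return conn


ok = database_is_healthy(make_demo_db, "SELECT token FROM health", ("alive",))
broken = database_is_healthy(make_demo_db, "SELECT token FROM missing", ("alive",))
```

A monitor on the director would run such a check periodically and
remove the realserver from the LVS table when it returns False.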

o Last but not least (I will shut my mouth after this paragraph): a
  little tip for the FD_SET problem in 2.2.x kernels: use SIGIO,
  drop the select system call and work with signals :) Your problems
  are gone. Read the OLS paper on queued SIGIO that Andi Kleen wrote
  last year, or ask Julian Anastasov :). On page 23 you give 2
  reasons/problems you experienced using the httperf tool. I suggest
  that tuning TCP_TIMEWAIT and CLOSE_WAIT should increase the number
  of concurrent connections, since the fd's can be closed faster.
  Glancing at your e1000_kcompat.h patch, I assume you loaded the
  driver as a module? If so, why?
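
A minimal sketch of the SIGIO idea from the paragraph above,
written in Python rather than C for brevity. It assumes a Linux
host and must run in the main thread. The names events, on_sigio
and enable_sigio are made up for illustration, and it uses plain
SIGIO, not the queued real-time variant described in Andi Kleen's
paper.

```python
import fcntl
import os
import signal
import socket
import time

events = []  # signal numbers recorded by the handler


def on_sigio(signum, frame):
    # Keep the handler tiny: only note that some descriptor became ready.
    events.append(signum)


def enable_sigio(sock):
    """Ask the kernel to send SIGIO to this process on I/O readiness."""
    fd = sock.fileno()
    fcntl.fcntl(fd, fcntl.F_SETOWN, os.getpid())       # who gets the signal
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_ASYNC)  # turn on signal-driven I/O


signal.signal(signal.SIGIO, on_sigio)
reader, writer = socket.socketpair()
enable_sigio(reader)

writer.send(b"ping")  # reader becomes readable, the kernel raises SIGIO

# Wait for the signal; no select() and no fd_set involved.
deadline = time.monotonic() + 2.0
while not events and time.monotonic() < deadline:
    time.sleep(0.01)

data = reader.recv(16) if events else b""
reader.close()
writer.close()
```

Because no fd_set is built, the FD_SETSIZE limit of select() does
not apply. Plain SIGIO does not say which descriptor is ready,
which is the gap the queued variant closes.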

Finally I have to say again that this is a very, very good report
reflecting the real numbers and problems with LVS, and it shows that
very talented and knowledgeable guys performed these tests. Thank
you for doing it. Honestly, I was a little bit angry that I didn't
get an earlier draft of this report, because after working for and
with the LVS project for over 2 years I'm sure some of my points
here could have had an impact on your material. I hope I could give
you a useful summary from yet another perspective. Keep up the good
work; I'm looking forward to seeing more LVS tests from you.
I'm happy that the LVS project made it from more or less academic
proof-of-concept code to a load balancer fully comparable to
commercial solutions.

Best regards, the crazy guy,
Roberto Nibali, ratz
