Mike Keefe wrote:
>
> All,
> For those desiring to look at our recent performance evaluation of LVS,
> lvs.ps.gz
> is provided. We welcome any and all comments. I'll be answering any related
> questions for the next two weeks during the absence of Pat O'Rourke who is
> on vacation.
>
> Mike Keefe
Hi Mike,
Sorry for the late reply, but as people on this list know I'm
very busy, like basically everybody :)
Reading your excellent performance evaluation of the LVS
project I first thought, oh no, again people without in-depth
knowledge of Linux tuning, networking, TCP/IP and SMP, but as I
proceeded through the whole text I was amazed by how
professionally this evaluation was done. The final impression:
you did a very nice job, guys, and not because you wrote that
in your opinion LVS is a serious replacement for commercial
load balancers, but because you touched nearly every tuning
parameter I would have considered to have an important impact
on the final test results. It was funny: I was reading through
the text with a blue pen, and in the beginning my notes were
bigger than your actual text, but by the time I finished the
article most of my open questions and concerns had been
covered. However, there are some comments I wanted to share
(it took a long time, I know, but I currently have hardly any
time to contribute useful knowledge and ideas to the project
itself). Please excuse my bad English; I'm not a native English
speaker and I've written this in quite a hurry.
o In the abstract you write about the detrimental impact of
multiple (PCI) devices sharing interrupts, and I know that you
also ran some tests with the APIC enabled for SMP. I vaguely
recall that Andrea Arcangeli or Andrew Morton once published a
nice APIC patch for UP systems for PCI 1.1 compliance which
addresses these problems on 2.2.x kernels. I was wondering if
you came across that patch.
o In ``Table 1: Comparison of LVS Features'' you list as an
advantage of the Direct Routing method that the "Director
only handles client-to-server half of connection ...". It's
clear to me that with this statement you refer to the obviously
higher throughput you can achieve, but people could also
misinterpret it as a disadvantage because of the malicious
attacks possible along this non-trivial TCP routing path. I
suggest you simply write "higher throughput", as Wensong did in
his first paper. It's simpler and people don't get confused.
o In ``Table 1: Comparison of LVS Features'', again for the DR
method, you list under Disadvantages the two statements
"Servers need public IP addresses" and "Server OS requires
non-ARPing network interface". The first statement is wrong:
it is a matter of your network setup and does not imply that
you have to use public IP addresses. This is a common error.
We use only direct routing setups for our customers, and the
NAT is done either by a powerful Cisco router and/or a proxy
firewall. The second statement has to be taken with care.
You're right that the server OS must provide a non-ARPing
interface, but you forgot that in most cases you also need
alias support, and there is the big problem: some Unices have
no or only limited support for aliased interfaces, for example
QNX, Aegis or Amoeba. Others have interface flag inheritance
problems, like HP-UX, where it is impossible to give an aliased
interface a different flag vector than the underlying physical
interface. So for HP-UX you need a special setup, because the
standard DR setup as depicted will NOT work :) I've set up most
Unices as real servers and was unpleasantly surprised by all
the implementation variations among the different Unix
flavours. This may be a result of unclear statements in the
RFCs.
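For Linux real servers the usual workaround is to put the VIP
on a non-ARPing loopback alias. A minimal sketch of what I
mean, assuming Julian's "hidden" interface patch is applied on
a 2.2.x real server and using a placeholder VIP (Python only
for illustration):

    #!/usr/bin/env python
    # Sketch: LVS-DR real server setup with the "hidden" patch.
    # The VIP below is a placeholder -- adjust it to your network.
    import os

    VIP = "192.168.1.110"

    # Accept the VIP locally by putting it on a loopback alias.
    os.system("ifconfig lo:0 %s netmask 255.255.255.255 up" % VIP)

    # Don't answer ARP requests for addresses on hidden interfaces
    # (these proc entries exist only with the hidden patch applied).
    for knob in ("/proc/sys/net/ipv4/conf/all/hidden",
                 "/proc/sys/net/ipv4/conf/lo/hidden"):
        open(knob, "w").write("1\n")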
o Looking at your very intelligent and nice test setup I have
some input to add; more follows in the subsequent paragraphs. I
see a problem in that you used a kernel (2.2.17) with massive
TCP problems. One of them is the incorrect, or at least
suboptimal, TCP slow start, and another is the congestion
window problem which shows up at a certain connection rate and
gave me a hard time figuring out why my rsync over ssh kept
stalling. I suggest redoing the tests with 2.2.19; it should
show some performance differences. I might be wrong, though,
judging from your Figure 2 ``Netpipe Signature Graph''. I
reckon the curve might get even steeper with a 2.2.19 kernel.
o I like your tuning methods, although there would have been
more things to check. Would it, for example, be possible for
you to post the .config file of the kernel you compiled? It
might reveal some performance issues too. Did you try tweaking
the TCP/IP settings to lower the amount of memory held
persistently? For example in your test, where you know that you
only fetch ~900 bytes per request in one single non-fragmented
packet, you could set the TCP state timeouts in the proc-fs
very low, so that the kernel and the mm structures behind the
skb can release the buffers quickly and dirty as little memory
as possible. Cache pollution may be reduced and the control
path optimized. This is how commercial vendors have their
hardware tested.
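To make this concrete, a rough sketch of the kind of proc-fs
knobs I have in mind; the exact files differ between 2.2 and
2.4, and the values are only illustrative guesses for a
small-reply benchmark, not recommendations:

    #!/usr/bin/env python
    # Sketch: shrink TCP timeouts/buffers for a benchmark where every
    # reply fits in one small packet. Values are illustrative only,
    # and some of these entries exist only on 2.4.x kernels.
    import os

    knobs = {
        "/proc/sys/net/ipv4/tcp_fin_timeout":    "15",     # shorter FIN_WAIT_2
        "/proc/sys/net/ipv4/tcp_keepalive_time": "600",    # probe idle sockets sooner
        "/proc/sys/net/core/rmem_default":       "16384",  # small default socket
        "/proc/sys/net/core/wmem_default":       "16384",  #   buffers, less mm work
        "/proc/sys/net/ipv4/tcp_rmem":  "4096 16384 32768",  # 2.4.x only
        "/proc/sys/net/ipv4/tcp_wmem":  "4096 16384 32768",  # 2.4.x only
    }

    for path, value in knobs.items():
        if os.path.exists(path):       # skip knobs this kernel doesn't have
            open(path, "w").write(value + "\n")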
Another problem I see in your results is the very early
saturation of the NIC. Yes, I do believe it is a problem of the
driver not being capable of zero-copy skb passes. Could you
share the number of frame errors and overruns you had during
the tests? After seeing the numbers for the NIC throughput in
Figure 2 I'm sure there is a lot of optimization possible :) I
still have to figure out whether this is the overall limitation
I've seen in all of your figures. There seems to be a
4000/8000/12000 connections/s barrier for 2.2 single CPU,
DR/NAT/2.4 kernel respectively. This 4000 limitation might also
be a signal queue delivery problem, as we see in our packet
filter tests: filtering more than 40 Mbit/s, which is about the
rate you are experiencing at 4000 conns/s, becomes a very
difficult task under the 2.2 masquerading code. Since LVS hooks
into the masquerading code we have the same problem, whether
filtering or load balancing.
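If you still have access to the boxes, the frame error and
overrun counters I asked about are easy to collect from
/proc/net/dev; a small sketch of how I would pull them out
(field order as in the standard 2.2/2.4 layout):

    #!/usr/bin/env python
    # Sketch: dump the RX error counters per interface from
    # /proc/net/dev ("fifo" is what ifconfig reports as overruns).

    def rx_errors():
        stats = {}
        with open("/proc/net/dev") as f:
            for line in f.readlines()[2:]:     # skip the two header lines
                name, data = line.split(":", 1)
                fields = data.split()
                # RX columns: bytes packets errs drop fifo frame ...
                stats[name.strip()] = {
                    "errs":  int(fields[2]),
                    "drop":  int(fields[3]),
                    "fifo":  int(fields[4]),
                    "frame": int(fields[5]),
                }
        return stats

    if __name__ == "__main__":
        for dev, counters in rx_errors().items():
            print(dev, counters)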
o What I miss a bit in your results is the memory consumption
during and after the tests, the CPU load, and the 2.4.x memory
balance information from the proc-fs. I am aware of the figures
you give in Table 5 and Table 6. I can only make some
suggestions based on Figure 4 ``Single CPU Linux 2.2.17, LVS-DR
Throughput'', which is indeed a very interesting graph, at
least for marketing purposes :). Interesting for me is the
comparison between "Direct" and "DR, 1 RS" after the 4000 peak
( --|-- vs. --x-- ). If we follow the lines up to 14000
requests/s we see a steeper curve for DR than for the direct
connection. My assumption: our heavy use of kmalloc for the
multiple tables (128 bytes per entry) and the advanced state
transition checks (I hope you disabled secure_tcp for your
tests). These are the additional kernel control paths we have
to walk for the LVS code, plus some additional if's in the
masq code and, I think, one in the routing decision :)
According to this observation, your statement on page 10,
second paragraph, "LVS-DR with a single real server appears to
be on par ..." is only true up to the magic 4000 limit. Beyond
that there is a widening gap between the scalability of the
direct connection and that of DR over LVS. I assume that by
taking two measured (x, y) pairs and solving the y = a*x^b
equation for the parameters a and b we might get the O(f(x))
relation for the overhead introduced by the LVS code. Your
mileage may vary. Wensong, Julian?
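The algebra is trivial: with two points (x1, y1) and (x2, y2)
you get b = log(y2/y1) / log(x2/x1) and a = y1 / x1^b. A tiny
sketch, with made-up numbers purely to show the mechanics, not
values taken from your tables:

    #!/usr/bin/env python
    # Sketch: fit y = a * x**b through two measured points to estimate
    # how the LVS-DR overhead grows. The sample points are invented,
    # only to demonstrate the calculation.
    from math import log

    def power_fit(p1, p2):
        (x1, y1), (x2, y2) = p1, p2
        b = log(y2 / y1) / log(x2 / x1)
        a = y1 / x1 ** b
        return a, b

    if __name__ == "__main__":
        a, b = power_fit((4000.0, 3900.0), (14000.0, 9000.0))  # invented data
        print("y = %.4g * x^%.3f" % (a, b))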
o Concerning Figure 5 ``SMP scaling on Linux 2.2.17'', I was
wondering whether we would see the same linear improvement over
UP with the following test: 1 CPU @ 1 RS versus 4 CPUs @ 1 RS.
o On page 14 you talk about the theoretical limit of 16000
conns/s, but you give no context as to what causes this
limitation. Can you enlighten me, please?
o Table 4 "Aggregate Network I/O": I'm afraid the last two
values are not correct, since in the LVS-DR setup the replies
don't go through the load balancer. Only the SYN, the ACK and
the FIN or FIN/ACK pass through, but no reply payload reaches
the load balancer, and I doubt you get 12.5 Mbit/s of I/O from
simple TCP control packets alone. Please check the numbers
again and report back; if I'm wrong, I stand corrected.
o Now, a really astonishing fact is that you obviously found
out that the do_aic7xxx_isr() function is being called for
every packet received. Well, I don't yet have full in-depth
knowledge of the 2.4 kernel internals, but I know that the
aic7xxx.c driver is currently being changed frequently by
Gibbs, and if you read the lkml you'll see a new big patch for
this super-broken driver nearly every week. In your place I
would have put in cheap IDE disks, because disk I/O was not an
important factor in this test.
o OK, I'm slowly coming to an end :). Page 19, chapter 4.5,
"LVS versus a hardware load balancer", is for me at least a
very interesting and important chapter. I've been performing
hundreds of tests and different setups with load balancers,
from Cisco over BigIP and ServerIron to ACEdirector[2|3|180],
and I've collected various statements from their labs.
Unfortunately I have a very limited budget and therefore a
small 100 Mbit test lab, but nevertheless I could see the same
tendency in cost effectiveness as stated in your article. I
have some additional input from my experience with commercial
load balancers. Now, where should I start? Well, let's start
with a completely unrelated, stupid and non-rational (but
sometimes true) statement of mine: "Commercial load balancers
suck! They are expensive, inflexible, limited, their support is
not technical enough, and they are buggy and insecure." Pick
one of the listed points, compare LVS with a commercial product
on it, and try to weigh the impact on your deployment of a
failure in that area with a commercial load balancer. Yes, I
know it is true that they may load balance faster and consume
less memory and CPU, but:
- how come I use my $16000 ACEdirector2 as a stupid switch for
my LVS net? Simple: lack of thorough technical support for my
setup. Yes, I wanted full redundancy with a RealAudio sticky
stream setup for bigbrother.ch, and yes, I wanted to load
balance more than 16 real servers without buying a stackable
ACE unit for another $7000, and yes, I wanted security and not
a buggy packet filtering implementation built on an even
buggier TCP/IP stack implementation.
- why, for heaven's sake, does the Cisco LocalDirector start
trying to pass 9% of the NAT'd packets through directly,
without NATing them, when the load rises under a SYN flood
attack? Yeah, it's state transition table corruption, and
people neither care about nor notice it, because there are
still too few people looking at a tcpdump or at firewall/IDS
logfiles to check and verify the correct packet handling of the
closed black box "commercial load balancer". But the manager
sees the nice graphical chart saying "... up to 1'000'000
concurrent connections" or "can load balance firewalls :)", and
this convinces him to buy a Gigabit load balancer for his E10k
Sun cluster for his very important e-commerce project with a
total peak throughput of about 15 Mbit/s. Your article helps
open people's eyes to true cost effectiveness and minimal
operating time. So, now let's get back to technical stuff and
fewer OT threads ... (BTW, this is my sole opinion and not my
company's.)
- could anybody tell me how to perform an end-to-end health
check for cluster availability in a 3-tier Java architecture
with a commercial load balancer? You can do every kind of check
defined up to OSI layer 5, but what if I have to probe my
database with specific input to be sure the service is still
available? To add this very important feature for data centers
and SLAs you have to multiply the cost of a commercial load
balancer by 2-3. (With LVS you simply script it yourself; see
the sketch after this list.)
- how about technical help? I intend to optimize database load
balancing, which I do with the same load balancer box as the
frontend webserver load balancing. Now, I change the TCP state
transitions to address the different behaviour of databases in
TCP/IP networks. I tell this to my local ServerIron vendor and
ask him for the equivalent setup, because with his setup the
mSQL db dies, and rather than fixing the db I'd like to adjust
some TCP parameters. No response, just yet another IOS patch ...
- SECURITY! Every commercial load balancer I tested some time
ago had serious problems with forged TCP or UDP packets. I
recall the time when you could reboot a load balancer with
nmap. With LVS you have a complete, working packet filter and
more or less effective DoS defense strategies built in.
- redundancy: now this is one of the still-weak points of the
LVS load balancer, which you haven't explicitly addressed and
also haven't factored into the total cost of a load balancer
setup. If redundancy is a criterion when deciding which load
balancer to deploy, one has to consider that cost effectiveness
rises even further for the LVS approach, because for every
commercial HA setup you have to buy either a second load
balancer at the same price or additional software. For an HA
LVS cluster you only buy the hardware for the second load
balancer node, download the HA package, and off you go.
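Regarding the end-to-end health check mentioned above: with LVS
the checking lives outside the balancer, so you can script
whatever test you need and let your monitor (ldirectord, mon,
or whatever you use) remove a real server when it fails. A
rough sketch, assuming a hypothetical /dbcheck URL on the
application tier that exercises the database and answers "OK":

    #!/usr/bin/env python
    # Sketch: end-to-end health check for a 3-tier setup. Fetch a page
    # that must hit the database; exit 0 if healthy, non-zero to signal
    # "take this real server out of the pool". The /dbcheck path and
    # the "OK" marker are assumptions for illustration only.
    import sys
    import http.client

    def healthy(host, port=80, path="/dbcheck", timeout=5):
        try:
            conn = http.client.HTTPConnection(host, port, timeout=timeout)
            conn.request("GET", path)
            resp = conn.getresponse()
            body = resp.read(512).decode("ascii", "replace")
            return resp.status == 200 and "OK" in body
        except Exception:
            return False

    if __name__ == "__main__":
        host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
        sys.exit(0 if healthy(host) else 1)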
o Last but not least (I will shut my mouth after this
paragraph): a little tip for the FD_SET problem in the 2.2.x
kernel: use SIGIO, drop the select system call, and work with
signals :) Your problems are gone. Read the OLS paper Andi
Kleen wrote about queued SIGIO last year, or ask Julian
Anastasov :). On page 23 you give 2 reasons/problems you
experienced using the httperf tool. I suggest that tuning the
TIME_WAIT and CLOSE_WAIT handling should improve the number of
concurrent connections, since the fds can be closed faster.
Glancing at your e1000_kcompat.h patch, I assume you loaded the
driver as a module? If so, why?
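Just to illustrate the SIGIO idea (in Python here for brevity;
the real thing in a load generator would of course be C): arm
O_ASYNC on the socket, let the kernel deliver SIGIO when it
becomes readable, and you never touch select()/FD_SET at all.

    #!/usr/bin/env python
    # Sketch: accept connections driven by SIGIO instead of select().
    import fcntl, os, signal, socket

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", 8080))   # port chosen arbitrarily
    listener.listen(128)
    listener.setblocking(False)

    # Route SIGIO for this descriptor to us and enable async notification.
    fcntl.fcntl(listener.fileno(), fcntl.F_SETOWN, os.getpid())
    flags = fcntl.fcntl(listener.fileno(), fcntl.F_GETFL)
    fcntl.fcntl(listener.fileno(), fcntl.F_SETFL, flags | os.O_ASYNC)

    def on_sigio(signum, frame):
        # Drain everything that is ready; no FD_SET size limit involved.
        while True:
            try:
                conn, peer = listener.accept()
            except BlockingIOError:
                break
            conn.close()   # a real program would hand the socket off here

    signal.signal(signal.SIGIO, on_sigio)

    while True:
        signal.pause()     # sleep until the kernel delivers SIGIO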
Finally, I have to say again that this is a very, very good
report, reflecting the real numbers and problems with LVS, and
it shows that very talented and knowledgeable guys have been
performing these tests. Thank you for doing it. Honestly, I was
a little bit angry that I didn't get a previous draft of this
test, because having worked for and with the LVS project for
over two years, I'm sure some of the points here could have had
an impact on your material. I hope I could give you a useful
summary from yet another perspective. Keep up the good work;
I'm looking forward to seeing more LVS tests from you. I'm
happy that the LVS project has made it from more or less
academic proof-of-concept code to a load balancer fully
comparable to commercial solutions.
Best regards, the crazy guy,
Roberto Nibali, ratz