I had a network expert over this morning to discuss our problems. We
arrived at this possible explanation:
1) LVS/DR is an exotic technique which somewhat bends the network
protocols involved. In itself this does not have to be a problem, but it
is very likely that we are executing kernel logic which has been
used/tested much less than, for example, the logic executed when
receiving NAT packets.
2) The director's ipvs connection table gives a very poor picture of the
actual connection state when using LVS/DR, since the director never sees
the return traffic.
Our "conclusion" at this stage is that our kernels are leaking some
fundamental communication (tcp/ip?) primitives due to the "exotic" logic
executed when receiving LVS/DR packets. When these primitives run short,
processes start to freeze, waiting for the kernel primitives to be freed
(which does not happen). Unfortunately we have not seen or identified
these primitives. (I have not seen atoms either, but I'm sure they
We will most likely look into switching to LVS/NAT, which adheres better
to network protocols and whose kernel logic may be less exotic. Whether
it is more widely used or better tested, I don't know. What do you think? It would
also feel better having the director see the return traffic and being
able to track closing connections.
A question. Is it possible to use DR and NAT at the same time in the
same system? We will have to test LVS/NAT and I'd really like to do it
on our production platform (well, we only have one LVS, and setting up
What do you think of our conclusions and our planned route?
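To make the DR/NAT question concrete, this is roughly what I would like to try. It is only a sketch: the addresses are made-up placeholders, and it assumes ipvsadm accepts a different forwarding method per real server within one virtual service (-g for direct routing, -m for NAT/masquerading). The NAT real server would presumably also need its default route pointed at the director so that replies pass back through it; that part is not shown. The script only prints the commands so they can be reviewed before anything touches the production box.

```shell
#!/bin/sh
# Hypothetical addresses (documentation range); substitute the real ones.
VIP=192.0.2.10          # virtual service address on the director
RS_DR=192.0.2.21        # real server left on direct routing
RS_NAT=192.0.2.22       # real server used for the NAT test

# Print the ipvsadm commands instead of running them, so the plan can be
# checked first. To apply, pipe the output to "sh" as root on the director.
lvs_mixed_cmds() {
    echo "ipvsadm -A -t $VIP:80 -s rr"              # add virtual service, round robin
    echo "ipvsadm -a -t $VIP:80 -r $RS_DR:80 -g"    # -g: direct routing (gatewaying)
    echo "ipvsadm -a -t $VIP:80 -r $RS_NAT:80 -m"   # -m: NAT (masquerading)
}

lvs_mixed_cmds
```

If this works as assumed, the director should see both directions of the connections going to the NAT real server, which would let us compare its connection table entries against those for the DR server.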