I have 2 directors and 2 real servers (with more than 4 real servers planned for
the future), running heartbeat 2.0.7 + LVS (built into the 126.96.36.199 kernel)
+ ldirectord, all compiled from the latest sources since packages aren't
available for Oracle Enterprise Linux.
I tried NAT/DR/TUN and got them all working without problems (the biggest
problem was getting heartbeat 2.x set up). I decided to use tunneling, as it
seems to be very fast and the real servers don't have to be on the same network.
I'm using firewall marks and persistence at the same time. Everything seems to
be working right (connections really are persistent), except that
ipvsadm -l -c shows weird data.
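For reference, the fwmark + persistence + tunneling setup I'm describing looks
roughly like this (the VIP, real server addresses, mark value, and weights are
placeholders, not my real config):

```shell
# Mark traffic to the VIP on ports 80 and 443 with fwmark 1
iptables -t mangle -A PREROUTING -d 192.0.2.10/32 -p tcp \
         -m multiport --dports 80,443 -j MARK --set-mark 1

# Create a persistent virtual service keyed on the fwmark (600 s persistence)
ipvsadm -A -f 1 -s wlc -p 600

# Add the real servers via IP tunneling (-i)
ipvsadm -a -f 1 -r 203.0.113.21 -i -w 1
ipvsadm -a -f 1 -r 203.0.113.22 -i -w 1
```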
When I run that, I get some connections in the ERR! state. Persistence is
600 s = 10 minutes; after that these connections disappear. Without persistence
there are no such connections, and if I don't use firewall marks they aren't
there either. If I don't use firewall marks, there are instead "NONE"
connections, which from what I have read are what LVS uses to handle
persistence. My ERR! connections resemble these in that sense: after they
disappear, the client can be routed to a different real server.
Could anyone confirm that in this case the ERR! state is harmless? I'm thinking
it might happen because firewall-mark support was added to LVS later and ipvsadm
wasn't updated to handle it properly, or because, when firewall marks and
persistence are combined, somebody forgot to set the connection state to "NONE"
in the C code.
Another issue is master-backup synchronization. Both directors are running in
master-slave mode at the same time (in heartbeat terms they are set up as a
symmetric cluster with resource stickiness). When I click refresh in Firefox
several times while viewing the load-balanced page, I get a FIN_WAIT connection
for every refresh. So I set the tcpfin timeout with ipvsadm to 15 seconds to get
rid of them quickly (is this OK, by the way? It was about 2 minutes before,
which I think is way too long). What is worse, I get an "established" connection
on the slave for every refresh. I have read this is due to a simplification in
the synchronization code. Unfortunately, these "established" connections on the
slave have a much longer timeout (2 minutes). I'm worried that this could cause
the slave director to go down faster than the master during a DoS attack: the
master would expire its FIN_WAITs quite fast, but the slave wouldn't. I have put
all 3 defence strategies into auto mode just in case.
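In case it helps, the timeout change I made was along these lines (15 is the
tcpfin value I mentioned; a 0 leaves that timeout unchanged):

```shell
# ipvsadm --set <tcp> <tcpfin> <udp>; 0 means "leave unchanged"
ipvsadm --set 0 15 0

# Verify the current timeouts
ipvsadm -l --timeout
```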
I'm using a hash table size of 2^20 (which doesn't limit the maximum number of
entries in it; it just sets the number of buckets, and each bucket holds a
linked list). Doesn't that cause some slowdown in LVS? I hope the code doesn't
iterate over all entries in the table.
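To illustrate what I mean: a chained hash table only walks the one bucket the
key hashes to, so with 2^20 buckets the average chain stays short even with many
entries. A toy sketch in Python (an assumption about the general technique, not
the actual LVS C code):

```python
# Toy chained hash table: 2^20 buckets, each holding a list of entries.
# A lookup hashes the key to one bucket and walks only that chain,
# never the whole table.

HASH_BITS = 20
NBUCKETS = 1 << HASH_BITS          # 2^20 = 1,048,576 buckets

def bucket_index(key):
    # Mask the hash down to the bucket count (a power of two)
    return hash(key) & (NBUCKETS - 1)

class ConnTable:
    def __init__(self):
        self.buckets = [[] for _ in range(NBUCKETS)]

    def insert(self, key, value):
        self.buckets[bucket_index(key)].append((key, value))

    def lookup(self, key):
        # Only this one chain is scanned
        chain = self.buckets[bucket_index(key)]
        for k, v in chain:
            if k == key:
                return v
        return None

table = ConnTable()
for port in range(10000, 12000):            # 2000 "connections"
    table.insert(("198.51.100.7", port), "realserver-1")

print(table.lookup(("198.51.100.7", 10042)))   # scans one short chain
```

With 2000 entries spread over 2^20 buckets, almost every chain has length 0
or 1, so lookups stay effectively constant-time.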