firewall marks + tunneling + persistence = ERR! state

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: firewall marks + tunneling + persistence = ERR! state
From: Jaroslav Libák <jarol1@xxxxxxxxx>
Date: Tue, 28 Nov 2006 21:32:09 +0100 (CET)

I have 2 directors and 2 real servers (with more than 4 real servers planned for the 
future), running heartbeat 2.0.7 + LVS (built into the kernel) + ldirectord, all 
compiled from the latest sources, as packages aren't available for Oracle Enterprise 
Linux. I tried NAT/DR/TUN and got them all working without problems (the biggest 
problem was getting heartbeat 2.x set up). I decided to use tunneling, as it seems to 
be very fast and the real servers don't have to be on the same network. I'm using 
firewall marks and persistence at the same time. Everything seems to be working 
correctly (connections really are persistent), except that ipvsadm -l -c shows weird 
data.
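For reference, the setup looks roughly like this — a minimal sketch; the VIP, real 
server addresses, mark value, and ports below are placeholders, not my actual config:

```shell
# Mark HTTP/HTTPS traffic to the VIP with fwmark 1 (mangle table)
iptables -t mangle -A PREROUTING -d 10.0.0.100/32 -p tcp \
         -m multiport --dports 80,443 -j MARK --set-mark 1

# One persistent virtual service keyed on the mark, 600 s persistence
ipvsadm -A -f 1 -s wlc -p 600

# Real servers reached via IPIP tunneling (-i)
ipvsadm -a -f 1 -r 10.0.1.1 -i -w 1
ipvsadm -a -f 1 -r 10.0.1.2 -i -w 1
```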

When I run that I get some connections in the ERR! state. Persistence is 600 seconds 
= 10 minutes; after that these connections disappear. Without persistence there are 
no such connections. If I don't use firewall marks they aren't there either; instead 
there are "NONE" connections, which from what I have read LVS uses as templates to 
handle persistence. My ERR! connections resemble those "NONE" entries in this sense: 
after they disappear, the client can be routed to a different real server.
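This is the kind of output I mean — the entries below are illustrative only, not 
captured from my system, and the fwmark case replaces the NONE line with an ERR! one:

```shell
# Show the connection table, numeric addresses
ipvsadm -L -c -n
#  pro expire state     source            virtual        destination
#  IP  09:45  NONE      192.168.0.50:0    10.0.0.100:80  10.0.1.1:80
#  TCP 00:12  FIN_WAIT  192.168.0.50:3456 10.0.0.100:80  10.0.1.1:80
```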

Could anyone confirm that in this case the ERR! state is harmless? I suspect it 
happens because firewall-mark support was added to LVS later and ipvsadm wasn't 
updated to handle it properly, or because when firewall marks are combined with 
persistence, somebody forgot to set the connection state to "NONE" in the C code.

Another issue is master-backup synchronization. Both directors are running in 
master-slave mode at the same time (in heartbeat terms they are set up as a 
symmetric cluster with resource stickiness). When I click refresh in Firefox 
several times while viewing a load-balanced page, I get a FIN_WAIT connection for 
every refresh. So I set the tcpfin timeout with ipvsadm to 15 seconds to get rid of 
them quickly (is this OK, by the way? It was about 2 minutes before, which I think 
is way too long). What is worse, I get an "established" connection on the slave for 
every refresh. I have read this is due to a simplification in the synchronization 
code. Unfortunately, these "established" connections on the slave have a much longer 
timeout (2 minutes). I'm worried that this could cause the slave director to crash 
before the master during a DoS attack: the master would remove FIN_WAITs quite fast 
but the slave wouldn't. I have put all 3 defence strategies into auto mode just in 
case.
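Concretely, what I did was along these lines — a sketch; as I read the ipvsadm man 
page, a value of 0 in --set leaves that timeout unchanged, and writing 1 to these 
sysctls selects the automatic mode:

```shell
# --set <tcp> <tcpfin> <udp>; 0 = leave that timeout unchanged
ipvsadm --set 0 15 0

# The three defence strategies in auto mode (1 = triggered
# automatically under memory pressure, if I read the docs right)
echo 1 > /proc/sys/net/ipv4/vs/drop_entry
echo 1 > /proc/sys/net/ipv4/vs/drop_packet
echo 1 > /proc/sys/net/ipv4/vs/secure_tcp
```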

I'm using a hash table size of 2^20 (which doesn't limit the maximum number of 
entries in it; it just sets the number of buckets, and each bucket holds a linked 
list). Doesn't a table that large cause some slowdown in LVS? I hope the code 
doesn't iterate over all entries in the table.
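For what it's worth, the table size is a build-time option, and a lookup should only 
walk the chain in one hashed bucket, never the whole table — assuming the standard 
IPVS Kconfig name:

```shell
# Kernel build-time option (IPVS Kconfig):
#   CONFIG_IP_VS_TAB_BITS=20
# Number of hash buckets that gives:
echo $((1 << 20))    # -> 1048576
```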
