Hi folks,
I have the fortunate opportunity to have two IBM servers running Debian
etch on which I get to set up an LVS system to experiment with settings
and such. We run a heartbeat/ldirectord setup on some live servers right
now (set up before my time) and we have experienced some oddities and
issues along the way, so this is a chance to play with them in a
controlled environment, finally.
I will try to make this as detailed as possible without being overly
verbose, so if I leave out any important details, just ask and I will
provide any extra info. I did this configuration based on an online
tutorial and it's mostly working great, with the exception of what I
note in the subject line, on which I will elaborate shortly.
Firstly, the directors, of which there are two:
192.168.1.210 ld1
192.168.1.211 ld2
Secondly, the real servers (http is the only service in the experiment
to keep things simple):
192.168.1.100 web1
192.168.1.200 web2
Finally, the virtual IP being used is 192.168.1.240, and both web1 and
web2 have this virtual IP configured in /etc/network/interfaces as lo:0.
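For completeness, here is roughly what that looks like on the real servers. The stanza is a sketch of the usual LVS-DR arrangement rather than a verbatim copy from our boxes, and the arp_ignore/arp_announce sysctls are the standard recommendation for stopping real servers from answering ARP for the VIP (if these are missing, clients can end up talking to a real server directly, which would look a lot like the symptom below):

```shell
# Illustrative /etc/network/interfaces fragment on web1/web2:
#   auto lo:0
#   iface lo:0 inet static
#       address 192.168.1.240
#       netmask 255.255.255.255
#
# Standard LVS-DR ARP suppression so only the director answers ARP
# for the VIP (run as root, or persist the values in /etc/sysctl.conf):
sysctl -w net.ipv4.conf.lo.arp_ignore=1
sysctl -w net.ipv4.conf.lo.arp_announce=2
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2
```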
Here's the contents of ha.cf on ld1 and ld2:
--- begin ha.cf ---
logfacility local0
bcast eth0
mcast eth0 225.0.0.1 694 1 0
auto_failback off
node ld1
node ld2
respawn hacluster /usr/lib/heartbeat/ipfail
apiauth ipfail gid=haclient uid=hacluster
--- end ha.cf ---
And authkeys is awfully straightforward, but here it is:
--- begin authkeys ---
auth 3
3 md5 SecretPassword
--- end authkeys ---
And that leaves us with ldirectord.cf:
--- begin ldirectord.cf ---
checktimeout=10
checkinterval=2
autoreload=yes
logfile="/var/log/ldirectord.log"
quiescent=no
virtual=192.168.1.240:80
        real=192.168.1.200:80 gate
        real=192.168.1.100:80 gate
        fallback=127.0.0.1:80 gate
        service=http
        request="ldirector.html"
        receive="OK"
        scheduler=nq
        protocol=tcp
        checktype=negotiate
--- end ldirectord.cf ---
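For anyone following along at home, the resulting table can be inspected on the active director with ipvsadm (these are stock ipvsadm invocations, nothing site-specific):

```shell
# Numeric listing of the virtual service, its real servers, weights,
# and active/inactive connection counts:
ipvsadm -L -n

# Per-connection listing: shows which real server each client address
# is currently mapped to, with expiry timers:
ipvsadm -L -n -c
```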
On each of the web servers, the webroot contains two files.
ldirector.html, which simply contains "OK", and index.html, which states
the name of the server (either "Welcome to web1" or "Welcome to web2")
purely for testing purposes, of course.
So, once I get things going correctly, with ld1 running as master and
ld2 as slave, everything works as expected. If I change ldirector.html
to print "NO" instead of "OK", the server in question is almost
immediately dropped from the ipvsadm output.
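In case the check itself is part of the problem: as I understand it, the negotiate check simply fetches the request= page from each real server and compares the body against receive=, which can be reproduced by hand against the real-server addresses from above:

```shell
# Reproduce ldirectord's negotiate check manually.
# A healthy real server should return the literal string "OK".
curl -s http://192.168.1.100/ldirector.html
curl -s http://192.168.1.200/ldirector.html
```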
Now for the problem (sorry if this has taken so long). I am using a
Windows laptop we have kicking around here as the client to test the
setup. I point my clunky IE browser at 192.168.1.240 and, as expected,
I get
"Welcome to web1". So, I go on to web1 and modify ldirector.html to say
"NO" and then check ipvsadm on ld1 and, sure enough, that machine has
now been dropped from the list of real servers. I refresh the page, but
I still get web1! Testing from other computers in the office gives
different results, though. For instance, the fellow next to me on an
Ubuntu workstation gets web2 (the correct result), while the laptop,
for whatever reason, continues being sent to web1, despite the fact
that it has been pronounced dead by ldirectord. I have made sure
it's not a cache issue on the laptop by clearing all the internet cache
and shift-refreshing and, in desperation, I even rebooted windows in
case it was some network silliness. No luck, though.
If I shut down ld1 to force ld2 to take over, sometimes the client will
then get sent to web2, but other times it does not, and there doesn't
seem to be any consistency (that I can find) in the behavior.
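One thing I intend to rule out (mentioned here in case it rings a bell): IPVS keeps per-connection entries, and as far as I know, dropping a real server from the table does not tear down connections already tracked, so a client holding a keep-alive connection could keep landing on the "dead" server until the entry expires. The address below is a placeholder for the laptop's IP, not a real host on our network:

```shell
# Watch the IPVS connection table for entries from the test client
# (192.168.1.x is a placeholder -- substitute the laptop's address):
watch -n 1 'ipvsadm -L -n -c | grep 192.168.1.x'

# Clearing the whole virtual server table forces re-scheduling, but it
# is disruptive (drops every tracked connection; ldirectord will then
# repopulate the services on its next check):
ipvsadm -C
```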
As far as scheduling is concerned, I have tried a number of different
schedulers (nq, sh, rr, wrr) and the results have not been consistently
correct with any of them. Furthermore, the Windows laptop isn't the
only client showing the oddities; another Ubuntu station running
Firefox was also being delivered to the incorrect web server for a
period.
Any suggestions on how I can fix or at least track down this issue would
be greatly appreciated.
Justin