Hi Leon
I think I've found a fix for this problem for my configuration.
Like you, I couldn't reproduce the problem on every real server: taking the
services off-line on two of my three real servers caused ldirectord to fail,
but off-lining the third caused no problems. The three are virtually
identical in hardware and configuration, apart from slight differences in CPU
speed.
>
> I also noticed that the problem seems to occur with only 1 server! Very
> strange as this server is also on a VMWare box and actually is an exact
> clone of the other Citrix terminal server (which ldirectord can check
> without a problem).
>
I think the issue was fixed in ldirectord v1.77.2.39: there's a reference
to a "race condition in the connect and sip checks". A race condition is a
timing issue, which could explain why it shows up with some real servers
and not others.
I tried 4 versions of ldirectord with the following results:

1.62.2.6 ( from my old directors ):
    Works fine

1.77.2.32 ( from the heartbeat-ldirectord-1.2.3.cvs.20050927-1.rh.el.um.1
RPM downloaded from the Ultramonkey 3 download page ):
    Fails with the "Alarm" message when services go off-line on some servers

1.77.2.41 ( latest release from CVS ):
    Seems to fix the Alarm problem but when run from heartbeat it didn't
    produce any log output in /var/log/ldirectord ( I didn't spend much time
    investigating but it was the same on both directors )

1.77.2.39 ( from CVS, the version that fixes the "race condition" problem ):
    Seems to fix the Alarm problem and also produces log output as expected.

I've therefore left my setup running with ldirectord 1.77.2.39 ( just
copied the ldirectord file into /usr/sbin for now ) so I can monitor this
over the next few days. I've also written a simple script I'm running
from cron on both directors every 5 minutes that will email me if the
"ipvs_syncmaster" process is running and /etc/ha.d/resource.d/ldirectord
is not - that way at least I get to know about the problem before it bites
me!
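
A check like the one described above could be sketched as follows. This is
only an illustration in Python, not the actual script: the function names,
the use of `ps -eo args` to list processes, mail delivery through an SMTP
server on localhost, and the recipient address passed as a command-line
argument are all assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical cron watchdog: warn if ipvs_syncmaster is present but ldirectord is not."""
import smtplib
import socket
import subprocess
import sys
from email.message import EmailMessage

# Strings searched for in the process list (both taken from the description above).
SYNC_NAME = "ipvs_syncmaster"
LDIRECTORD_PATH = "/etc/ha.d/resource.d/ldirectord"


def needs_alert(command_lines):
    """Return True if the sync master appears in the list but ldirectord does not."""
    sync_running = any(SYNC_NAME in line for line in command_lines)
    ldirectord_running = any(LDIRECTORD_PATH in line for line in command_lines)
    return sync_running and not ldirectord_running


def list_command_lines():
    """Return the command line of every process, one string per process.

    Assumption: `ps -eo args` is available and shows ldirectord with its full path.
    """
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def send_alert(recipient):
    """Mail a short warning via the local SMTP server (assumed to listen on localhost)."""
    host = socket.gethostname()
    msg = EmailMessage()
    msg["Subject"] = f"ldirectord is not running on {host}"
    msg["From"] = f"root@{host}"
    msg["To"] = recipient
    msg.set_content(
        f"{SYNC_NAME} is running on {host} but {LDIRECTORD_PATH} is not."
    )
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


# Only act when called with exactly one argument: the address to notify.
if __name__ == "__main__" and len(sys.argv) == 2:
    if needs_alert(list_command_lines()):
        send_alert(sys.argv[1])
```

Saved under a hypothetical path such as /usr/local/bin/check_ldirectord.py,
it could be run every 5 minutes with a crontab entry like
`*/5 * * * * /usr/local/bin/check_ldirectord.py admin@example.com`.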
Hope this is helpful to you.
Regards,
Peter.