On Wed, 2008-05-28 at 07:27 -0600, Michael S. Moody wrote:
> This happened again today, dead servers were not being removed. I had to
> stop heartbeat, and allow the resources to transfer to the second load
> balancer. Something is seriously wrong, but I don't know what it is. It
> doesn't seem to happen on the second load balancer.
Your strace (which I'll trim, and which is missing timestamps - next
time, if you can, please use the "-tt" switch to get microsecond
timing) shows the following:
First a TCP stream socket is created, and it is handed file descriptor
22:
> socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 22
> ioctl(22, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffd25ea8c0) = -1 EINVAL (Invalid
> argument)
> lseek(22, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
> ioctl(22, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffd25ea8c0) = -1 EINVAL (Invalid
> argument)
> lseek(22, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
The arg/seek errors there are fine, so ignore them: the terminal ioctl
and the lseek are just probes, and both will always fail on a socket.
Now it sets/gets/sets flags:
> fcntl(22, F_SETFD, FD_CLOEXEC) = 0
> fcntl(22, F_GETFL) = 0x2 (flags O_RDWR)
> fcntl(22, F_SETFL, O_RDWR|O_NONBLOCK) = 0
...and now we connect to your realserver:
> connect(22, {sa_family=AF_INET, sin_port=htons(21),
> sin_addr=inet_addr("192.168.1.195")}, 16) = -1 EINPROGRESS (Operation now in
> progress)
...and here select() checks whether FD 22 is writable, which for a
non-blocking connect means the connection attempt has finished:
> select(24, NULL, [22], NULL, {0, 0}) = 1 (out [22], left {0, 0})
...and the second connect() confirms it is now connected, so we get the
flags, switch the socket back to blocking mode, and wait to read from
it:
> connect(22, {sa_family=AF_INET, sin_port=htons(21),
> sin_addr=inet_addr("192.168.1.195")}, 16) = 0
> fcntl(22, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
> fcntl(22, F_SETFL, O_RDWR) = 0
The wait for something to read from the realserver times out:
> select(24, [22], NULL, NULL, {0, 0}) = 0 (Timeout)
...and FD 22 - the FTP connection - is closed.
> close(22) = 0
Rinse, repeat, etc.
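
If it helps to reproduce that sequence by hand, here is a minimal Python
sketch of the same steps (non-blocking connect, select for writable,
select for readable, close). It is not ldirectord's code: the function
name check_banner, the status strings and the timeout argument are all
my own inventions for illustration. It also reports how long the check
took, which is the figure the strace is missing.

```python
import errno
import select
import socket
import time


def check_banner(host, port, timeout):
    """Connect, then wait for a greeting; return (status, seconds_elapsed).

    status is "connect-failed", "no-banner" or "ok" (hypothetical labels).
    """
    start = time.monotonic()
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setblocking(False)
        # A non-blocking connect normally reports EINPROGRESS straight away.
        err = sock.connect_ex((host, port))
        if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            return "connect-failed", time.monotonic() - start

        # Writable means the connect attempt has finished; SO_ERROR says how.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable or sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR):
            return "connect-failed", time.monotonic() - start

        # Readable means the daemon has sent something (or hung up).
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            return "no-banner", time.monotonic() - start
        try:
            data = sock.recv(512)
        except OSError:
            data = b""
        return ("ok" if data else "no-banner"), time.monotonic() - start
    finally:
        sock.close()
```

Something like check_banner("192.168.1.195", 21, 5.0) run from the load
balancer would show whether the banner turns up at all within five
seconds (a timeout I picked arbitrarily) and how long it takes.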
The lack of timestamps is a bit of a blocker here, as there's no way to
discern how long ldirectord is waiting before the timeouts occur.
I'll suggest one thing, however: does the affected realserver have the
exact same hosts file (with obvious differences if that isn't a complete
oxymoron) and resolver configuration as the working one?
It strikes me that the connection is timing out because the FTP daemon
or xinetd, or some other wrapper, is trying to do a reverse DNS lookup
of the calling IP and that's the part causing the timeout - if the
daemon has to wait for a lookup to complete before returning the banner,
perhaps ldirectord's timeout is less than that so it gives up and moves
on?
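
A quick way to test that theory is to time a reverse lookup of the load
balancer's IP from the realserver itself. The sketch below is only an
illustration: time_reverse_lookup is a name I made up, and the resolver
argument exists purely so the function can be tried without touching
DNS. By default it uses the system resolver via socket.gethostbyaddr,
so it should honour the hosts file and resolver configuration.

```python
import socket
import time


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname_or_None, seconds_taken) for a reverse lookup of ip."""
    start = time.monotonic()
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        name = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        name = None
    return name, time.monotonic() - start
```

Run it on the realserver with the address ldirectord connects from. If
the lookup takes longer than ldirectord is prepared to wait, or fails
only after a long pause, that would fit the behaviour in the strace.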
I think you've unearthed a config problem in your local setup, but it
could be a bug. Let's go with making sure the realserver knows who
everyone is first.
Graeme