I've encountered a very strange problem when attempting to migrate the LVS
directors for our web cache from old hardware (running a 2.4 kernel) to new
hardware (SLES9, 2.6 kernel, up-to-date with all relevant SuSE RPM
updates) - 32-bit x86 in both cases, with the LVS configuration differing
only in the addition of more real-servers and in which systems run as
directors.
Checking past mail to lvs-users and searching with Google have failed to
find any mention of such a problem. Even if no-one has actually seen the
problem before, I hope someone may be able either to work out what's
happening and how to solve it, or to suggest how to identify the cause.
We are using the LVS DR (Direct Routing) configuration, so that the high
volume of data from the web cache real-servers can go direct to the
clients, bypassing the directors (which might well become a bottleneck if
they had to handle all the cache response data).
The virtual IP addresses and ports on which the (very busy) live services
would run are:
131.111.8.1 port 8080 (TCP) - web cache HTTP (Squid real-server)
131.111.8.1 port 3130 (UDP) - web cache ICP (Squid real-server)
131.111.8.68 port 80 (TCP) - WPAD HTTP (Web Proxy Auto-Discovery, Apache real-server)
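In ipvsadm terms, the director-side setup corresponds roughly to the
following (a sketch only: the real-server address 131.111.8.10 and the wlc
scheduler are illustrative placeholders, and in practice ldirectord issues
the equivalent of these commands for us):

  # define the virtual services and add a real-server to each;
  # -g selects "gatewaying", i.e. Direct Routing
  ipvsadm -A -t 131.111.8.1:8080 -s wlc
  ipvsadm -a -t 131.111.8.1:8080 -r 131.111.8.10:8080 -g
  ipvsadm -A -u 131.111.8.1:3130 -s wlc
  ipvsadm -a -u 131.111.8.1:3130 -r 131.111.8.10:3130 -g
  ipvsadm -A -t 131.111.8.68:80 -s wlc
  ipvsadm -a -t 131.111.8.68:80 -r 131.111.8.10:80 -g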
Those are the actual addresses, and may be significant as the problem
seems to be address-specific. The real-servers have 131.111.8.x canonical
addresses, with the hidden interfaces used as targets for the LVS-managed
services set up as aliases of dummy0.
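On each real-server that amounts to something like the following sketch
(the arp_ignore/arp_announce settings are the usual 2.6 substitute for the
old 2.4 'hidden' sysctl, which is what I assume applies on these kernels):

  # VIPs as host-route aliases of dummy0, so the real-server accepts
  # packets for them without advertising the addresses via ARP
  ifconfig dummy0:0 131.111.8.1 netmask 255.255.255.255 up
  ifconfig dummy0:1 131.111.8.68 netmask 255.255.255.255 up

  # suppress ARP replies/announcements for addresses not on the
  # incoming interface (2.6 mechanism replacing the 'hidden' flag)
  echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
  echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce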
The LVS-related software (inc. kernel/modules) is the set of versions
supplied by SuSE in SLES9 (as modified by subsequent official updates),
with the sole exception that we'd normally use a local version of
ldirectord modified to send pager and email alerts when services/servers
are added or dropped. However, I have just tried the unmodified original
version and it fails in the same way. The heartbeat and
heartbeat-ldirectord RPMs are v1.2.3-2.9, and the ldirectord they include
is v1.77.2.10 (though I don't know whether that includes any SuSE
modifications).
The back-end systems are all capable of running both types of real-server
and would normally have both servers running and handling requests, though
ldirectord should adapt to whatever's available.
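The ldirectord configuration has the obvious shape; a minimal sketch
(real-server addresses hypothetical, and the UDP ICP service omitted for
brevity) would look like:

  # /etc/ha.d/ldirectord.cf (illustrative only, not our actual file)
  checktimeout=10
  checkinterval=5
  autoreload=yes

  # web cache HTTP (Squid real-servers)
  virtual=131.111.8.1:8080
          real=131.111.8.10:8080 gate
          real=131.111.8.11:8080 gate
          service=http
          checktype=connect
          protocol=tcp

  # WPAD HTTP (Apache real-servers)
  virtual=131.111.8.68:80
          real=131.111.8.10:80 gate
          real=131.111.8.11:80 gate
          service=http
          checktype=connect
          protocol=tcp

('gate' requests Direct Routing for each real-server, and ldirectord adds
or removes real-servers from the kernel tables as the checks pass or fail.)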
Tests (using alternative virtual IP addresses) before going live "just
worked", but a major and bizarre problem turned up when trying it with the
live addresses and handling traffic for the entire university (in contrast
to the very low volume of test requests that I'd tried with the test
addresses).
The problem is that while access to the web cache (131.111.8.1) through
the new director "just worked", the WPAD requests almost all got stuck -
shown as SYN_RECV in /proc/net/ip_vs_conn and reported in
/var/log/messages as
... ip_rt_bug: [clientIPhere] -> 131.111.8.68, eth1
and the corresponding packets are never forwarded to a real-server.
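Nothing exotic was needed to see this - the stuck entries and the absence
of forwarded packets show up with the standard tools:

  # stuck connection entries (also visible in /proc/net/ip_vs_conn)
  ipvsadm -Lcn | grep SYN_RECV

  # watch for forwarded SYNs for the WPAD VIP leaving via eth1
  # (for the stuck clients, none ever appear)
  tcpdump -n -i eth1 host 131.111.8.68 and tcp port 80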
I have looked for configuration errors etc., without success, and it seems
bizarre that packets for one service are handled as expected but the other
(with almost identical configuration) fails for virtually all requests.
I'm not sure that a configuration error would be able to cause what
appears to be a "should never happen" problem inside the kernel, anyway!
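As far as I can tell from the 2.6 sources, ip_rt_bug() in net/ipv4/route.c
is installed as the output handler on route-cache entries that are never
supposed to be used for output, so the message suggests the forwarding code
is somehow picking up an unsuitable route. A quick, low-impact probe along
those lines ([clientIPhere] being a real client address, and eth0 my guess
at the interface the client traffic arrives on) would be:

  # ask the kernel which route it would use for the forwarded packet
  ip route get 131.111.8.68 from [clientIPhere] iif eth0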
An additional oddity was that when testing while logged in from home via a
VPDN connection, requests from my home PC (with a DHCP-assigned address,
which shouldn't be treated any differently from other client systems within
the university's network) *did* get passed through to a WPAD real-server,
implying that the blockage is near-total but not 100%.
Any suggestions either for possible explanations/solutions or how to
investigate the true cause of the problem would be gratefully received.
However, since the failure occurs only when the new director is running
live and meant to be handling WPAD requests for the entire university, any
investigation has to be very brief in order to minimise disruption.
John
--
John Line - web & news development, University of Cambridge Computing Service