Bizarre LVS oddity - one VIP handled fine, another gives ip_rt_bug errors

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Bizarre LVS oddity - one VIP handled fine, another gives ip_rt_bug errors
From: John Line <jml4@xxxxxxxxxxxxxx>
Date: Wed, 30 Nov 2005 15:04:19 +0000 (GMT)
I've encountered a very strange problem when attempting to migrate the LVS directors for our web cache from old hardware (using a 2.4 kernel) to new hardware (SLES9 with a 2.6 kernel, up to date with all relevant SuSE RPM updates). Both are 32-bit x86, and the LVS configuration differs only in the addition of more real-servers and in which systems are running as directors.

Checking past mail to lvs-users and searching with Google fails to find any mention of such a problem. Even if no-one's actually seen the problem before, I hope someone may be able to work out either what's happening and how to solve it, or suggest how to identify the cause.

We are using the LVS DR (Direct Routing) configuration, so that the high volume of data from the web cache real-servers can go direct to the clients, bypassing the directors (which might well become a bottleneck if that system had to handle all the cache response data).

The virtual IP addresses and ports on which the (very busy) live services would run are:

  port 8080 (TCP) - web cache HTTP (Squid real-server)
  port 3130 (UDP) - web cache ICP (Squid real-server)
  port 80 (TCP)   - WPAD HTTP (Web Proxy Auto-Discovery, Apache real-server)

Those are the actual addresses, and may be significant as the problem seems to be address-specific. The real-servers have 131.111.8.x canonical addresses, with the hidden interfaces used as targets for the LVS-managed services set up as aliases of dummy0.

The LVS-related software (including the kernel and modules) is as supplied by SuSE in SLES9 (with subsequent official updates), with the sole exception that we would normally use a local version of ldirectord, modified to send pager and email alerts when services or servers are added or dropped. However, I just tried the unmodified original version and it fails in the same way. The heartbeat and heartbeat-ldirectord RPMs are V1.2.3-2.9 and the ldirectord they include is v1.77.2.10 (but I don't know whether that includes any SuSE modifications).

The back-end systems are all capable of running both types of real-server and would normally have both servers running and handling requests, though ldirectord should adapt to whatever's available.

Tests (using alternative virtual IP addresses) before going live "just worked", but a major and bizarre problem turned up when trying it with the live addresses and handling traffic for the entire university (in contrast to the very low volume of test requests that I'd tried with the test addresses).

The problem is that while access to the web cache ( through the new director "just worked", the WPAD requests almost all got stuck - shown as SYN_RECV in /proc/net/ip_vs_conn and reported in /var/log/messages as

... ip_rt_bug: [clientIPhere] ->, eth1

and corresponding packets do not get forwarded to a real-server.
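To put numbers on the stuck connections, a snapshot of /proc/net/ip_vs_conn could be summarised per virtual service. The following is only a minimal sketch in Python: it assumes the usual column layout (Pro, FromIP, FPrt, ToIP, TPrt, DestIP, DPrt, State, Expire) with addresses and ports printed in hexadecimal, and the sample rows use made-up documentation addresses, not our real ones.

```python
import socket
import struct
from collections import Counter


def hex_to_ip(hex_addr):
    """Convert an 8-digit hex IPv4 address (e.g. 'C000020A') to dotted-quad form."""
    return socket.inet_ntoa(struct.pack("!I", int(hex_addr, 16)))


def count_states(lines):
    """Count connection entries keyed by (virtual IP, virtual port, state).

    Assumes each data row has at least 8 whitespace-separated fields:
    Pro FromIP FPrt ToIP TPrt DestIP DPrt State [Expire]
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and anything that is not a TCP/UDP entry.
        if len(fields) < 8 or fields[0] not in ("TCP", "UDP"):
            continue
        vip = hex_to_ip(fields[3])
        vport = int(fields[4], 16)
        counts[(vip, vport, fields[7])] += 1
    return counts


# Hypothetical sample data (addresses from the documentation ranges).
SAMPLE = """\
Pro FromIP   FPrt ToIP     TPrt DestIP   DPrt State       Expire
TCP C6336407 0400 C000020A 0050 C0000215 0050 SYN_RECV         58
TCP C6336408 0401 C000020A 0050 C0000215 0050 SYN_RECV         57
TCP C6336407 0402 C000020A 1F90 C0000215 1F90 ESTABLISHED     899
"""

counts = count_states(SAMPLE.splitlines())
for (vip, vport, state), n in sorted(counts.items()):
    print(f"{vip}:{vport} {state} {n}")
```

On a director, the same function could be given the open file instead of the sample text, e.g. count_states(open("/proc/net/ip_vs_conn")), which would show at a glance whether SYN_RECV entries pile up only for the port 80 service.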

I have looked for configuration errors etc., without success, and it seems bizarre that packets for one service are handled as expected but the other (with almost identical configuration) fails for virtually all requests. I'm not sure that a configuration error would be able to cause what appears to be a "should never happen" problem inside the kernel, anyway!

An additional oddity was that when testing while logged in from home via a VPDN connection, requests from my home PC (which has an address assigned by DHCP and shouldn't be treated any differently from other client systems within the university's network) *did* get passed through to a WPAD real-server, implying that the blockage is near-total but not 100%.

Any suggestions, either for possible explanations/solutions or for how to investigate the true cause of the problem, would be gratefully received. However, since the failures occur only when the new director is running live and meant to be handling WPAD requests for the entire university, any investigation has to be very brief in order to minimise disruption.
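Because any live test has to be short, one option is to save /var/log/messages from the test window and analyse it afterwards. The sketch below (Python, purely illustrative) tallies ip_rt_bug messages by source, destination and interface, which might show whether the failures are tied to particular client addresses. The message layout "ip_rt_bug: SRC -> DST, IFACE" is an assumption based on the log line quoted above, and the sample lines, host name and addresses are invented.

```python
import re
from collections import Counter

# Assumed layout: "ip_rt_bug: <source IP> -> <destination IP>, <interface>"
PATTERN = re.compile(
    r"ip_rt_bug:\s+(\d+\.\d+\.\d+\.\d+)\s+->\s+(\d+\.\d+\.\d+\.\d+),\s+(\S+)"
)


def tally_ip_rt_bug(lines):
    """Count ip_rt_bug messages keyed by (source, destination, interface)."""
    tally = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            tally[match.groups()] += 1
    return tally


# Hypothetical log excerpt (documentation-range addresses, invented host name).
SAMPLE_LOG = [
    "Nov 30 14:00:01 director kernel: ip_rt_bug: 198.51.100.7 -> 192.0.2.10, eth1",
    "Nov 30 14:00:01 director kernel: ip_rt_bug: 198.51.100.7 -> 192.0.2.10, eth1",
    "Nov 30 14:00:02 director kernel: ip_rt_bug: 198.51.100.9 -> 192.0.2.10, eth1",
    "Nov 30 14:00:03 director sshd[1234]: unrelated message",
]

tally = tally_ip_rt_bug(SAMPLE_LOG)
for (src, dst, iface), n in tally.most_common():
    print(f"{n:5d}  {src} -> {dst} via {iface}")
```

Applied to the saved log (for example by passing the open file to tally_ip_rt_bug), the client addresses that never appear in the tally would be candidates for comparison with the ones that do.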

John Line - web & news development, University of Cambridge Computing Service
