Bizarre LVS oddity - one VIP handled find, another gives ip_rt

To:	lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject:	Bizarre LVS oddity - one VIP handled find, another gives ip_rt_bug errors
From:	John Line <jml4@xxxxxxxxxxxxxx>
Date:	Wed, 30 Nov 2005 15:04:19 +0000 (GMT)

I've encountered a very strange problem when attempting to migrate the LVS directors for our web cache from old hardware (using 2.4 kernel) to new hardware (with SLES9, 2.6 kernel, up-to-date with all relevant SuSE RPM updates) - 32-bit x86 in both cases and with LVS configuration differing only to the extent of adding more real-servers and adjusting which systems are running as directors.

Checking past mail to lvs-users and searching with Google fails to find any mention of such a problem. Even if no-one's actually seen the problem before, I hope someone may be able to work out either what's happening and how to solve it, or suggest how to identify the cause.

We are using the LVS DR (Direct Routing) configuration, so that the high volume of data from the web cache real-servers can go direct to the clients, bypassing the directors (which might well become a bottleneck if that system had to handle all the cache response data).

The virtual IP addresses and ports on which the (very busy) live services would run are:


        131.111.8.1 port 8080 (TCP)     - web cache HTTP (Squid real-server)
        131.111.8.1 port 3130 (UDP)     - web cache ICP (Squid real-server)
        131.111.8.68 port 80 (TCP)      - WPAD HTTP (Web Proxy
                                          Auto-Discovery, Apache real-server)

Those are the actual addresses, and may be significant as the problem seems to be address-specific. The real-servers have 131.111.8.x canonical addresses, with the hidden interfaces used as targets for the LVS-managed services set up as aliases of dummy0.
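
For reference, the virtual-service table that ldirectord would be expected to maintain for this setup can be sketched as the equivalent ipvsadm commands. This is only an illustration, not the configuration actually in use: the real-server addresses below are placeholders (the post says only that they are 131.111.8.x), no scheduler is specified, and the helper name `ipvsadm_commands` is made up for the example. It uses ipvsadm's `-A`/`-a` (add service / add real-server), `-t`/`-u` (TCP/UDP) and `-g` (gatewaying, i.e. direct routing) options.

```python
# Virtual services from the table above: (ipvsadm protocol flag, VIP, port).
# "-t" marks a TCP service, "-u" a UDP service.
VIRTUAL_SERVICES = [
    ("-t", "131.111.8.1", 8080),   # web cache HTTP (Squid)
    ("-u", "131.111.8.1", 3130),   # web cache ICP (Squid)
    ("-t", "131.111.8.68", 80),    # WPAD HTTP (Apache)
]

# Placeholder real-server addresses (assumption: the real ones are not given).
REAL_SERVERS = ["192.0.2.11", "192.0.2.12"]


def ipvsadm_commands(services, real_servers):
    """Return ipvsadm command lines for an LVS-DR setup.

    One "-A" line per virtual service, then one "-a" line per real-server.
    """
    commands = []
    for proto, vip, port in services:
        commands.append(f"ipvsadm -A {proto} {vip}:{port}")
        for rip in real_servers:
            # -g selects direct routing; the real-server port is kept equal
            # to the virtual port, since DR does not rewrite ports.
            commands.append(f"ipvsadm -a {proto} {vip}:{port} -r {rip}:{port} -g")
    return commands


if __name__ == "__main__":
    print("\n".join(ipvsadm_commands(VIRTUAL_SERVICES, REAL_SERVERS)))
```

Comparing such a list with the output of `ipvsadm -L -n` on the old and new directors would be one way to confirm that both virtual addresses really are configured alike.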

The LVS-related software (inc. kernel/modules) are the versions supplied by SuSE in SLES9 (as modified by subsequent official updates) with the sole exception that we'd normally use a local version of ldirectord that has been modified to send pager and email alerts when services/servers are added or dropped. However, I just tried the unmodified original version and it fails in the same way. The heartbeat and heartbeat-ldirectord RPMs are V1.2.3-2.9 and the ldirectord which they include is v1.77.2.10 (but I don't know if that includes any SuSE modifications).

The back-end systems are all capable of running both types of real-server and would normally have both servers running and handling requests, though ldirectord should adapt to whatever's available.

Tests (using alternative virtual IP addresses) before going live "just worked", but a major and bizarre problem turned up when trying it with the live addresses and handling traffic for the entire university (in contrast to the very low volume of test requests that I'd tried with the test addresses).

The problem is that while access to the web cache (131.111.8.1) through the new director "just worked", the WPAD requests almost all got stuck - shown as SYN_RECV in /proc/net/ip_vs_conn and reported in /var/log/messages as


... ip_rt_bug: [clientIPhere] -> 131.111.8.68, eth1

and corresponding packets do not get forwarded to a real-server.
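
To put numbers on the stuck connections during a brief live test, a saved copy of /proc/net/ip_vs_conn could be summarised per virtual service and state. The sketch below assumes the usual layout of that file (a header line, then protocol, client address/port, virtual address/port, real-server address/port as hexadecimal fields, followed by state and expiry); the sample rows are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Invented sample in the assumed /proc/net/ip_vs_conn layout.
# 836F0844:0050 is 131.111.8.68:80, 836F0801:1F90 is 131.111.8.1:8080.
SAMPLE = """\
Pro FromIP   FPrt ToIP     TPrt DestIP   DPrt State       Expires
TCP C0000205 C4F2 836F0844 0050 C000020B 0050 SYN_RECV         58
TCP C0000206 C4F3 836F0844 0050 C000020B 0050 SYN_RECV         57
TCP C0000207 9C40 836F0801 1F90 C000020C 1F90 ESTABLISHED     899
"""


def hex_to_ip(hex_addr):
    """Convert an 8-digit hex address (e.g. '836F0844') to dotted quad."""
    return ".".join(str(int(hex_addr[i:i + 2], 16)) for i in range(0, 8, 2))


def count_states(text):
    """Count connection entries per (virtual 'ip:port', state)."""
    counts = Counter()
    for line in text.splitlines()[1:]:      # skip the header line
        fields = line.split()
        if len(fields) < 8:
            continue                        # ignore short or blank lines
        vip = hex_to_ip(fields[3])
        port = int(fields[4], 16)
        counts[(f"{vip}:{port}", fields[7])] += 1
    return counts


if __name__ == "__main__":
    for (service, state), n in sorted(count_states(SAMPLE).items()):
        print(f"{service:22} {state:12} {n}")
```

A large SYN_RECV count for 131.111.8.68:80 alongside mostly ESTABLISHED entries for 131.111.8.1:8080 would match the behaviour described here.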

I have looked for configuration errors etc., without success, and it seems bizarre that packets for one service are handled as expected but the other (with almost identical configuration) fails for virtually all requests. I'm not sure that a configuration error would be able to cause what appears to be a "should never happen" problem inside the kernel, anyway!

An additional oddity was that when testing while logged in from home via a VPDN connection, requests from my home PC (with an address assigned by DHCP, shouldn't be treated any different to other client systems within the university's network) *did* get passed through to a WPAD real-server, implying that the blockage is near-total but not 100%.
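
Since at least one client did get through, one quick way to look for a pattern would be to tally the source addresses in the ip_rt_bug messages by network and see which ranges never appear. The sketch below is only an illustration: the syslog prefix in the sample lines and the client addresses are invented, the grouping by /24 is an arbitrary choice, and `failures_by_network` is a hypothetical helper name. Only the "ip_rt_bug: source -> destination, interface" part follows the message format quoted above.

```python
import ipaddress
import re
from collections import Counter

# Matches the "ip_rt_bug: <source> -> <destination>, <interface>" part of a line.
LINE_RE = re.compile(
    r"ip_rt_bug: (\d+\.\d+\.\d+\.\d+) -> (\d+\.\d+\.\d+\.\d+), (\S+)"
)

# Invented sample lines; only the ip_rt_bug part follows the quoted format.
SAMPLE_LOG = """\
Nov 30 14:01:02 director kernel: ip_rt_bug: 192.0.2.10 -> 131.111.8.68, eth1
Nov 30 14:01:02 director kernel: ip_rt_bug: 192.0.2.77 -> 131.111.8.68, eth1
Nov 30 14:01:03 director kernel: ip_rt_bug: 198.51.100.4 -> 131.111.8.68, eth1
"""


def failures_by_network(log_text, prefix_len=24):
    """Count ip_rt_bug messages per source network of the given prefix length."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue                        # not an ip_rt_bug message
        source = match.group(1)
        # strict=False lets a host address stand in for its enclosing network.
        network = ipaddress.ip_network(f"{source}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts


if __name__ == "__main__":
    for network, n in failures_by_network(SAMPLE_LOG).most_common():
        print(f"{network:20} {n}")
```

Run against a copy of /var/log/messages taken after a short live test, this needs no extra time on the director itself, which matters given how brief any test has to be.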

Any suggestions either for possible explanations/solutions or how to investigate the true cause of the problem would be gratefully received. However, since the failures occur only when the new director is running live and meant to be handling WPAD requests for the entire university, any investigation has to be very brief in order to minimise disruption.


                                John
--
John Line - web & news development, University of Cambridge Computing Service
