Hi,
I've implemented an LVS/TUN cluster to split web traffic amongst 26+
nodes with a replicated MySQL database on each node.
First, the good news. LVS has been performing well! So too has MySQL
replication, which seems to be handling 26-30 nodes fine (although I wonder how
far it can go?).
However, we periodically notice that an incoming request to the cluster
hangs forever. Has anyone else experienced this? Is it LVS related? This is a
relatively small problem, though.
Now the bad news. This weekend the web service we run came under
increased load (about an extra 10,000,000 queries per day) and we now
have a busted cluster. Here is what IPVS looks like:
IP Virtual Server version 1.0.10 (size=65536)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 66.98.x.y:80 rr
-> 66.98.x.y:80 Tunnel 1 37 337
-> 67.15.x.y:80 Tunnel 1 14 382
-> 66.98.x.y:80 Tunnel 1 6 131
-> 207.44.x.y:80 Tunnel 1 21 325
-> 66.98.x.y:80 Tunnel 1 57 422
-> 207.44.x.y:80 Tunnel 1 12 354
-> 69.57.x.y:80 Tunnel 1 33 355
-> 67.15.x.y:80 Tunnel 1 71 274
-> 67.15.x.y:80 Tunnel 1 12 378
-> 207.44.x.y:80 Tunnel 1 5 345
-> 66.98.x.y:80 Tunnel 1 59 301
-> 67.15.x.y:80 Tunnel 1 2 347
-> 67.15.x.y:80 Tunnel 1 19 375
-> 69.57.x.y:80 Tunnel 1 10 132
-> 69.57.x.y:80 Tunnel 1 3 128
-> 67.15.x.y:80 Tunnel 1 15 361
-> 69.57.x.y:80 Tunnel 1 8 128
-> 67.15.x.y:80 Tunnel 1 229 303
-> 67.15.x.y:80 Tunnel 1 16 372
-> 67.15.x.y:80 Tunnel 1 125 317
-> 67.15.x.y:80 Tunnel 1 12 367
-> 207.44.x.y:80 Tunnel 1 13 333
-> 207.44.x.y:80 Tunnel 0 144 5
-> 66.98.x.y:80 Tunnel 1 10 404
-> 207.44.x.y:80 Tunnel 0 0 0
-> 207.44.x.y:80 Tunnel 1 132 277
At this point the service still works but is too slow. Within the next 60 seconds
the InActConn count grows to over 2,000 per real server, and the whole thing
locks up.
* What precisely do the InActConn figures show?
* Is this symptomatic of simply an overloaded cluster, or could it be a DoS
attack?
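In case it helps with diagnosis, this is roughly how the per-connection states behind those InActConn counts could be inspected, and how the IPVS TCP timeouts could be shortened so stale entries expire sooner (the timeout values below are illustrative guesses, not tuned recommendations):

```shell
# List individual IPVS connection entries with their TCP state;
# InActConn counts entries that are NOT in ESTABLISHED state
# (e.g. SYN_RECV, FIN_WAIT, TIME_WAIT).
ipvsadm -L -n -c

# Tally the states to see whether half-open (SYN_RECV) or
# half-closed (FIN_WAIT) entries dominate -- a flood of SYN_RECV
# would suggest a SYN-flood style DoS rather than plain overload.
ipvsadm -L -n -c | awk 'NR > 2 { print $3 }' | sort | uniq -c | sort -rn

# Shorten the tcp / tcpfin / udp timeouts (in seconds) so finished
# connections fall out of the table sooner.
ipvsadm --set 120 60 60
```

The tcpfin timeout may matter especially for LVS/TUN, since the director never sees the return traffic from the real servers and so cannot tell when a connection has actually closed; entries just sit in the table until they time out.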
Any insights or similar experiences would be much appreciated.
Kind regards,
Nigel