Re: Busted Cluster

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: Busted Cluster
From: Rob <ipvsuser@xxxxxxxxxxxxxxxx>
Date: Sun, 13 Mar 2005 04:40:15 -0800
Since no one else is on right now, I'll offer this from my own experience.

I had a high number of inactive connections with Apache set up not to use keepalive at all. After activating keepalive in Apache (LVS was already set to persistent), the number of inactive connections went way down.

So in my case at least, they were connections that were set up, used for a single GET of a gif, button, jpeg, JS script, or other page component, and then closed by the server, only for the client to open another for the next gif, and so on.
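For reference, the Apache side of that change is just the keepalive directives in httpd.conf (the values below are illustrative defaults, not the ones from my config):

```apache
# httpd.conf (Apache 2.x) - illustrative values only
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
```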

You might be able to use something like multilog to watch a bunch of the logs at the same time, to get an idea of whether the traffic looks like real people (get page 1, get page 1's images, get page 2, get page 2's images) or random hammering from a DoS attack.

I wrote a small shell script that pulled the recent log entries, counted the hits per IP address for certain requests, and then created an iptables rule on the director (or some machine in front of the director) to tarpit requests from that IP. This worked in my situation because we knew that certain URLs were only hit a small number of times during a legitimate session (a login page shouldn't be hit 957 times in an hour by the same external IP). This could help stem the tide of requests if you are actually encountering a (D)DoS. I ran it every 12 minutes or so. If you are being DDoSed, the TARPIT target of iptables (http://www.securityfocus.com/infocus/1723) or the standalone tarpit can be a great help. Also, Felix and his company seem to have helped some large companies deal with high-traffic DDoS attacks - http://www.fefe.de/
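I can't find the original script, but the idea was something like this sketch. The watched URL, the threshold, and the tarpit rule here are assumptions to adjust for your own logs, and field 2 is assumed to hold the client IP (matching the log-checking scripts below):

```shell
#!/bin/sh
# Sketch of the approach, not the original script: find external IPs
# that hit a watched URL more than a threshold number of times, so they
# can be fed to an iptables TARPIT rule.
offenders() {
    pattern=$1
    threshold=$2
    grep "$pattern" |                   # keep only the watched requests
        awk '{print $2}' |              # pull out the client IP (field 2)
        sort | uniq -c | sort -nr |     # count hits per IP, worst first
        awk -v t="$threshold" '$1 > t {print $2}'
}

# Cron this every 10-15 minutes or so, e.g. (commented out: needs an
# iptables build with the TARPIT target):
# offenders "/login" 100 < /usr/local/apache2/logs/access_log |
# while read ip; do
#     iptables -I INPUT -s "$ip" -p tcp --dport 80 -j TARPIT
# done
```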

BTW, you might be interested in http://www.backhand.org/mod_log_spread/ for centralized and redundant logging. That way you can run different kinds of real-time analysis with no extra load on the webservers or the normal logging hosts, just by having an additional machine join/subscribe to the multicast Spread group carrying the log data.

Rob

OK, I can't find my script, but this was the start of it. It is hardly a shell script, but someone may find it useful. Add a "grep blah" command just before the awk '{print $2}' if you want only certain requests, or for other filtering.

multidaychk.sh
#!/bin/sh
# look for multiday patterns
# $1 is how many days back to search
# $2 is how many high usage IPs to list
ls -1tr /usr/local/apache2/logs/access_log.200*0 | tail -${1} | xargs -n 1 cat | awk '{print $2}' | sort | uniq -c | sort -nr | head -${2}

byhrchk.sh
#!/bin/sh
# looks for IPs hitting during a certain hr of the day
# $1 is how many days back to search
# $2 is how many high usage IPs to list
# $3 is which hour of the day (two digits, e.g. "02"; note the year is
# hardcoded in the fgrep pattern below)
ls -1tr /usr/local/apache2/logs/access_log.200*0 | tail -${1} | xargs -n 1 cat | fgrep "2005:${3}" | awk '{print $2}' | sort | uniq -c | sort -nr | head -${2}

recentchk.sh
#!/bin/sh
# This just checks the latest X lines from the newest log file
# $1 is how many lines from the file
# $2 is how many high usage IPs to list
ls -1tr /usr/local/apache2/logs/access_log.200*0 | tail -1 | xargs -n 1 tail -${1} | awk '{print $2}' | sort | uniq -c | sort -nr | head -${2}
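Per the note above, the "grep blah" filter splices into the shared tail of all three pipelines. Wrapped as a function that reads log lines on stdin, so any of the ls/tail/xargs front ends can feed it ("/login" is just an example pattern, not from my setup):

```shell
#!/bin/sh
# Shared tail of the three scripts above, with the request filter from
# the note spliced in before the awk.  Field 2 is the client IP, as in
# those scripts.
top_ips() {
    pattern=$1
    howmany=$2
    grep "$pattern" | awk '{print $2}' | sort | uniq -c | sort -nr | head -"$howmany"
}

# e.g. in recentchk.sh, replace everything after the xargs with:
#   ... | top_ips "/login" ${2}
```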

HTH

nigel@xxxxxxxxxxx wrote:
Hi,

      Now the bad news. This weekend the web service we run came under
increased load - about an extra 10,000,000 queries per day - and we now
have a busted cluster. Here is what IPVS looks like:

IP Virtual Server version 1.0.10 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  66.98.x.y:80 rr
  -> 66.98.x.y:80              Tunnel  1      37         337
  -> 67.15.x.y:80               Tunnel  1      14         382
  -> 66.98.x.y:80              Tunnel  1      6          131
  -> 207.44.x.y:80             Tunnel  1      21         325
  -> 66.98.x.y:80              Tunnel  1      57         422
  -> 207.44.x.y:80             Tunnel  1      12         354
  -> 69.57.x.y:80              Tunnel  1      33         355
  -> 67.15.x.y:80                Tunnel  1      71         274
  -> 67.15.x.y:80               Tunnel  1      12         378
  -> 207.44.x.y:80             Tunnel  1      5          345
  -> 66.98.x.y:80               Tunnel  1      59         301
  -> 67.15.x.y:80               Tunnel  1      2          347
  -> 67.15.x.y:80               Tunnel  1      19         375
  -> 69.57.x.y:80              Tunnel  1      10         132
  -> 69.57.x.y:80              Tunnel  1      3          128
  -> 67.15.x.y:80               Tunnel  1      15         361
  -> 69.57.x.y:80              Tunnel  1      8          128
  -> 67.15.x.y:80               Tunnel  1      229        303
  -> 67.15.x.y:80               Tunnel  1      16         372
  -> 67.15.x.y:80               Tunnel  1      125        317
  -> 67.15.x.y:80               Tunnel  1      12         367
  -> 207.44.x.y:80             Tunnel  1      13         333
  -> 207.44.x.y:80             Tunnel  0      144        5
  -> 66.98.x.y:80              Tunnel  1      10         404
  -> 207.44.x.y:80             Tunnel  0      0          0
  -> 207.44.x.y:80             Tunnel  1      132        277

 At this point the service works but is too slow. But within the next 60 seconds
the InActConn count grows to over 2,000 per real server, and the whole thing
locks up.

* What precisely do the InActConn figures show?

Is this symptomatic of simply an overloaded cluster, or could it be a DoS
problem?

Any insights or similar experiences would be much appreciated.

Kind regards,


Nigel
