Hello Jeremy,
First of all, would you mind not inlining a whole posting when replying
unless you refer to some specific part of the text? It makes it easier
to read postings, thank you.
I've set up a script to run which calls all of your requested commands, plus
dmesg. I'll show you what I get now when I'm not experiencing problems. I
shut down the cron job that brings eth2 down and up, and hopefully I'll get
the problem before the day is over.
I think I've already found your problems.
When director goes down, heartbeat tells director2 to bring up all its eth0
interfaces, sends out arps for them, and also changes its eth1 address to
10.75.0.1, so it's now the DGW for the realservers. This seems to work
fine, and it's been getting a real workout for a while now. :)
Ok, thanks for the explanation, if only every poster here with problems
would be as detailed and specific as you are ... ;)
We have another web server on the same network as the director. Twice
during the director's problems I got alerts that the other webserver was
down too. I logged into that box and saw a bunch of httpd processes
running, a lot more than normal. Looking at the apache log files I saw
there were a bunch of SSL handshake errors. This sounds like the new apache
mod_ssl worm that's out there. All of our openssls have been upgraded and
mod_ssl/apache recompiled. I think it may have been infected servers
hitting our servers trying to figure out if we were exploitable.
Ok.
Another thing I found during these problems through ntop is sometimes a huge
spike of mail will come in. I think these are spammers doing dictionary
attacks on us. We have a few tens of thousands of email accounts on the
realservers, so you can imagine the spam that comes into our network.
I hope you know how to start proper countermeasures against these 'attacks'.
The thing is, I've seen those two issues during the buffer space problem,
but not all of the time. Sometimes one, sometimes the other, sometimes
neither. Don't know if it's a coincidence or not.
I think it has to do with net_ratelimit() or the neighbour table gc_thresh
settings. See further below.
4: eth2: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
link/ether 00:01:03:e4:4b:93 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped  overrun  mcast
    4827139    25897    0       0        0        0
    RX errors: length   crc     frame    fifo     missed
               0        0       0        25       0
                                         ^^^^
not much, but still.
[deleted the IP addresses for now]
-------------------------------------------------
cat /proc/net/softnet_stat
0163073a 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00007cff
0162c847 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00007c00
Ok
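
For reading these counters without doing hex arithmetic by hand, here is a
minimal Python sketch. It assumes one line per CPU and that the first three
columns are packets processed, packets dropped and time_squeeze, which is my
understanding of the 2.4-era layout; the remaining columns are left unnamed
on purpose. The function name and the SAMPLE constant are made up for this
example.

```python
# Assumed names for the first three columns only; later columns differ
# between kernel versions and are deliberately not interpreted here.
SOFTNET_FIELDS = ("processed", "dropped", "time_squeeze")


def parse_softnet_stat(text):
    """Return one dict per CPU line with the first three counters as ints."""
    rows = []
    for line in text.strip().splitlines():
        values = [int(tok, 16) for tok in line.split()]
        if len(values) < len(SOFTNET_FIELDS):
            # A mail client may wrap the last column onto its own line.
            continue
        rows.append(dict(zip(SOFTNET_FIELDS, values)))
    return rows


# The two CPU lines from the posting above.
SAMPLE = """\
0163073a 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00007cff
0162c847 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00007c00
"""

if __name__ == "__main__":
    for cpu, row in enumerate(parse_softnet_stat(SAMPLE)):
        print("cpu%d" % cpu, row)
```

A zero in the second column on both CPUs would mean the input queue has not
been dropping packets, which fits the "Ok" above.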
-------------------------------------------------
cat /proc/net/rt_cache_stat
00000b2d 0138e91f 0021bb47 00000000 00000000 00000441 00000000 00000025
00368c6a 000a014c 00000444 000a03fe 0009fe0c 0000006c 00000000
00000b2d 0138ae70 0021b50c 00000000 00000000 00000414 00000000 00000020
003697a3 0009e898 000004bd 000a033e 0009fcf4 00000061 00000000
Ok
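
The routing cache numbers can be decoded the same way. The sketch below is
only an illustration: it assumes 15 hex counters per CPU (as in this dump)
and that the first three are entries, in_hit and in_slow_tot, which is how I
remember the layout but not something the output itself confirms. Because
the lines were wrapped by the mailer, it groups tokens by count instead of by
line. The function name and SAMPLE are hypothetical.

```python
def parse_rt_cache_stat(text, fields_per_cpu=15):
    """Group hex counters per CPU and report the input cache hit ratio.

    Assumption: columns 1-3 are entries, in_hit and in_slow_tot.
    """
    tokens = [int(tok, 16) for tok in text.split()]
    rows = []
    for start in range(0, len(tokens) - fields_per_cpu + 1, fields_per_cpu):
        entries, in_hit, in_slow_tot = tokens[start:start + 3]
        lookups = in_hit + in_slow_tot
        rows.append({
            "entries": entries,
            "in_hit": in_hit,
            "in_slow_tot": in_slow_tot,
            # Share of input lookups answered from the cache.
            "in_hit_ratio": in_hit / lookups if lookups else 0.0,
        })
    return rows


# The wrapped dump from the posting above, two CPUs.
SAMPLE = """\
00000b2d 0138e91f 0021bb47 00000000 00000000 00000441 00000000 00000025
00368c6a 000a014c 00000444 000a03fe 0009fe0c 0000006c 00000000
00000b2d 0138ae70 0021b50c 00000000 00000000 00000414 00000000 00000020
003697a3 0009e898 000004bd 000a033e 0009fcf4 0000006c 00000000
"""

if __name__ == "__main__":
    for cpu, row in enumerate(parse_rt_cache_stat(SAMPLE)):
        print("cpu%d entries=%d in_hit_ratio=%.3f"
              % (cpu, row["entries"], row["in_hit_ratio"]))
```

If the column assumption holds, both CPUs see the same number of cache
entries and most input lookups are cache hits.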
cat /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache 80 80 244 5 5 1 : 252 126
ip_conntrack 1884 3806 352 275 346 1 : 124 62
ip_fib_hash 339 339 32 3 3 1 : 252 126
ip_vs_conn 8322 11850 128 290 395 1 : 252 126
tcp_tw_bucket 420 420 128 14 14 1 : 252 126
tcp_bind_bucket 326 452 32 4 4 1 : 252 126
tcp_open_request 280 280 96 7 7 1 : 252 126
inet_peer_cache 408 1416 64 24 24 1 : 252 126
ip_dst_cache 3074 6980 192 254 349 1 : 252 126
arp_cache 1044 1170 128 39 39 1 : 252 126
Aha, might get full soon.
blkdev_requests 400 400 96 10 10 1 : 252 126
nfs_write_data 132 132 352 12 12 1 : 124 62
nfs_read_data 132 132 352 12 12 1 : 124 62
nfs_page 280 280 96 7 7 1 : 252 126
journal_head 324 2340 48 7 30 1 : 252 126
revoke_table 126 253 12 1 1 1 : 252 126
revoke_record 226 226 32 2 2 1 : 252 126
dnotify cache 0 0 20 0 0 1 : 252 126
file lock cache 126 126 92 3 3 1 : 252 126
fasync cache 0 0 16 0 0 1 : 252 126
uid_cache 226 226 32 2 2 1 : 252 126
skbuff_head_cache 582 960 192 30 48 1 : 252 126
sock 184 184 928 46 46 1 : 124 62
sigqueue 261 261 132 9 9 1 : 252 126
cdev_cache 1239 1239 64 21 21 1 : 252 126
bdev_cache 118 118 64 2 2 1 : 252 126
mnt_cache 118 118 64 2 2 1 : 252 126
---------------------
inode_cache 114100 114100 512 16300 16300 1 : 124 62
dentry_cache 116370 116370 128 3879 3879 1 : 252 126
Jeez, what the hell are you running on this box?
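
To see which caches stand out, a rough Python sketch that ranks slabinfo 1.1
entries by total objects times object size. That product is only an
approximation of the memory in use, since it ignores per-slab overhead. The
column order (name, active objects, total objects, object size, active slabs,
total slabs, pages per slab) is the slabinfo 1.1 layout; the function names
and the shortened SAMPLE are made up for this example.

```python
def parse_slabinfo_v11(text):
    """Map cache name -> (active_objs, num_objs, objsize)."""
    caches = {}
    for line in text.splitlines():
        left, sep, _tunables = line.partition(" : ")
        if not sep:
            continue  # version header or commentary, not a cache line
        parts = left.split()
        if len(parts) < 7:
            continue
        try:
            nums = [int(p) for p in parts[-6:]]
        except ValueError:
            continue
        # Cache names may contain spaces ("dnotify cache"), so the name is
        # whatever precedes the last six numeric columns.
        name = " ".join(parts[:-6])
        caches[name] = (nums[0], nums[1], nums[2])
    return caches


def largest_caches(text, count=3):
    """Return (name, approx_bytes) pairs, biggest first."""
    sizes = [(name, total * objsize)
             for name, (_active, total, objsize)
             in parse_slabinfo_v11(text).items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:count]


# A few lines copied from the dump above.
SAMPLE = """\
slabinfo - version: 1.1 (SMP)
ip_conntrack 1884 3806 352 275 346 1 : 124 62
ip_vs_conn 8322 11850 128 290 395 1 : 252 126
arp_cache 1044 1170 128 39 39 1 : 252 126
dnotify cache 0 0 20 0 0 1 : 252 126
inode_cache 114100 114100 512 16300 16300 1 : 124 62
dentry_cache 116370 116370 128 3879 3879 1 : 252 126
"""

if __name__ == "__main__":
    for name, approx_bytes in largest_caches(SAMPLE):
        print("%-16s %6.1f MiB" % (name, approx_bytes / (1024.0 * 1024.0)))
```

On the lines shown, inode_cache and dentry_cache come out far ahead of the
networking caches.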
192.168.0.128 sent an invalid ICMP error to a broadcast.
192.168.0.128 sent an invalid ICMP error to a broadcast.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Ok, try the following:
echo "4096" > /proc/sys/net/ipv4/neigh/default/gc_thresh3
and try to ping again and check dmesg.
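
To check how close the neighbour table is to its limit before and after the
change, a small Python sketch. It reads gc_thresh1..3 from the same /proc
directory as the echo above and compares gc_thresh3 with an entry count you
supply. Using the arp_cache active object count from slabinfo (1044 in the
dump) as that entry count is an assumption, and the function names are
invented for this example. As far as I know the usual default for gc_thresh3
is 1024, which would explain the "Neighbour table overflow" messages.

```python
import os

# Same directory the echo command above writes to.
NEIGH_DIR = "/proc/sys/net/ipv4/neigh/default"


def read_gc_thresholds(base=NEIGH_DIR):
    """Return {name: value} for gc_thresh1..3, skipping unreadable files."""
    thresholds = {}
    for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
        try:
            with open(os.path.join(base, name)) as handle:
                thresholds[name] = int(handle.read().strip())
        except (OSError, ValueError):
            continue
    return thresholds


def neighbour_headroom(entries, gc_thresh3):
    """Entries still available before the hard limit; negative means over."""
    return gc_thresh3 - entries


if __name__ == "__main__":
    current = read_gc_thresholds()
    print("current thresholds:", current)
    # 1044 is the arp_cache active object count from the slabinfo dump.
    limit = current.get("gc_thresh3", 1024)
    print("headroom with 1044 entries:", neighbour_headroom(1044, limit))
```

Note that a value written with echo does not survive a reboot; putting
net.ipv4.neigh.default.gc_thresh3 into /etc/sysctl.conf makes it permanent.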
192.168.0.128 sent an invalid ICMP error to a broadcast.
IPVS: incoming ICMP: failed checksum from 65.113.143.64!
:) Julian, look at that!
Best regards,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq'|dc