Re: Lost packets and dead/warntime

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: Lost packets and dead/warntime
From: "Sebastian Vieira" <sebvieira@xxxxxxxxx>
Date: Fri, 1 Sep 2006 10:27:28 +0200

On 8/18/06, Graeme Fowler <graeme@xxxxxxxxxxx> wrote:

> Beyond ensuring that the machines' network settings are good, that
> they're not accumulating errors at the hardware level (check ifconfig
> output), and that they're not interrupting themselves off the planet
> (/proc/interrupts is a good place to start), I have no idea.


Hi. Sorry for the late reply. Work work work and no play.

I've checked ifconfig output and see this:

eth2      Link encap:Ethernet  HWaddr 00:02:A5:08:E3:73
         inet addr:172.16.0.102  Bcast:172.16.255.255  Mask:255.255.0.0
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:76727063 errors:3044 dropped:0 overruns:0 frame:3044
         TX packets:76774485 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:3882941796 (3703.0 Mb)  TX bytes:3900571779 (3719.8 Mb)


eth2      Link encap:Ethernet  HWaddr 00:02:A5:09:79:CD
         inet addr:172.16.0.101  Bcast:172.16.255.255  Mask:255.255.0.0
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:1432209 errors:156 dropped:0 overruns:0 frame:156
         TX packets:1432784 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:232381530 (221.6 Mb)  TX bytes:230872325 (220.1 Mb)
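
To put those counters in proportion, here is a minimal Python sketch that works out the RX error ratio from the two "RX packets" lines above. The helper name rx_error_ratio is made up for illustration, and the regular expression assumes the ifconfig line layout shown here.

```python
import re

def rx_error_ratio(rx_line):
    """Return (errors, packets, errors/packets) from an ifconfig 'RX packets' line."""
    m = re.search(r"RX packets:(\d+) errors:(\d+)", rx_line)
    if m is None:
        raise ValueError("not an ifconfig 'RX packets' line")
    packets, errors = int(m.group(1)), int(m.group(2))
    ratio = errors / packets if packets else 0.0
    return errors, packets, ratio

# The RX lines copied from the two ifconfig outputs above.
rx_lines = {
    "172.16.0.102": "RX packets:76727063 errors:3044 dropped:0 overruns:0 frame:3044",
    "172.16.0.101": "RX packets:1432209 errors:156 dropped:0 overruns:0 frame:156",
}

for addr, line in rx_lines.items():
    errors, packets, ratio = rx_error_ratio(line)
    print(f"{addr}: {errors} errors in {packets} packets ({ratio:.4%})")
```

That comes to roughly 0.004% on the first box and 0.011% on the second. On both boxes the "errors" count equals the "frame" count, so every RX error recorded is a frame error.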


Now I don't know for sure where the errors come from, or what 'frame' means,
but I'm sure it's not very good. I've looked into /proc/interrupts and I see
that on one box all NICs share IRQ 15, and on the other IRQ 11. But there's
a huge number next to the interrupt that keeps increasing. I
suppose that's not very good either:

11:  369029512          XT-PIC  eth2, eth0, eth1

15:    3131945          XT-PIC  eth2, eth0, eth1
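
The number after the IRQ is a running total of interrupts handled on that line, so it only ever goes up. A minimal Python sketch for pulling the two lines above apart and listing which devices share each IRQ follows. The function name parse_irq_line is hypothetical, and the parsing assumes the single-CPU layout shown here with a numeric IRQ (labels such as NMI are not handled).

```python
def parse_irq_line(line):
    """Split one /proc/interrupts line into (irq, count, controller, devices).

    Assumes the single-CPU layout shown above: 'IRQ: COUNT CONTROLLER dev, dev, ...'.
    """
    irq, rest = line.split(":", 1)
    fields = rest.split()
    count = int(fields[0])       # running total of interrupts on this line
    controller = fields[1]       # interrupt controller, e.g. XT-PIC
    devices = [d.strip() for d in " ".join(fields[2:]).split(",")]
    return int(irq), count, controller, devices

# The two lines quoted above, one from each box.
for line in (
    "11:  369029512          XT-PIC  eth2, eth0, eth1",
    "15:    3131945          XT-PIC  eth2, eth0, eth1",
):
    irq, count, controller, devices = parse_irq_line(line)
    shared = "shared by" if len(devices) > 1 else "used by"
    print(f"IRQ {irq} ({controller}): {count} interrupts, {shared} {', '.join(devices)}")
```

Because the total is cumulative, a single reading says little; comparing two readings taken a known interval apart would give the actual interrupt rate.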


> It still sounds to me like the fault lies below the application layer.
>
> Speaking of interrupts; you say you have eth0/1 bonded. Please make sure
> that you haven't got several hundred megs worth of traffic looping


I would love to, but I don't know how.

> around your ethernet because of that. If you have you could be dropping
> packets simply because your kernels cannot keep up with the traffic load
> - a layer 2 loop somewhere could cause an effective DoS condition like
> this quite trivially.
>
> What mode is your bond interface in?


active-slave
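
The Linux bonding driver reports the mode it is actually running in under /proc/net/bonding/. A minimal Python sketch for reading it follows, assuming the bond is named bond0 (the name is not stated in this thread). The function names are hypothetical, and the sample text is illustrative rather than taken from these machines.

```python
import re

def bonding_mode(status_text):
    """Return the value of the 'Bonding Mode:' line from bonding status text, or None."""
    m = re.search(r"^Bonding Mode:\s*(.+)$", status_text, re.MULTILINE)
    return m.group(1).strip() if m else None

def read_bonding_mode(bond="bond0"):
    """Read the mode from /proc/net/bonding/<bond>; 'bond0' is an assumed name."""
    with open(f"/proc/net/bonding/{bond}") as f:
        return bonding_mode(f.read())

# Illustrative sample only, not output from the boxes discussed here.
sample = "Bonding Mode: fault-tolerance (active-backup)\nMII Status: up\n"
print(bonding_mode(sample))
```

On a box with bonding configured, read_bonding_mode() would return whatever the driver itself reports for that interface.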

> I've never used heartbeat, so I can't really suggest anything else.
> Anyone else got any clever ideas?
>
> Graeme


Thanks,

Sebastian
