
Re: System crash

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: System crash
From: Roberto Nibali <ratz@xxxxxxxxxxxx>
Date: Sat, 02 Oct 2004 13:37:20 +0200
Hello Sebastien,

Ah, I'm afraid you need to enable some debugging flags in your kernel (CONFIG_FRAME_POINTER, CONFIG_DEBUG_SPINLOCK_SLEEP, CONFIG_DEBUG_SPINLOCK, and of course CONFIG_DEBUG_KERNEL).
Grepping .config produces the following:
# CONFIG_FRAME_POINTER is not set
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y

How can that be ??

CONFIG_DEBUG_KERNEL=y
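
As a minimal sketch of checking the four options named above in one pass: the function name `check_debug_options` is hypothetical, and the path to the configuration file is whatever applies on the machine in question. Each pattern is anchored so that a search for CONFIG_DEBUG_SPINLOCK does not also match CONFIG_DEBUG_SPINLOCK_SLEEP.

```shell
#!/bin/sh
# Report the state of the kernel debugging options named in this thread.
# check_debug_options is a hypothetical helper; pass it the path to .config.
check_debug_options() {
    config_file=$1
    for opt in CONFIG_DEBUG_KERNEL CONFIG_FRAME_POINTER \
               CONFIG_DEBUG_SPINLOCK CONFIG_DEBUG_SPINLOCK_SLEEP; do
        # Match the whole line, so one option name cannot match a longer one.
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "$opt enabled"
        else
            echo "$opt NOT enabled"
        fi
    done
}
```

Usage would be something like `check_debug_options /usr/src/linux/.config`, with the path adjusted to wherever the kernel tree lives.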

Well, maybe you can't recompile your kernel but then you have to hook up a serial console to your crashing machine. Also make sure you're not running in X while this happens. But I assume someone running a server would not install X anyways ;).
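
A rough sketch of the serial console setup, under several assumptions: the first serial port (ttyS0) on both machines, a speed of 115200 baud, a null-modem cable between them, and GNU `stty`. The helper name `capture_serial` and the log file name are hypothetical.

```shell
#!/bin/sh
# On the crashing machine, add the following to the kernel command line
# (with LILO, inside the append="..." string):
#   console=ttyS0,115200n8 console=tty0
#
# On the second machine, log everything that arrives on the serial port.
# capture_serial is a hypothetical helper; device and log file are assumptions.
capture_serial() {
    device=${1:-/dev/ttyS0}
    logfile=${2:-crash-console.log}
    if [ ! -c "$device" ]; then
        echo "$device is not a character device" >&2
        return 1
    fi
    # Same speed as given in console= above; raw mode avoids line editing.
    stty -F "$device" 115200 raw || return 1
    # Show the output and append it to the log until interrupted with Ctrl-C.
    tee -a "$logfile" < "$device"
}
```

With this running on the second machine, the complete trace ends up in the log file even if the crashing machine's screen only shows the last few lines.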

A bit long, sent privately ...

I've seen it and I don't really think there's anything special in it.

o biosdecode
command not found

Not so important, I saw the relevant pieces in the dmidecode output.

o .config (or possibly zcat /proc/config.gz)
file not found

You must have some .config file for your kernel configuration.
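
A minimal sketch of where one might look for the configuration. The three locations are assumptions: /proc/config.gz only exists if the kernel was built with CONFIG_IKCONFIG_PROC, /boot/config-<release> is a common distribution convention, and /usr/src/linux/.config applies to a self-compiled kernel. The helper name `find_kernel_config` and its optional root-prefix argument are hypothetical.

```shell
#!/bin/sh
# Print the kernel configuration from the first location where it is found.
# find_kernel_config is a hypothetical helper; $1 is an optional root prefix.
find_kernel_config() {
    root=${1:-}
    release=$(uname -r)
    if [ -r "$root/proc/config.gz" ]; then
        # Present only when the kernel was built with CONFIG_IKCONFIG_PROC.
        gzip -dc "$root/proc/config.gz"
    elif [ -r "$root/boot/config-$release" ]; then
        cat "$root/boot/config-$release"
    elif [ -r "$root/usr/src/linux/.config" ]; then
        cat "$root/usr/src/linux/.config"
    else
        echo "no kernel configuration found" >&2
        return 1
    fi
}
```

Calling `find_kernel_config | grep DEBUG` would then list the debugging options, if any of the locations exists.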

Don't know, I've only seen one stack trace ... and still hoping it won't crash again.

I thought you had multiple crashes already?

First problem, Alt-SysRq-t produces tons of output. I can only copy the last few lines. Second problem, I don't know how to use ksymoops.

You must then hook up a serial connection to your machine. ksymoops is self-explanatory: just dump your output into a file and run ksymoops < file.
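
The ksymoops step could be wrapped roughly as follows. The helper name `decode_oops` and the file names are hypothetical; the optional second argument is passed to ksymoops via its `-m` option, which names the System.map to use, and that map should belong to the kernel that crashed.

```shell
#!/bin/sh
# Decode a saved oops or stack trace with ksymoops.
# decode_oops is a hypothetical helper:
#   $1 = file holding the captured trace
#   $2 = optional System.map of the crashed kernel
decode_oops() {
    oops_file=$1
    system_map=${2:-}
    if [ ! -r "$oops_file" ]; then
        echo "cannot read $oops_file" >&2
        return 1
    fi
    if ! command -v ksymoops >/dev/null 2>&1; then
        echo "ksymoops is not installed" >&2
        return 127
    fi
    if [ -n "$system_map" ]; then
        ksymoops -m "$system_map" < "$oops_file"
    else
        # Plain form, as described above: ksymoops < file
        ksymoops < "$oops_file"
    fi
}
```

For example, `decode_oops oops.txt > oops-decoded.txt`, where oops.txt is the text captured from the serial console.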

Yes, for me the following line in the bootloader configuration helped:
append="pci=noapic"

I'll try that if it crashes again (which is likely, as I haven't changed anything kernel-related since the last crash).

The reason for this is that ACPI itself is rather flaky, and having the PCI routing go over the APIC can cause major havoc on newer motherboards. Intel is working on providing the necessary patches, but the ACPI specification is not exactly a piece of cake, and there are probably no two motherboard manufacturers that interpret its implementation in the same way. Since you've got a NIC which has shown major stability issues in the past (as also noted by others in this thread), I suspect this could be the problem.
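
One way to see how interrupts are currently routed, before and after changing the boot parameters, is to look at /proc/interrupts. This is only a sketch: it assumes the controller column contains the strings IO-APIC or XT-PIC, as it does on kernels of this era, and the helper name `irq_routing_summary` is hypothetical.

```shell
#!/bin/sh
# Count how many interrupt lines are handled by the IO-APIC versus the
# legacy XT-PIC. irq_routing_summary is a hypothetical helper; $1 defaults
# to /proc/interrupts.
irq_routing_summary() {
    interrupts_file=${1:-/proc/interrupts}
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    ioapic=$(grep -c 'IO-APIC' "$interrupts_file" || true)
    xtpic=$(grep -c 'XT-PIC' "$interrupts_file" || true)
    echo "IO-APIC lines: $ioapic, XT-PIC lines: $xtpic"
}
```

Running `cat /proc/cmdline` alongside it confirms which parameters the running kernel was actually booted with.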

Your remark that the end of the trace showed IPVS-related information is another indication that at the top of the stack there would have been a call to the networking API and then on to the NIC driver's hooks.

It's all speculation, of course, but we have given you a few suggestions on how you can narrow down the cause of those unfortunate events.

Best regards and good luck,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc