Hello, a colleague of Jarrod's here. I was assigned to look into this
issue and have identified the cause. I'm not familiar with the
underlying kernel code, so please bear with me if I seem to be mistaken
on anything.
Setup
-----
I've been testing under single-CPU kvm instances, though we see exactly
the same thing on real multi-core hardware. There are two servers with
essentially identical configurations. We're using keepalived to control
load-balancing between the two, but we see the same problem if we
configure each manually with "ip addr add" and "ipvsadm" directly. Each
server runs four HTTP(S) services: two Apache 2.2 HTTP instances, one
Apache 2.2 HTTPS instance and a custom HTTP server, and we can see the
regular keepalived probes against all of them in the logfiles.
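For reference, the manual equivalent (what we use when taking keepalived
out of the picture) looks roughly like the sketch below. The addresses
are made-up placeholders and the scheduler/weight values are only
illustrative, not our exact production settings:

  # on each director: bring up the virtual service address
  ip addr add 192.0.2.100/32 dev eth0

  # define the virtual service and add both real servers (direct routing)
  ipvsadm -A -t 192.0.2.100:443 -s wrr
  ipvsadm -a -t 192.0.2.100:443 -r 192.0.2.11:443 -g -w 250
  ipvsadm -a -t 192.0.2.100:443 -r 192.0.2.12:443 -g -w 250

  # check the table and the current weights
  ipvsadm -L -n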
The first machine to boot up becomes the keepalived master, the second
the backup. If no further action is taken, things appear to function
normally; ipvsadm reports normal running weights of around 250 for both
servers. We're running a pure IPv4 workload, by the way.
Repro
-----
If we administratively disable the master by causing our probe script to
assign a weight of 0 (but not stopping keepalived, so the master/backup
assignment remains the same), then the regular liveness probes are
usually sufficient to trigger the problem. If not, firing off 100 or so
simultaneous HTTPS requests from an external host against one of the
load-balanced services is always enough. The effect is dramatic, rapid
and predictable.
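Nothing sophisticated is needed to generate that burst; something along
the lines of the following, run from the external host, is enough (the
address below is just a placeholder for the load-balanced HTTPS service):

  # fire ~100 concurrent HTTPS requests at the virtual address
  for i in $(seq 1 100); do
      curl -sk -o /dev/null https://192.0.2.100/ &
  done
  wait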
"top" on the backup reports a consistently high softirq (si) CPU% (100%
if nothing else is using CPU, otherwise as high as it can get) which
does not recover once external load is removed. ksoftirqd stays near the
top of the top process list. Interactive response and network throughput
drop through the floor, and the machine essentially becomes unusable.
Attempting to reboot the box hangs at the end of the shutdown sequence,
after the "Rebooting..." message, and the box must be physically reset.
The ifconfig stats show that lo is passing about 1M packets per second,
compared to 50-100 packets per second under normal operation. The
external interfaces show normal numbers. tcpdumping lo shows mostly
packets from the external host to one of the load-balanced addresses;
the vast majority are FINs, with quite a few RSTs and a much smaller
number of ACKs and data packets. This capture was made 5-10 minutes
after the external load was removed, and it looks like the same small
set of FIN packets is recurring over and over.
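For anyone wanting to reproduce the observation, the numbers above come
from nothing more exotic than watching the interface counters and
sniffing the loopback device, roughly as follows (interface names here
are illustrative):

  # compare packet counters on lo against an external interface
  watch -n 1 'ifconfig lo; ifconfig eth0'

  # see what is actually circulating over loopback
  tcpdump -ni lo -c 200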
In addition, the following message is output via dmesg every minute or so:
[ 876.182162] BUG: soft lockup - CPU#0 stuck for 61s! [ipvs_syncmaster:11280]
[ 876.182162] Modules linked in: k8_edac e752x_edac edac_core ip_vs_wrr fuse
ip_vs deflate zlib_deflate ctr twofish twofish_common camellia serpent
blowfish des_generic cbc aes_x86_64 aes_generic xcbc sha256_generic
sha1_generic md5 hmac crypto_hash cryptomgr crypto_null crypto_blkcipher
crypto_algapi af_key xt_MARK iptable_mangle xt_tcpudp xt_policy
xt_comment iptable_filter ip_tables x_tables floppy nvram i2c_piix4
i2c_core ata_piix ata_generic rtc e1000 virtio_net 8139cp mii virtio_blk
virtio_pci virtio_ring virtio uhci_hcd ohci_hcd ehci_hcd [last unloaded:
nf_conntrack]
[ 876.182162] CPU 0:
[ 876.182162] Modules linked in: k8_edac e752x_edac edac_core ip_vs_wrr
fuse ip_vs deflate zlib_deflate ctr twofish twofish_common camellia
serpent blowfish des_generic cbc aes_x86_64 aes_generic xcbc
sha256_generic sha1_generic md5 hmac crypto_hash cryptomgr crypto_null
crypto_blkcipher crypto_algapi af_key xt_MARK iptable_mangle xt_tcpudp
xt_policy xt_comment iptable_filter ip_tables x_tables floppy nvram
i2c_piix4 i2c_core ata_piix ata_generic rtc e1000 virtio_net 8139cp mii
virtio_blk virtio_pci virtio_ring virtio uhci_hcd ohci_hcd ehci_hcd
[last unloaded: nf_conntrack]
[ 876.182162] Pid: 11280, comm: ipvs_syncmaster Not tainted
2.6.27-rc4-30.0.jsullivan.CL1 #1
[ 876.182162] RIP: 0010:[<ffffffff80484740>] [<ffffffff80484740>]
skb_push+0x0/0x50
[ 876.182162] RSP: 0000:ffffffff80769c20 EFLAGS: 00000246
[ 876.182162] RAX: ffff880016921010 RBX: ffffffff80769c58 RCX: 0000000000000000
[ 876.182162] RDX: 0000000000000000 RSI: 000000000000000e RDI: ffff88005e86c0c0
[ 876.182162] RBP: ffffffff80769ba0 R08: ffffffff804b3d50 R09: ffffffffa013d9f0
[ 876.182162] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8020c406
[ 876.182162] R13: ffffffff80769ba0 R14: ffff88005e86c0c0 R15: 000000000000000e
[ 876.182162] FS: 0000000000000000(0000) GS:ffffffff806c7600(0000)
knlGS:0000000000000000
[ 876.182162] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 876.182162] CR2: 00007fe50dff0020 CR3: 000000001a833000 CR4: 00000000000006e0
[ 876.182162] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 876.182162] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 876.182162]
[ 876.182162] Call Trace:
[ 876.182162] <IRQ> [<ffffffff804b3e6f>] ? ip_finish_output+0x11f/0x2c0
[ 876.182162] [<ffffffff804b4361>] ? ip_output+0x71/0xb0
[ 876.182162] [<ffffffffa0132071>] ? ip_vs_dr_xmit+0xc1/0x1d0 [ip_vs]
[ 876.182162] [<ffffffffa012c557>] ? ip_vs_in+0x1a7/0x360 [ip_vs]
[ 876.182162] [<ffffffff804a917f>] ? nf_iterate+0x5f/0x90
[ 876.182162] [<ffffffff804af970>] ? ip_local_deliver_finish+0x0/0x210
[ 876.182162] [<ffffffff804a9213>] ? nf_hook_slow+0x63/0xf0
[ 876.182162] [<ffffffff804af970>] ? ip_local_deliver_finish+0x0/0x210
[ 876.182162] [<ffffffff804a917f>] ? nf_iterate+0x5f/0x90
[ 876.182162] [<ffffffff804b0028>] ? ip_local_deliver+0x78/0x90
[ 876.182162] [<ffffffff804af717>] ? ip_rcv_finish+0x157/0x3b0
[ 876.182162] [<ffffffff804afeb6>] ? ip_rcv+0x1f6/0x2f0
[ 876.182162] [<ffffffff8048ad82>] ? netif_receive_skb+0x312/0x520
[ 876.182162] [<ffffffff804c9a30>] ? tcp_write_timer+0x0/0x680
[ 876.182162] [<ffffffff8048d947>] ? process_backlog+0x77/0xe0
[ 876.182162] [<ffffffff8048d0c9>] ? net_rx_action+0xf9/0x1b0
[ 876.182162] [<ffffffff80239d2a>] ? __do_softirq+0x7a/0xf0
[ 876.182162] [<ffffffff8020c9bc>] ? call_softirq+0x1c/0x30
[ 876.182162] <EOI> [<ffffffff8020e35d>] ? do_softirq+0x3d/0x80
[ 876.182162] [<ffffffff8023a1ce>] ? local_bh_enable+0x9e/0xb0
[ 876.182162] [<ffffffff804abc17>] ? __ip_route_output_key+0x177/0xa90
[ 876.182162] [<ffffffff804b2556>] ? ip_cork_release+0x36/0x50
[ 876.182162] [<ffffffff804b3c52>] ? ip_push_pending_frames+0x2e2/0x3e0
[ 876.182162] [<ffffffff804ac561>] ? ip_route_output_flow+0x31/0x2b0
[ 876.182162] [<ffffffff804d1edf>] ? udp_sendmsg+0x56f/0x690
[ 876.182162] [<ffffffff804d8815>] ? inet_sendmsg+0x45/0x80
[ 876.182162] [<ffffffffa0133940>] ? sync_thread_master+0x0/0x200 [ip_vs]
[ 876.182162] [<ffffffff8047deef>] ? sock_sendmsg+0xdf/0x110
[ 876.182162] [<ffffffff8024bcaa>] ? enqueue_hrtimer+0x7a/0x100
[ 876.182162] [<ffffffff8024c8cc>] ? ktime_get_ts+0x4c/0x60
[ 876.182162] [<ffffffff80249360>] ? autoremove_wake_function+0x0/0x40
[ 876.182162] [<ffffffff8022d274>] ? hrtick_start_fair+0x154/0x170
[ 876.182162] [<ffffffff8022d31f>] ? pick_next_task_fair+0x8f/0xb0
[ 876.182162] [<ffffffff8047f594>] ? kernel_sendmsg+0x34/0x50
[ 876.182162] [<ffffffffa0132e87>] ? ip_vs_send_sync_msg+0x57/0x70 [ip_vs]
[ 876.182162] [<ffffffffa01339b0>] ? sync_thread_master+0x70/0x200 [ip_vs]
[ 876.182162] [<ffffffff80248f0d>] ? kthread+0x4d/0x80
[ 876.182162] [<ffffffff8020c659>] ? child_rip+0xa/0x11
[ 876.182162] [<ffffffff80248ec0>] ? kthread+0x0/0x80
[ 876.182162] [<ffffffff8020c64f>] ? child_rip+0x0/0x11
[ 876.182162]
Bug Location
------------
The bug is present in the latest 2.6.35 kernel and the head of the
lvs-test-2.6 branch. Using git-bisect I identified the following
commit within the 2.6.28 branch as the culprit:
commit f2428ed5e7bc89c7716ead22748cb5d076e204f0
Author: Simon Horman <horms@xxxxxxxxxxxx>
Date: Fri Sep 5 11:17:14 2008 +1000
ipvs: load balance ipv6 connections from a local process
In particular, backing out this hunk (the last one in the patch) makes
the problem go away:
diff --git a/net/ipv4/ipvs/ip_vs_core.c b/net/ipv4/ipvs/ip_vs_core.c
--- a/net/ipv4/ipvs/ip_vs_core.c
+++ b/net/ipv4/ipvs/ip_vs_core.c
@@ -1281,9 +1274,7 @@ ip_vs_in(unsigned int hooknum, struct sk_buff *skb,
          *  Big tappo: only PACKET_HOST, including loopback for local client
          *  Don't handle local packets on IPv6 for now
          */
-        if (unlikely(skb->pkt_type != PACKET_HOST ||
-                     (af == AF_INET6 || (skb->dev->flags & IFF_LOOPBACK ||
-                                         skb->sk)))) {
+        if (unlikely(skb->pkt_type != PACKET_HOST)) {
                 IP_VS_DBG_BUF(12, "packet type=%d proto=%d daddr=%s ignored\n",
                               skb->pkt_type,
                               iph.protocol,
This hunk is within the function ip_vs_in(), which, as can be seen, is
high up in the oops backtrace above.
Now, I don't understand the codebase anywhere near well enough to know
exactly what each of the four original sub-conditions is trying to
achieve (except the af check; that much is obvious). However, putting
back those sub-conditions one by one shows that the IFF_LOOPBACK check
appears to be the important one here: if I back out that hunk and then
remove just the af check, or the af check together with the skb->sk
check, the bug again disappears.
I suspect that removing the last two conditions causes this function to
attempt to handle more types of packet than it is prepared for,
re-injecting some particular sub-type, which it then gets to process
again, re-inject again, process again... essentially causing an infinite
loop within that kernel thread.
The ipvs directory has been moved since that commit, so here is a patch
against 2.6.35 HEAD which restores the final two conditions:
diff -r -U3 a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
--- a/net/netfilter/ipvs/ip_vs_core.c 2008-09-08 00:34:43.000000000 +0100
+++ b/net/netfilter/ipvs/ip_vs_core.c 2010-09-22 16:44:43.402466098 +0100
@@ -1274,7 +1274,9 @@
          *  Big tappo: only PACKET_HOST, including loopback for local client
          *  Don't handle local packets on IPv6 for now
          */
-        if (unlikely(skb->pkt_type != PACKET_HOST)) {
+        if (unlikely(skb->pkt_type != PACKET_HOST ||
+                     (skb->dev->flags & IFF_LOOPBACK ||
+                      skb->sk))) {
                 IP_VS_DBG_BUF(12, "packet type=%d proto=%d daddr=%s ignored\n",
                               skb->pkt_type,
                               iph.protocol,
Note that the comment implies the original check was about "local packets
on IPv6" and that local IPv4 packets should still have been handled;
however, all the sub-conditions are ORed together, so they applied
equally to IPv4 traffic. I therefore assume that removing just the af
check would have been enough to handle IPv6 in the same way as IPv4.
John