OK, I have found a bit more information from my debugging,
and it seems that Horms already knows about it:
http://marc.info/?l=linux-netdev&m=118040107213444&w=2
Basically, I adjust the weights a lot too; not as often as
twice per second, but still quite often. I recompiled the
IPVS modules with a bit more debugging, and every time my
system crashed I got the same debug output:
Jul 12 15:43:15 atropos kernel: Enter: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 885
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 886
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 891
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 897
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 906
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 908
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 910
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 913
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 916
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 918
Jul 12 15:43:15 atropos kernel: Leave: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 919
Jul 12 15:43:15 atropos kernel: Enter: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 885
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 886
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 891
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 897
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 906
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 908
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest, net/ipv4/ipvs/ip_vs_ctl.c line 910
The code after line 910 reads:

while (atomic_read(&svc->usecnt) > 1) {};

Every other busy-wait in the code reads:

IP_VS_WAIT_WHILE(atomic_read(&svc->usecnt) > 1);

which is basically the same thing, except for a cpu_relax().
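
For reference, the macro is defined near the top of
net/ipv4/ipvs/ip_vs_ctl.c; if I am reading my tree correctly, it is
just the same spin loop with the relax added:

/* identical busy-wait, but with a cpu_relax() in the loop body */
#define IP_VS_WAIT_WHILE(expr)	while (expr) { cpu_relax(); }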
At the moment I am testing my server with the cpu_relax() version of
the loop in ip_vs_edit_dest; so far it has not crashed, and it has been
directing traffic for quite a bit longer than was previously possible.
The only differences between this server and the old server (which
didn't have any problems) are:
- SMP (4 cores) vs. single core
- 64-bit vs. 32-bit
- 2.6.21.5 vs. 2.6.20.4 (but I do not see any changes in ip_vs_ctl.c)
In my first mail I blamed the 64-bit/32-bit difference, but right now I
am thinking more of an SMP issue. Unfortunately I lack the
kernel-hacking skills to say why, or why that cpu_relax() in the while
loop helps so much.
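
My best guess (and it is only a guess) has to do with what cpu_relax()
actually does. On x86-64 it boils down to something like the following
(paraphrased from include/asm-x86_64/processor.h of this kernel
generation, so treat the exact definition as an assumption):

/* "rep;nop" is the PAUSE instruction: a hint to the CPU that this is
 * a spin-wait loop, so it can back off and let other cores or
 * hyperthreads make progress.  The "memory" clobber also acts as a
 * compiler barrier, forcing svc->usecnt to be re-read from memory on
 * every iteration instead of being cached in a register. */
static inline void cpu_relax(void)
{
	__asm__ __volatile__("rep;nop" : : : "memory");
}

atomic_read() is a volatile read, so even the empty loop should re-read
the counter, but perhaps a tight spin without the PAUSE hint behaves
much worse on SMP. I will leave the real explanation to people who know
this code.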
Well, hopefully Horms understands it better than I do ;)
-kees
--- linux-2.6.22.1/net/ipv4/ipvs/ip_vs_ctl.c	2007-07-12 19:41:27.000000000 +0200
+++ old/net/ipv4/ipvs/ip_vs_ctl.c	2007-07-10 20:56:30.000000000 +0200
@@ -909,8 +909,8 @@
 	write_lock_bh(&__ip_vs_svc_lock);
 	/* Wait until all other svc users go away */
-	IP_VS_WAIT_WHILE(atomic_read(&svc->usecnt) > 1);
-
+	while (atomic_read(&svc->usecnt) > 1) {};
+
 	/* call the update_service, because server weight may be changed */
 	svc->scheduler->update_service(svc);
> -----Original Message-----
> From: lvs-users-bounces@xxxxxxxxxxxxxxxxxxxxxx
> [mailto:lvs-users-bounces@xxxxxxxxxxxxxxxxxxxxxx] On Behalf
> Of Kees Hoekzema
> Sent: Wednesday, July 11, 2007 14:28
> To: 'LinuxVirtualServer.org users mailing list.'
> Subject: [lvs-users] LVS-NAT wrr crashing on 64-bits
>
>
> Well, apparently it didn't have anything to do with the NAT issue
> Cristi was having, so let's split those two problems, as it would
> seem I have a different problem than he does ;).
>
> -kees
>
>
>
> > -----Original Message-----
> > From: lvs-users-bounces@xxxxxxxxxxxxxxxxxxxxxx
> > [mailto:lvs-users-bounces@xxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Kees
> > Hoekzema
> > Sent: Wednesday, July 11, 2007 14:20
> > To: 'LinuxVirtualServer.org users mailing list.'
> > Subject: Re: [lvs-users] LVS-NAT issue
> >
> >
> >
> > > -----Original Message-----
> > > The problem is as follows: the setup works randomly, from 15 mins
> > > to 1-2 hours, flawlessly I might add, serving content from both
> > > backend machines. However, it randomly stops doing that. When
> > > that happens, I cannot ping the VIP from the outside, only from
> > > within the LAN (I have a backup LB, not configured yet; I plan to
> > > use UltraMonkey later on). I checked logs and tcpdumped, but with
> > > no clue as to what is causing this. Some input would be really
> > > appreciated.
> >
> > Now I know this is an old message, and this issue has been
> > 'resolved' by not using LVS-NAT anymore, but recently I had a
> > similar problem.
> >
> > Let me explain my setup first: I have two loadbalancers, which use
> > wrr to direct traffic to 5 realservers. A small script on the
> > loadbalancers checks the realservers periodically and requests some
> > numbers from them. Based on those numbers, the weight of each
> > server is adjusted using 'ipvsadm --edit-server'.
> >
> > The setup I described above worked flawlessly for years (well,
> > after an iptables problem and a small patch to the wrr code) until
> > my traffic could spike so high that the loadbalancers were no
> > longer able to handle it properly. So we decided to upgrade the
> > loadbalancers with new hardware.
> >
> > The new hardware runs on a quad-core 64-bit Xeon, while the old
> > machine had a 32-bit Celeron, so quite an upgrade. More notably,
> > the new server was able to process 950 Mbit/s with only 20% CPU
> > time, while the old one was eating up more than 90% CPU time at
> > around 60 Mbit/s.
> >
> > So we went from a 32-bit OS to a 64-bit OS. We tested the hardware
> > and it seemed stable, so next we put the new machines into
> > production, and after several hours they would crash and would not
> > respond to anything, much like Cristi experienced before. So we
> > pulled them out, put the old loadbalancers back in, and started
> > testing a bit more.
> >
> > After running and writing several programs, I finally got the
> > loadbalancers to crash again, this time in our testing environment.
> > To achieve a crash, I had to generate enough traffic from different
> > IPs and ports through the IPVS services while running 'ipvsadm
> > --edit-server' on the loadbalancer. Running the traffic through
> > iptables wouldn't crash the server, nor would one client IP
> > hammering the services from different ports.
> >
> > So I started debugging a lot more, and I am still working on it;
> > the problem is that the server freezes totally, so I can't look
> > anything up. But it seems that changing the weights on the server
> > will crash your system if you run it on a 64-bit OS. Our 'old'
> > 32-bit environment still happily changes the weights of the servers
> > every couple of seconds without crashing. So there is a problem
> > somewhere in the ipvsadm program or in the kernel code; I'll keep
> > debugging.
> >
> > What I want to know is if there is anyone out there with:
> > 1) a 64-bit installation,
> > 2) using wrr,
> > 3) changing the weights on the server while it is getting heavy
> > traffic from multiple IP:ports,
> > who is experiencing the same problem as I do: a freezing server
> > which needs a cold reset.
> >
> > For the moment, I'll just keep looking at traces to see if I can
> > spot anything in particular, and I hope someone has a suggestion as
> > to where to look / which debugger to use.
> >
> > -kees