Re: [lvs-users] Heartbeat and ldirector taking a long time to change ove

To: lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [lvs-users] Heartbeat and ldirector taking a long time to change over.
From: Eric Renfro <erenfro@xxxxxxxxxxx>
Date: Tue, 22 Dec 2009 09:28:58 -0500
I believe I may have found the cause of the problem, but I do not know
how to resolve it.

The servers in question are not in our office; they are colocated and
reachable only remotely. The hosting provider told me the problem is
most likely the ARP cache. The ARP cache lasts for 5 minutes, roughly
the same length of time the conflict lasts, which would also explain
why the interrupts go crazy during the initial changeover. They claim
that a gratuitous ARP should fix that, but the logs show heartbeat
already claiming to send one.
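
For anyone who wants to check what such a packet should look like on the
wire, here is a minimal sketch that only builds the bytes of a gratuitous
ARP request; it does not send anything. The MAC address and the
192.0.2.13 address are placeholders I made up for illustration, not the
real values from these servers. The defining feature is that the sender
and target protocol addresses are both the VIP, and the frame goes to
the Ethernet broadcast address.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Return a 42-byte Ethernet frame carrying a gratuitous ARP request."""
    hw = bytes.fromhex(mac.replace(":", ""))   # 6-byte hardware address
    addr = socket.inet_aton(ip)                # 4-byte IPv4 address
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    eth = broadcast + hw + struct.pack("!H", 0x0806)

    # ARP header: htype 1 (Ethernet), ptype 0x0800 (IPv4),
    # hlen 6, plen 4, opcode 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Sender MAC + sender IP, then target MAC (zeroed) + target IP.
    # In a gratuitous ARP the sender IP and target IP are the same address.
    arp += hw + addr + b"\x00" * 6 + addr
    return eth + arp


# Placeholder values, assumed for illustration only.
frame = build_gratuitous_arp("02:00:00:00:00:01", "192.0.2.13")
print(len(frame), frame.hex())
```

Comparing this layout against a packet capture taken during a failover
would show whether the announcement actually leaves the new active node
and whether it carries that node's MAC address.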

*Eric Renfro*
Software Developer

EZYield.com, Inc
125 Excelsior Pkwy
Winter Springs, FL 32708

407-629-0900 ext 832


Eric Renfro wrote:
> Hello,
>
> I'm trying to resolve a problem I have setting up a pair of LVS
> load-balancing servers using heartbeat and ldirectord under Gentoo.
>
> I am using heartbeat 2.0.8 on two servers. The heartbeat and
> ldirectord setup is not very extensive, but it should be working
> better than it is. I will provide complete configurations, minus the
> IPs themselves, but to explain the problem up front: the issue I'm
> having is rather strange.
>
> Our servers are simply named network1 and network2, which I will use
> to explain the issue.
>
> I see the problem when I shut down the heartbeat process on either
> network1 or network2. It successfully releases the IP and passes it on
> to the other node to take over. It does this quickly, as expected;
> however, when it brings up ldirectord, the problems begin. We have two
> clusters of three web servers each, on both the http and https ports.
> On network1, it immediately brings up the first cluster that was set
> up, with all three RIP nodes active but inaccessible. All the others
> are weighted to 0 under a weight-based setup; otherwise they are
> non-existent and traffic goes to the fallback server RIP initially.
> For about 5-10 minutes the heartbeat+ldirectord server that took over
> has heavy CPU load, with ksoftirqd/0 and ksoftirqd/1 being the
> culprits. atop confirms this, showing 3 IRQs at 200%, 100%, and 100%
> for those 5-10 minutes.
>
> Once that clears up and things go back to normal, the ipvs routes show
> up almost instantaneously and, furthermore, actually work.
>
> I do not know what is causing this and would like some help resolving
> it.
>
> Following are the configuration files used. The virtual IPs are
> replaced by xx.xx.101.13 and xx.xx.101.16, because there are two VIPs
> involved. The related RIPs are masked similarly: Cluster 1
> (xx.xx.101.227, xx.xx.101.226, xx.xx.101.224) and Cluster 2
> (xx.xx.108.102, xx.xx.101.183, xx.xx.101.184), for 6 servers in total
> across two clusters. The actual network servers' IPs are network1
> (xx.xx.101.153) and network2 (xx.xx.108.203).
>
> ha.d/haresources:
>
> network1.ourserver.com   xx.xx.101.13/24/eth0 ldirectord
> network1.ourserver.com    xx.xx.101.16/24/eth0 ldirectord
>
>
> ha.cf:
>
> logfacility     local0
> keepalive 2
> deadtime 30
> warntime 10
> bcast eth1
> auto_failback on
> node    network1.ourserver.com
> node    network2.ourserver.com
>
>
> ldirectord.cf:
>
> checktimeout=3
> checkinterval=5
> #negotiatetimeout=5
> autoreload=yes
> logfile="local0"
> quiescent=no
>
> virtual = xx.xx.101.13:80
>         fallback = xx.xx.101.13:80 gate
>         real = xx.xx.101.227:80 gate
>         real = xx.xx.101.226:80 gate
>         real = xx.xx.101.224:80 gate
>         scheduler = lc
>         persistent = 7200
>         protocol = tcp
>         service = http
>         httpmethod = HEAD
>         request = "/"
>         checktype = negotiate
>
> virtual = xx.xx.101.13:443
>         fallback = xx.xx.101.13:443 gate
>         real = xx.xx.101.227:443 gate
>         real = xx.xx.101.226:443 gate
>         real = xx.xx.101.224:443 gate
>         scheduler = lc
>         persistent = 7200
>         protocol = tcp
>         service = https
>         httpmethod = HEAD
>         request = "/"
>         checktype = negotiate
>
> virtual = xx.xx.101.16:80
>         fallback = xx.xx.101.16:80 gate
>         real = xx.xx.108.102:80 gate
>         real = xx.xx.101.183:80 gate
>         real = xx.xx.101.184:80 gate
>         scheduler = lc
>         persistent = 7200
>         protocol = tcp
>         service = http
>         httpmethod = HEAD
>         request = "/"
>         checktype = negotiate
>
> virtual = xx.xx.101.16:443
>         fallback = xx.xx.101.16:443 gate
>         real = xx.xx.108.102:443 gate
>         real = xx.xx.101.183:443 gate
>         real = xx.xx.101.184:443 gate
>         scheduler = lc
>         persistent = 7200
>         protocol = tcp
>         service = https
>         httpmethod = HEAD
>         request = "/"
>         checktype = negotiate
>
>
> Here is a log of what happens when network2 is the active router and
> gets shut down while network1's heartbeat is in standby mode, waiting
> to take over:
>
> Dec 21 05:43:03 network1 heartbeat: [26940]: info: Received shutdown
> notice from 'network2.ourserver.com'.
> Dec 21 05:43:03 network1 heartbeat: [26940]: info: Resources being
> acquired from network2.ourserver.com.
> Dec 21 05:43:03 network1 heartbeat: [26959]: info: acquire all HA
> resources (standby).
> Dec 21 05:43:03 network1 ResourceManager[26973]: info: Acquiring
> resource group: network1.ourserver.com xx.xx.101.13/24/eth0 ldirectord
> Dec 21 05:43:03 network1 IPaddr[27020]: INFO:  Resource is stopped
> Dec 21 05:43:03 network1 IPaddr[27021]: INFO:  Resource is stopped
> Dec 21 05:43:03 network1 ResourceManager[26973]: info: Running
> /etc/ha.d/resource.d/IPaddr xx.xx.101.13/24/eth0 start
> Dec 21 05:43:03 network1 IPaddr[27163]: INFO: Using calculated netmask
> for xx.xx.101.13: 255.255.255.0
> Dec 21 05:43:03 network1 IPaddr[27147]: INFO:  Resource is stopped
> Dec 21 05:43:03 network1 IPaddr[27163]: DEBUG: Using calculated
> broadcast for xx.xx.101.13: xx.xx.101.255
> Dec 21 05:43:03 network1 heartbeat: [26961]: info: Local Resource
> acquisition completed.
> Dec 21 05:43:03 network1 heartbeat: [26940]: info: Initial resource
> acquisition complete (T_RESOURCES(us))
> Dec 21 05:43:03 network1 IPaddr[27163]: INFO: eval /sbin/ifconfig eth0:0
> xx.xx.101.13 netmask 255.255.255.0 broadcast xx.xx.101.255
> Dec 21 05:43:03 network1 IPaddr[27163]: DEBUG: Sending Gratuitous Arp
> for xx.xx.101.13 on eth0:0 [eth0]
> Dec 21 05:43:03 network1 IPaddr[27120]: INFO:  Success
> Dec 21 05:43:03 network1 ldirectord[27273]: Invoking ldirectord invoked
> as: /etc/ha.d/resource.d/ldirectord status
> Dec 21 05:43:03 network1 ldirectord[27273]: ldirectord is stopped for
> /etc/ha.d/ldirectord.cf
> Dec 21 05:43:03 network1 ldirectord[27273]: Exiting with exit_status 3:
> Exiting from ldirectord status
> Dec 21 05:43:03 network1 ResourceManager[26973]: info: Running
> /etc/ha.d/resource.d/ldirectord  start
> Dec 21 05:43:03 network1 ldirectord[27290]: Invoking ldirectord invoked
> as: /etc/ha.d/resource.d/ldirectord start
> Dec 21 05:43:03 network1 ldirectord[27290]: Starting Linux Director
> v1.186 as daemon
> Dec 21 05:43:03 network1 ldirectord[27292]: Added virtual server:
> xx.xx.101.13:80
> Dec 21 05:43:03 network1 ldirectord[27292]: Added virtual server:
> xx.xx.101.13:443
> Dec 21 05:43:03 network1 ldirectord[27292]: Added virtual server:
> xx.xx.101.16:80
> Dec 21 05:43:03 network1 ldirectord[27292]: Added virtual server:
> xx.xx.101.16:443
> Dec 21 05:43:03 network1 ldirectord[27292]: Added fallback server:
> xx.xx.101.13:80 (xx.xx.101.13:80) (Weight set to 1)
> Dec 21 05:43:03 network1 ldirectord[27292]: Added fallback server:
> xx.xx.101.13:443 (xx.xx.101.13:443) (Weight set to 1)
> Dec 21 05:43:03 network1 ResourceManager[27299]: info: Acquiring
> resource group: network1.ourserver.com xx.xx.101.16/24/eth0 ldirectord
> Dec 21 05:43:03 network1 ldirectord[27292]: Added fallback server:
> xx.xx.101.16:80 (xx.xx.101.16:80) (Weight set to 1)
> Dec 21 05:43:03 network1 ldirectord[27292]: Added fallback server:
> xx.xx.101.16:443 (xx.xx.101.16:443) (Weight set to 1)
> Dec 21 05:43:03 network1 IPaddr[27336]: INFO:  Resource is stopped
> Dec 21 05:43:03 network1 ResourceManager[27299]: info: Running
> /etc/ha.d/resource.d/IPaddr xx.xx.101.16/24/eth0 start
> Dec 21 05:43:03 network1 IPaddr[27423]: INFO: Using calculated netmask
> for xx.xx.101.16: 255.255.255.0
> Dec 21 05:43:03 network1 IPaddr[27423]: DEBUG: Using calculated
> broadcast for xx.xx.101.16: xx.xx.101.255
> Dec 21 05:43:03 network1 IPaddr[27423]: INFO: eval /sbin/ifconfig eth0:1
> xx.xx.101.16 netmask 255.255.255.0 broadcast xx.xx.101.255
> Dec 21 05:43:03 network1 IPaddr[27423]: DEBUG: Sending Gratuitous Arp
> for xx.xx.101.16 on eth0:1 [eth0]
> Dec 21 05:43:03 network1 IPaddr[27402]: INFO:  Success
> Dec 21 05:43:03 network1 ldirectord[27506]: Invoking ldirectord invoked
> as: /etc/ha.d/resource.d/ldirectord status
> Dec 21 05:43:03 network1 ldirectord[27506]: ldirectord for
> /etc/ha.d/ldirectord.cf is running with pid: 27292
> Dec 21 05:43:03 network1 ldirectord[27506]: Exiting from ldirectord status
> Dec 21 05:43:03 network1 ResourceManager[27299]: info: Running
> /etc/ha.d/resource.d/ldirectord  start
> Dec 21 05:43:04 network1 ldirectord[27523]: Invoking ldirectord invoked
> as: /etc/ha.d/resource.d/ldirectord start
> Dec 21 05:43:04 network1 heartbeat: [26959]: info: all HA resource
> acquisition completed (standby).
> Dec 21 05:43:04 network1 heartbeat: [26940]: info: Standby resource
> acquisition done [all].
> Dec 21 05:43:04 network1 harc[27527]: info: Running
> /etc/ha.d/rc.d/status status
> Dec 21 05:43:04 network1 mach_down[27537]: info:
> /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
> Dec 21 05:43:04 network1 mach_down[27537]: info: mach_down takeover
> complete for node network2.ourserver.com.
> Dec 21 05:43:04 network1 heartbeat: [26940]: info: mach_down takeover
> complete.
> Dec 21 05:43:04 network1 harc[27565]: info: Running
> /etc/ha.d/rc.d/ip-request-resp ip-request-resp
> Dec 21 05:43:04 network1 ip-request-resp[27565]: received
> ip-request-resp xx.xx.101.13/24/eth0 OK yes
> Dec 21 05:43:04 network1 ResourceManager[27580]: info: Acquiring
> resource group: network1.ourserver.com xx.xx.101.13/24/eth0 ldirectord
> Dec 21 05:43:04 network1 IPaddr[27604]: INFO:  Running OK
> Dec 21 05:43:04 network1 ldirectord[27650]: Invoking ldirectord invoked
> as: /etc/ha.d/resource.d/ldirectord status
> Dec 21 05:43:04 network1 ldirectord[27650]: ldirectord for
> /etc/ha.d/ldirectord.cf is running with pid: 27292
> Dec 21 05:43:04 network1 ldirectord[27650]: Exiting from ldirectord status
> Dec 21 05:43:04 network1 ResourceManager[27580]: info: Running
> /etc/ha.d/resource.d/ldirectord  start
> Dec 21 05:43:04 network1 ldirectord[27667]: Invoking ldirectord invoked
> as: /etc/ha.d/resource.d/ldirectord start
> Dec 21 05:43:04 network1 harc[27671]: info: Running
> /etc/ha.d/rc.d/ip-request-resp ip-request-resp
> Dec 21 05:43:04 network1 ip-request-resp[27671]: received
> ip-request-resp xx.xx.101.16/24/eth0 OK yes
> Dec 21 05:43:04 network1 ResourceManager[27686]: info: Acquiring
> resource group: network1.ourserver.com xx.xx.101.16/24/eth0 ldirectord
> Dec 21 05:43:04 network1 IPaddr[27710]: INFO:  Running OK
> Dec 21 05:43:04 network1 ldirectord[27756]: Invoking ldirectord invoked
> as: /etc/ha.d/resource.d/ldirectord status
> Dec 21 05:43:04 network1 ldirectord[27756]: ldirectord for
> /etc/ha.d/ldirectord.cf is running with pid: 27292
> Dec 21 05:43:04 network1 ldirectord[27756]: Exiting from ldirectord status
> Dec 21 05:43:04 network1 ResourceManager[27686]: info: Running
> /etc/ha.d/resource.d/ldirectord  start
> Dec 21 05:43:05 network1 ldirectord[27773]: Invoking ldirectord invoked
> as: /etc/ha.d/resource.d/ldirectord start
> Dec 21 05:43:10 network1 ldirectord[27292]: Added real server:
> xx.xx.101.227:80 (xx.xx.101.13:80) (Weight set to 1)
> Dec 21 05:43:10 network1 ldirectord[27292]: Deleted fallback server:
> xx.xx.101.13:80 (xx.xx.101.13:80)
> Dec 21 05:43:10 network1 ldirectord[27292]: Added real server:
> xx.xx.101.226:80 (xx.xx.101.13:80) (Weight set to 1)
> Dec 21 05:43:11 network1 ldirectord[27292]: Added real server:
> xx.xx.101.224:80 (xx.xx.101.13:80) (Weight set to 1)
> Dec 21 05:43:34 network1 heartbeat: [26940]: WARN: node
> network2.ourserver.com: is dead
> Dec 21 05:43:34 network1 heartbeat: [26940]: info: Dead node
> network2.ourserver.com gave up resources.
> Dec 21 05:43:34 network1 heartbeat: [26940]: info: Link
> network2.ourserver.com:eth1 dead.
>
> Dec 21 05:46:21 network1 ldirectord[27292]: Added real server:
> xx.xx.101.226:443 (xx.xx.101.13:443) (Weight set to 1)
> Dec 21 05:46:21 network1 ldirectord[27292]: Deleted fallback server:
> xx.xx.101.13:443 (xx.xx.101.13:443)
> Dec 21 05:46:21 network1 ldirectord[27292]: Added real server:
> xx.xx.101.224:443 (xx.xx.101.13:443) (Weight set to 1)
> Dec 21 05:46:21 network1 ldirectord[27292]: Added real server:
> xx.xx.108.102:80 (xx.xx.101.16:80) (Weight set to 1)
> Dec 21 05:46:21 network1 ldirectord[27292]: Deleted fallback server:
> xx.xx.101.16:80 (xx.xx.101.16:80)
> Dec 21 05:46:21 network1 ldirectord[27292]: Added real server:
> xx.xx.101.183:80 (xx.xx.101.16:80) (Weight set to 1)
> Dec 21 05:46:21 network1 ldirectord[27292]: Added real server:
> xx.xx.101.184:80 (xx.xx.101.16:80) (Weight set to 1)
> Dec 21 05:46:21 network1 ldirectord[27292]: Added real server:
> xx.xx.108.102:443 (xx.xx.101.16:443) (Weight set to 1)
> Dec 21 05:46:21 network1 ldirectord[27292]: Deleted fallback server:
> xx.xx.101.16:443 (xx.xx.101.16:443)
> Dec 21 05:46:21 network1 ldirectord[27292]: Added real server:
> xx.xx.101.183:443 (xx.xx.101.16:443) (Weight set to 1)
> Dec 21 05:46:21 network1 ldirectord[27292]: Added real server:
> xx.xx.101.184:443 (xx.xx.101.16:443) (Weight set to 1)
>
>   
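
To put a number on the stall in the log above, here is a small sketch
that measures the gaps between three of the quoted log lines: the first
gratuitous ARP, the last real server added on port 80 of the first VIP,
and the first real server added on port 443. The lines are copied from
the log with the mail wrapping rejoined. The year 2009 is an assumption
taken from the date of this message, since syslog timestamps carry no
year.

```python
from datetime import datetime

# Three lines taken from the quoted log, with wrapped lines rejoined.
LOG = """\
Dec 21 05:43:03 network1 IPaddr[27163]: DEBUG: Sending Gratuitous Arp for xx.xx.101.13 on eth0:0 [eth0]
Dec 21 05:43:11 network1 ldirectord[27292]: Added real server: xx.xx.101.224:80 (xx.xx.101.13:80) (Weight set to 1)
Dec 21 05:46:21 network1 ldirectord[27292]: Added real server: xx.xx.101.226:443 (xx.xx.101.13:443) (Weight set to 1)
"""


def stamp(line: str, year: int = 2009) -> datetime:
    """Parse the 15-character syslog timestamp at the start of a line."""
    # The year is not in the log; 2009 is assumed from the message date.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def gaps(lines):
    """Return the number of seconds between each pair of consecutive lines."""
    times = [stamp(line) for line in lines]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


lines = LOG.splitlines()
delays = gaps(lines)
for line, delay in zip(lines[1:], delays):
    print(f"+{delay:.0f}s  {line[16:80]}")
```

On these three lines the first VIP's port 80 real servers are all in
place 8 seconds after the gratuitous ARP, while the next real server
does not appear for another 190 seconds.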

_______________________________________________
Please read the documentation before posting - it's available at:
http://www.linuxvirtualserver.org/

LinuxVirtualServer.org mailing list - lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Send requests to lvs-users-request@xxxxxxxxxxxxxxxxxxxxxx
or go to http://lists.graemef.net/mailman/listinfo/lvs-users