Re: Problems with FOS (fwd)

To:	lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Subject:	Re: Problems with FOS (fwd)
From:	John Cronin <jsc3@xxxxxxxxxxxxx>
Date:	Wed, 25 Oct 2000 10:48:09 -0400 (EDT)
>From jsc3 Wed Oct 25 10:37:10 2000
Subject: Re: Problems with FOS
To: ywteh@xxxxxxxxxxxxxx (Teh Yong Wei)
Date: Wed, 25 Oct 2000 10:37:10 -0400 (EDT)
In-Reply-To: <200010250215.KAA29010@xxxxxxxxxxxxxx> from "Teh Yong Wei" at Oct 
25, 2000 10:15:16 AM

> Thank you for your reply.
> 
> > > But when I plugged the 2 NICs back into gretel, the primary node
> > > (gretel) didn't become active. Indeed, I cannot access the page
> > > 10.0.0.38 anymore. Why does this happen?
> > 
> > I am not sure.  Can you use tcpdump to watch what is happening?
> > What does ifconfig show?  Do both gretel and gretelf have the VIP
> > (10.0.0.38) up?  If so, that is going to be a problem, for sure.
> > Perhaps when you reconnect gretel, it doesn't realize it failed and it
> > is just resuming.  I don't have enough experience with Piranha
> > to know what it does in this case.
> 
> This is what I got from /var/log/messages at gretel:
> ==================================
>  Oct 25 09:58:07 gretel nanny[12924]: running command  "rsh" "10.0.1.4"
> "uptime"

[snip]

> What does this mean?

This means nanny is using rsh to check on the health of your realservers.
It is finding them both to be healthy.
 
> Here is the /var/log/messages for gretelf:
> =================================
> Oct 25 04:02:00 gretelf anacron[22218]: Updated timestamp for job
> `cron.daily' to 2000-10-25

> Oct 25 09:52:45 gretelf pulse[21302]: partner dead: activating lvs

The failover has discovered that the partner is dead.

> Oct 25 09:52:45 gretelf pulse[22803]: running command  "/sbin/ifconfig"
> "eth1:0" "10.0.1.254" "up"

The failover has taken over the NAT router address.

> Oct 25 09:52:45 gretelf pulse[22804]: running command  "/sbin/ifconfig"
> "eth0:0" "10.0.0.38" "up"

The failover has taken over the VIP.

> Oct 25 09:52:45 gretelf pulse[21302]: partner active: deactivating lvs

The failover has noticed that the primary has come back to life.

> Oct 25 09:52:45 gretelf pulse[22805]: running command  "/sbin/ifconfig"
> "eth0:0" "down"

The failover is dropping the VIP.  The primary should be picking it up.

> Oct 25 09:52:45 gretelf pulse[22801]: running command 
> "/usr/sbin/send_arp" "-i" "eth1" "10.0.1.254" "0010B556E9A6"
> "10.0.1.255" "ffffffffffff"

The failover is sending an arp - I assume that 0010B556E9A6 is the
MAC address of the Ethernet interface used for the NAT connection
to the realservers ON THE PRIMARY SYSTEM (i.e. the failover is telling
everybody to talk to the primary - or it should be).  Use "ifconfig -a"
on both gretel and gretelf, look for "HWaddr 00:10:B5:56:E9:A6"; which
one has this?  It should be on gretel and it should have the IP address
10.0.1.254, probably on eth1:0.
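
The send_arp command line prints the MAC address as bare hex digits, while
ifconfig prints it colon-separated after "HWaddr".  A minimal Python sketch
of that conversion follows, in case it helps when comparing the two; the
function name format_mac is made up for illustration and is not part of
Piranha or send_arp.

```python
def format_mac(raw: str) -> str:
    """Turn a bare 12-digit hex MAC (as passed to send_arp) into the
    colon-separated form that ifconfig shows after "HWaddr"."""
    # Accept input that already has separators, and normalise the case.
    digits = raw.strip().replace(":", "").replace("-", "").upper()
    if len(digits) != 12 or any(c not in "0123456789ABCDEF" for c in digits):
        raise ValueError(f"not a 48-bit MAC address: {raw!r}")
    # Group the 12 hex digits into 6 octets.
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


if __name__ == "__main__":
    # The two addresses that appear in the send_arp lines of the log.
    print(format_mac("0010B556E9A6"))  # 00:10:B5:56:E9:A6
    print(format_mac("00104BCA8523"))  # 00:10:4B:CA:85:23
```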

> Oct 25 09:52:45 gretelf pulse[22802]: running command 
> "/usr/sbin/send_arp" "-i" "eth0" "10.0.0.38" "00104BCA8523" "10.0.0.255"
> "ffffffffffff"

The failover is sending an arp - I assume that 00104BCA8523 is the
MAC address of the Ethernet interface used for the VIP *ON THE PRIMARY
SYSTEM* (i.e. the failover is telling everybody to talk to the primary -
or it should be).  Use "ifconfig -a" on both gretel and gretelf, look
for "HWaddr 00:10:4B:CA:85:23"; which one has this?  It should be on
gretel and it should have the IP address 10.0.0.38, probably on eth0:0.

> Oct 25 09:52:45 gretelf pulse[22808]: running command  "/sbin/ifconfig"
> "eth1:0" "down"

The failover is dropping the NAT (the primary should be picking it up).

> Oct 25 09:52:51 gretelf pulse[22800]: gratuitous lvs arps finished

The failover, gretelf, is indicating that it has handed everything off
to the primary (gretel).

If the send_arps are doing what I think they are doing, they should
have sent the MAC addresses of the primary (gretel); if they in
fact did this, then the primary needs to be doing its job - it
appears the failover is probably functioning perfectly.

So you need to look at gretel.  Pulse should be running on gretel.
gretel should have two virtual interfaces: eth0:0 (inet addr: 10.0.0.38,
HWaddr 00:10:4B:CA:85:23) and eth1:0 (inet addr: 10.0.1.254,
HWaddr 00:10:B5:56:E9:A6).  If this is not the case, then that is why
your cluster doesn't work after you bring up the primary (gretel) again.
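
The check on gretel described above could be sketched in Python roughly as
follows.  This assumes the old net-tools "ifconfig -a" layout ("HWaddr ..."
on the first line of each interface, "inet addr:..." on the second).  The
SAMPLE text is entirely made up for illustration (including the base
addresses 10.0.0.37 and 10.0.1.1 and the netmasks); it is not output from
the real machines.  The names parse_ifconfig and missing_interfaces are
hypothetical helpers, not part of any tool.

```python
import re

# What gretel is expected to carry while it is the active LVS router,
# taken from the addresses in the pulse log.
EXPECTED = {
    "eth0:0": ("10.0.0.38", "00:10:4B:CA:85:23"),
    "eth1:0": ("10.0.1.254", "00:10:B5:56:E9:A6"),
}


def parse_ifconfig(text):
    """Map interface name -> (inet addr or None, HWaddr or None)."""
    result = {}
    # ifconfig separates interfaces with a blank line.
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block.strip():
            continue
        name = block.split()[0]
        mac = re.search(r"HWaddr\s+([0-9A-Fa-f:]{17})", block)
        ip = re.search(r"inet addr:\s*(\d+\.\d+\.\d+\.\d+)", block)
        result[name] = (
            ip.group(1) if ip else None,
            mac.group(1).upper() if mac else None,
        )
    return result


def missing_interfaces(text, expected=EXPECTED):
    """Return the expected interfaces that are absent or have the wrong
    IP address or MAC address."""
    found = parse_ifconfig(text)
    return [name for name, want in expected.items() if found.get(name) != want]


# Hypothetical sample: the VIP alias is up, the NAT alias eth1:0 is not.
SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 00:10:4B:CA:85:23
          inet addr:10.0.0.37  Bcast:10.0.0.255  Mask:255.255.255.0

eth0:0    Link encap:Ethernet  HWaddr 00:10:4B:CA:85:23
          inet addr:10.0.0.38  Bcast:10.0.0.255  Mask:255.255.255.0

eth1      Link encap:Ethernet  HWaddr 00:10:B5:56:E9:A6
          inet addr:10.0.1.1  Bcast:10.0.1.255  Mask:255.255.255.0
"""

if __name__ == "__main__":
    print(missing_interfaces(SAMPLE))  # ['eth1:0']
```

An empty list would mean both virtual interfaces are up on the host with
the expected addresses.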

> Oct 25 09:52:45 gretelf pulse[21302]: partner dead: activating lvs
> Oct 25 09:52:45 gretelf pulse[22803]: running command  "/sbin/ifconfig"
> "eth1:0" "10.0.1.254" "up"
> Oct 25 09:52:45 gretelf pulse[22804]: running command  "/sbin/ifconfig"
> "eth0:0" "10.0.0.38" "up"
> Oct 25 09:52:45 gretelf pulse[21302]: partner active: deactivating lvs
> Oct 25 09:52:45 gretelf pulse[22805]: running command  "/sbin/ifconfig"
> "eth0:0" "down"
> Oct 25 09:52:45 gretelf pulse[22801]: running command 
> "/usr/sbin/send_arp" "-i" "eth1" "10.0.1.254" "0010B556E9A6"
> "10.0.1.255" "ffffffffffff"
> Oct 25 09:52:45 gretelf pulse[22802]: running command 
> "/usr/sbin/send_arp" "-i" "eth0" "10.0.0.38" "00104BCA8523" "10.0.0.255"
> "ffffffffffff"
> Oct 25 09:52:45 gretelf pulse[22808]: running command  "/sbin/ifconfig"
> "eth1:0" "down"
> Oct 25 09:52:51 gretelf pulse[22800]: gratuitous lvs arps finished
> ====================================
> What does this mean??

Hmm, now that I look at it further, you have all this in your logs
twice?  Perhaps your /etc/syslog.conf is configured strangely.

Also, I notice that the messages saying the partner is dead and the
partner is up, the ifconfigs down and up, the arps, all of it, happen
in the exact same second.  I may not be understanding this correctly at
all, or it may not be working correctly.

Why would pulse report the partner dead, and then active again,
in the same second?  Did you unplug and replug the network to the
primary (gretel) very quickly?  If not, something seems to be
broken here to me, but I have no idea what.
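
To spot this kind of same-second flip in a longer log, something like the
following Python sketch might help.  It only assumes the syslog line
layout seen in the excerpts above ("Mon DD HH:MM:SS host pulse[pid]:
partner dead|active ..."); the function name same_second_flips is made up
for illustration.

```python
import re
from collections import defaultdict

# Matches pulse state-change lines such as:
#   Oct 25 09:52:45 gretelf pulse[21302]: partner dead: activating lvs
STATE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+pulse\[\d+\]:\s+"
    r"partner (dead|active)"
)


def same_second_flips(lines):
    """Return (timestamp, host) pairs where pulse logged both
    'partner dead' and 'partner active' within the same second."""
    states = defaultdict(set)
    for line in lines:
        # Tolerate mail-style "> " quoting in front of the log line.
        m = STATE.match(line.lstrip("> "))
        if m:
            stamp, host, state = m.groups()
            states[(stamp, host)].add(state)
    return sorted(
        key for key, seen in states.items() if seen == {"dead", "active"}
    )


# Lines copied from the gretelf log quoted in this thread.
LOG = [
    "Oct 25 09:52:45 gretelf pulse[21302]: partner dead: activating lvs",
    "Oct 25 09:52:45 gretelf pulse[21302]: partner active: deactivating lvs",
    "Oct 25 09:52:51 gretelf pulse[22800]: gratuitous lvs arps finished",
]

if __name__ == "__main__":
    print(same_second_flips(LOG))  # [('Oct 25 09:52:45', 'gretelf')]
```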

Also, I was wondering why the failover would send arps for the
master, when it makes more sense for the master to send its
own arps.  I now think the send_arps are part of the failover
taking over for the dead master, which makes a lot more sense.

So, what do the logs on the primary (gretel) say? 

Also, it would help if you could post the results of "ifconfig -a"
for both gretel and gretelf during this process.

> I restarted the httpd on gretel. httpd on gretelf, gretel3 and gretel4 
> is still running.

The httpd on gretel should be of no consequence.  The *ONLY* thing
that the httpd on gretel should be used for is the piranha
admin gui (http://gretel/piranha, I think, or something like that).
 
> Another doubt: In lvs, when the primary node is down, the backup node
> will take over. Then why do we still need fos?

You would need fos if you did not want load balancing but just wanted
redundancy.  If you use fos, then you would run the httpd that delivers
content on gretel and gretelf - there would be no need for gretel3
and gretel4.  All web traffic would go to gretel and stop there.
If gretel died, then all web traffic would go to gretelf and stop
there.  It is much simpler, and there is NO load balancing at all.

-- 
John Cronin

