Re: LVS + DRBD

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: LVS + DRBD
From: Roberto Nibali <ratz@xxxxxxxxxxxx>
Date: Mon, 20 Nov 2006 22:50:03 +0100
I've got a pair of fileservers running DRBD using heartbeat and NFS
(active/passive).  Seeing as how these two NFS servers serve the entire
office, I figured I'd also put Nagios onto this pair (not on the NFS
filesystem) since if they were both down, then everything would be out of
service (so there wouldn't be much point monitoring everything being down ;)
).

Unless the monitoring caches local health status for traceability, but installing executables on an NFS partition is suicide anyway :).

The problem I am having seems to be that now that I've got the LVS setup and
running, DRBD will no longer start.  Looking through the /var/log/ha-debug

Those have absolutely nothing in common, so I suspect either a software configuration problem or a heartbeat problem. Since heartbeat is rock solid, I'll go straight to your configuration.

Side note: The linux-ha mailing list is full of experts on Linux-HA issues. I understand why you've posted this here, but we might need a bit longer to help you out.

logs, it appears that none of the DRBD commands ever get a proper start
command.  My haresources file looks like this:

mimir.yggdrasil \
        drbddisk::clients0 \
        Filesystem::/dev/drbd0::/opt/mnt/data::ext3 \
        killnfsd \
        nfs \
        nfslock \
        Delay::3::0 \
        IPaddr2::10.0.0.3/24/eth2/10.0.0.255
mimir.yggdrasil \
        ldirectord \
        LVSSyncDaemonSwap::master \
        IPaddr2::192.168.0.3/24/eth1/192.168.0.255

The second mimir.yggdrasil entry does not make sense to me at first sight. What's the purpose of it? Why is killnfsd needed, and what does it look like? Also, I'd put the IPaddr2 resource first, because other daemons might need the address, and then they don't have to bind lazily.

After the Delay command, everything in the first section shuts down again.

What's the first section?

Is it not possible to have multiple sections in haresources?  Should
everything be combined into the one section?

It's possible to have multiple node configurations, like so:

node-A \
  res-1 \
  res-2 \
  res-3

node-B \
  res-4 \
  res-5

 My ha-debug log looks like this:
heartbeat: 2006/11/12_01:21:23 debug: StartNextRemoteRscReq(): child count 1
heartbeat: 2006/11/12_01:21:23 debug: Starting /etc/ha.d/resource.d/drbddisk
clients0 start
heartbeat: 2006/11/12_01:21:23 debug: /etc/ha.d/resource.d/drbddisk clients0
start done. RC=0

Very good.

heartbeat: 2006/11/12_01:21:23 debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /opt/mnt/data ext3 start
heartbeat: 2006/11/12_01:21:23 debug: /etc/ha.d/resource.d/Filesystem
/dev/drbd0 /opt/mnt/data ext3 start done. RC=0

Very good.

nfsd: no process killed
heartbeat: 2006/11/12_01:21:23 debug: Starting /etc/ha.d/resource.d/killnfsd
start
nfsd: no process killed
heartbeat: 2006/11/12_01:21:23 debug: /etc/ha.d/resource.d/killnfsd  start
done. RC=1

Not so good, wrong return code. So heartbeat has a resource problem and will shut down the node ... in reverse order of the last semantically correct resource configuration item, which is:

heartbeat: 2006/11/12_01:21:24 debug: Starting /etc/ha.d/resource.d/IPaddr
10.0.0.3/24/eth2 stop
heartbeat: 2006/11/12_01:21:24 debug: /etc/ha.d/resource.d/IPaddr
10.0.0.3/24/eth2 stop done. RC=0
heartbeat: 2006/11/12_01:21:24 debug: Starting /etc/ha.d/resource.d/Delay 3
0 stop
Delay already stopped
heartbeat: 2006/11/12_01:21:24 debug: /etc/ha.d/resource.d/Delay 3 0 stop
done. RC=0

So far so good (maybe not for you, but for heartbeat)

heartbeat: 2006/11/12_01:21:24 debug: Starting /etc/init.d/nfslock  stop
Stopping NFS locking: [FAILED]
Stopping NFS statd: [FAILED]
heartbeat: 2006/11/12_01:21:24 debug: /etc/init.d/nfslock  stop done. RC=0

I'm not yet sure why you need this nfslock stuff and especially the killnfsd.
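
The killnfsd script itself is not shown in this thread, so the following is only a minimal sketch of a resource script that reports success even when there is no nfsd to kill, which is what the RC=1 lines in the log suggest is missing. The function names and the KILL_CMD override are hypothetical, and printing "stopped" for the status action is an assumption about what suits a pseudo-resource that never stays running:

```shell
#!/bin/sh
# Hypothetical replacement for /etc/ha.d/resource.d/killnfsd (a sketch, not
# the poster's actual script).

kill_nfsd() {
    # killall exits non-zero when no process matched. For this resource that
    # still means "nothing left running", so the result is discarded.
    # KILL_CMD is an illustrative override so the script can be dry-run.
    ${KILL_CMD:-killall -9 nfsd} >/dev/null 2>&1
    return 0
}

killnfsd_main() {
    case "$1" in
        start|stop)
            kill_nfsd
            ;;
        status)
            # Assumption: this pseudo-resource is never "running".
            echo "stopped"
            ;;
        *)
            echo "usage: killnfsd {start|stop|status}" >&2
            return 1
            ;;
    esac
}

# Run only when an action was passed (heartbeat always passes one).
if [ "$#" -gt 0 ]; then
    killnfsd_main "$@"
fi
```

With a script along these lines, both "killnfsd start" and "killnfsd stop" would finish with RC=0 whether or not an nfsd process existed.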

heartbeat: 2006/11/12_01:21:24 debug: Starting /etc/init.d/nfs  stop
Shutting down NFS mountd: [FAILED]
Shutting down NFS daemon: [FAILED]
Shutting down NFS quotas: [FAILED]
Shutting down NFS services:  [  OK  ]
heartbeat: 2006/11/12_01:21:24 debug: /etc/init.d/nfs  stop done. RC=0

Seems to have worked wonderfully.

heartbeat: 2006/11/12_01:21:24 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:24 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1

Wrong return value again, but this time in the resource release state of heartbeat. This means we will try again ...

heartbeat: 2006/11/12_01:21:25 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed

... and again ...

heartbeat: 2006/11/12_01:21:25 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1
heartbeat: 2006/11/12_01:21:26 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:26 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1
heartbeat: 2006/11/12_01:21:27 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:27 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1
heartbeat: 2006/11/12_01:21:28 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:28 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1
heartbeat: 2006/11/12_01:21:29 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:29 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1
heartbeat: 2006/11/12_01:21:30 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:30 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1
heartbeat: 2006/11/12_01:21:31 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:31 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1
heartbeat: 2006/11/12_01:21:32 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:32 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1
heartbeat: 2006/11/12_01:21:33 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:33 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1
heartbeat: 2006/11/12_01:21:34 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:34 debug: /etc/ha.d/resource.d/killnfsd  stop
done. RC=1
nfsd: no process killed

Boah, heartbeat got tired of being messed around with and tries something new to irritate the user :)

heartbeat: 2006/11/12_01:21:34 debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /opt/mnt/data ext3 stop

Fine, next in list to stop.

heartbeat: 2006/11/12_01:21:35 debug: /etc/ha.d/resource.d/Filesystem
/dev/drbd0 /opt/mnt/data ext3 stop done. RC=0
heartbeat: 2006/11/12_01:21:35 debug: Starting /etc/ha.d/resource.d/drbddisk
clients0 stop
heartbeat: 2006/11/12_01:21:35 debug: /etc/ha.d/resource.d/drbddisk clients0
stop done. RC=0

Ok, all resources are now released. Now we enter the undefined area of haresources parsing, the second node configuration, which is actually the same node, but with new resources, such as:

ldirectord is stopped for /etc/ha.d/ldirectord.cf
heartbeat: 2006/11/12_01:21:35 debug: Starting /etc/init.d/ldirectord  start
Starting ldirectord [  OK  ]
heartbeat: 2006/11/12_01:21:36 debug: /etc/init.d/ldirectord  start done.
RC=0

Which works fine.

heartbeat: 2006/11/12_01:21:36 debug: Starting
/etc/ha.d/resource.d/LVSSyncDaemonSwap master start
heartbeat: 2006/11/12_01:21:36 debug: /etc/ha.d/resource.d/LVSSyncDaemonSwap
master start done. RC=0

Which works fine also.

heartbeat: 2006/11/12_01:21:36 debug: Starting /etc/ha.d/resource.d/IPaddr2
192.168.0.3/24/eth1/192.168.0.255 start

Which works fine and is also the last resource to be started in the second mimir node configuration. The node mimir has now successfully released its resources and the node mimir (same machine) has taken over with its resources.

heartbeat: 2006/11/12_01:22:17 debug: Received standby message done from
mimir.yggdrasil in state 0
heartbeat: 2006/11/12_01:22:17 debug: RscMgmtProc 'go_standby' exited code 0

You've just shown a startup, shutdown, standby and resource acquisition performed on a single node. I've never tried that before but I doubt it's what you intended to do.
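
To spot the failing resource calls without reading the whole log, something like the following sketch could be used. It assumes the real ha-debug file has one entry per line (the excerpts above were wrapped by the mail client) and that each completed call ends in "done. RC=N" as shown; the function name summarize_failures is made up for illustration:

```shell
#!/bin/sh
# summarize_failures LOGFILE
# Prints "<count> <command> RC=<code>" for every resource call that finished
# with a non-zero return code. Assumes unwrapped, one-entry-per-line logs.
summarize_failures() {
    awk '
        / done\. RC=[0-9]+[ \t]*$/ {
            rc = $NF
            sub(/^RC=/, "", rc)
            if (rc + 0 == 0) next               # successful call, skip it
            cmd = $0
            sub(/^.* debug: /, "", cmd)         # drop "heartbeat: <date> debug: "
            sub(/[ \t]+done\. RC=.*$/, "", cmd) # drop the trailing result
            gsub(/[ \t]+/, " ", cmd)            # normalise double spaces
            count[cmd " RC=" rc]++
        }
        END { for (k in count) print count[k], k }
    ' "$1"
}
```

Usage would be along the lines of `summarize_failures /var/log/ha-debug | sort -rn`, which lists the most frequently failing call first.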

Can you show your ha.cf? What are the names of your two nodes? Right now they are mimir.yggdrasil and mimir.yggdrasil.
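
A quick way to see which node name starts each haresources group, and whether a name shows up more than once, could look like the sketch below. The function name list_group_nodes is hypothetical, and the parsing is a simplification that assumes groups are continued with a trailing backslash and comments start with "#":

```shell
#!/bin/sh
# list_group_nodes FILE
# Prints the node name that begins each resource group in a haresources file.
list_group_nodes() {
    awk '
        /^[ \t]*#/ { next }                 # skip comment lines
        /^[ \t]*$/ { cont = 0; next }       # a blank line ends any continuation
        !cont      { print $1 }             # first line of a group: node name
                   { cont = ($0 ~ /\\[ \t]*$/) }  # trailing backslash continues
    ' "$1"
}
```

For example, `list_group_nodes /etc/ha.d/haresources | sort | uniq -d` prints every node name that begins more than one group; with the haresources quoted above it would print mimir.yggdrasil.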

Best regards,
Roberto Nibali, ratz
--
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
