I've got a pair of fileservers running DRBD using heartbeat and NFS
(active/passive). Seeing as how these two NFS servers serve the entire
office, I figured I'd also put Nagios onto this pair (not on the NFS
filesystem) since if they were both down, then everything would be out of
service (so there wouldn't be much point monitoring everything being down ;)
).
Unless the monitoring caches local health status for traceability, but
installing executables on an NFS partition is suicide anyway :).
The problem I am having seems to be that now that I've got the LVS set up and
running, DRBD will no longer start.
Those have absolutely nothing in common so I suspect either a software
configuration problem or heartbeat problem. Since heartbeat is rock
solid, I'll jump straight to your configuration.
Sidenote: The linux-ha mailing list is full of experts regarding linux-HA
issues. I understand why you've posted this here, but we might need a
bit longer to help you out.
Looking through the /var/log/ha-debug logs, it appears that none of the DRBD
commands ever get a proper start command. My haresources file looks like this:
mimir.yggdrasil \
drbddisk::clients0 \
Filesystem::/dev/drbd0::/opt/mnt/data::ext3 \
killnfsd \
nfs \
nfslock \
Delay::3::0 \
IPaddr2::10.0.0.3/24/eth2/10.0.0.255
mimir.yggdrasil \
ldirectord \
LVSSyncDaemonSwap::master \
IPaddr2::192.168.0.3/24/eth1/192.168.0.255
The second mimir.yggdrasil does not make sense to me at first sight.
What's the purpose of it? Why is the killnfsd needed and what does it
look like? Also, I'd put the IPaddr2 resource first, because other
daemons might need it so they don't need to lazy bind.
After the Delay command, everything in the first section shuts down again.
What's the first section?
Is it not possible to have multiple sections in haresources? Should
everything be combined into the one section?
It's possible to have multiple node configurations, like so:
node-A \
res-1 \
res-2 \
res-3
node-B \
res-4 \
res-5
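To see at a glance which node each resource group in a haresources file is assigned to, something like the following could be used. It is only a sketch: the function name list_group_nodes is made up for illustration, and it assumes groups are written as above, with a trailing backslash continuing a group onto the next line.

```shell
#!/bin/sh
# Hypothetical helper (not part of heartbeat): print the node name that
# starts each resource group in a haresources-style file.
# Assumption: a trailing backslash continues the group on the next line.
list_group_nodes() {
    awk '
        { group = group $0 }                        # collect the current group
        /\\$/ { sub(/\\$/, " ", group); next }      # continued: keep reading
        {
            n = split(group, field, " ")
            if (n > 0 && field[1] !~ /^#/) print field[1]
            group = ""
        }
    ' "$1"
}
```

Run against the haresources quoted earlier, this prints mimir.yggdrasil twice. Comparing the output with `uname -n` on each machine is one way to check that the names are the ones heartbeat expects.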
My ha-debug log looks like this:
heartbeat: 2006/11/12_01:21:23 debug: StartNextRemoteRscReq(): child count 1
heartbeat: 2006/11/12_01:21:23 debug: Starting /etc/ha.d/resource.d/drbddisk
clients0 start
heartbeat: 2006/11/12_01:21:23 debug: /etc/ha.d/resource.d/drbddisk clients0
start done. RC=0
Very good.
heartbeat: 2006/11/12_01:21:23 debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /opt/mnt/data ext3 start
heartbeat: 2006/11/12_01:21:23 debug: /etc/ha.d/resource.d/Filesystem
/dev/drbd0 /opt/mnt/data ext3 start done. RC=0
Very good.
nfsd: no process killed
heartbeat: 2006/11/12_01:21:23 debug: Starting /etc/ha.d/resource.d/killnfsd
start
nfsd: no process killed
heartbeat: 2006/11/12_01:21:23 debug: /etc/ha.d/resource.d/killnfsd start
done. RC=1
Not so good: killnfsd returned a non-zero exit code. Heartbeat treats that
as a resource failure and releases the resources again, in reverse order,
starting with the last item of that resource group, which is:
heartbeat: 2006/11/12_01:21:24 debug: Starting /etc/ha.d/resource.d/IPaddr
10.0.0.3/24/eth2 stop
heartbeat: 2006/11/12_01:21:24 debug: /etc/ha.d/resource.d/IPaddr
10.0.0.3/24/eth2 stop done. RC=0
heartbeat: 2006/11/12_01:21:24 debug: Starting /etc/ha.d/resource.d/Delay 3
0 stop
Delay already stopped
heartbeat: 2006/11/12_01:21:24 debug: /etc/ha.d/resource.d/Delay 3 0 stop
done. RC=0
So far so good (maybe not for you, but for heartbeat)
heartbeat: 2006/11/12_01:21:24 debug: Starting /etc/init.d/nfslock stop
Stopping NFS locking: [FAILED]
Stopping NFS statd: [FAILED]
heartbeat: 2006/11/12_01:21:24 debug: /etc/init.d/nfslock stop done. RC=0
I'm not yet sure why you need this nfslock stuff and especially the
killnfsd.
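If killnfsd is nothing more than a wrapper that kills leftover nfsd processes (an assumption on my part, since the script itself was not posted), a version that always reports success to heartbeat might look roughly like this. KILL_CMD is a made-up variable so the kill command can be swapped out, and `killall -9 nfsd` is only a guess at what the real script runs.

```shell
#!/bin/sh
# Hypothetical stand-in for /etc/ha.d/resource.d/killnfsd (a guess, the real
# script was not posted). KILL_CMD is an illustrative variable, not a
# heartbeat setting.
KILL_CMD="${KILL_CMD:-killall -9 nfsd}"

killnfsd() {
    case "$1" in
        start|stop)
            # "no process killed" makes killall exit non-zero; that is not
            # a failure for this resource, so the status is discarded.
            $KILL_CMD >/dev/null 2>&1 || true
            ;;
    esac
    return 0
}

# When installed as a resource script, the action arrives as $1.
if [ "$#" -gt 0 ]; then
    killnfsd "$1"
    exit $?
fi
```

With something like this in place, a "no process killed" result should no longer show up as RC=1 in ha-debug. Whether killnfsd is needed at all is a separate question.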
heartbeat: 2006/11/12_01:21:24 debug: Starting /etc/init.d/nfs stop
Shutting down NFS mountd: [FAILED]
Shutting down NFS daemon: [FAILED]
Shutting down NFS quotas: [FAILED]
Shutting down NFS services: [ OK ]
heartbeat: 2006/11/12_01:21:24 debug: /etc/init.d/nfs stop done. RC=0
Seems to have worked wonderfully.
heartbeat: 2006/11/12_01:21:24 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:24 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
Wrong return value again, but this time in the resource release state of
heartbeat. This means we will try again ...
heartbeat: 2006/11/12_01:21:25 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
... and again ...
heartbeat: 2006/11/12_01:21:25 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
heartbeat: 2006/11/12_01:21:26 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:26 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
heartbeat: 2006/11/12_01:21:27 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:27 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
heartbeat: 2006/11/12_01:21:28 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:28 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
heartbeat: 2006/11/12_01:21:29 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:29 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
heartbeat: 2006/11/12_01:21:30 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:30 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
heartbeat: 2006/11/12_01:21:31 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:31 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
heartbeat: 2006/11/12_01:21:32 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:32 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
heartbeat: 2006/11/12_01:21:33 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:33 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
heartbeat: 2006/11/12_01:21:34 debug: Starting /etc/ha.d/resource.d/killnfsd
stop
nfsd: no process killed
heartbeat: 2006/11/12_01:21:34 debug: /etc/ha.d/resource.d/killnfsd stop
done. RC=1
nfsd: no process killed
Boah, heartbeat got tired of being messed around with and tries
something new to irritate the user :)
heartbeat: 2006/11/12_01:21:34 debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /opt/mnt/data ext3 stop
Fine, next in list to stop.
heartbeat: 2006/11/12_01:21:35 debug: /etc/ha.d/resource.d/Filesystem
/dev/drbd0 /opt/mnt/data ext3 stop done. RC=0
heartbeat: 2006/11/12_01:21:35 debug: Starting /etc/ha.d/resource.d/drbddisk
clients0 stop
heartbeat: 2006/11/12_01:21:35 debug: /etc/ha.d/resource.d/drbddisk clients0
stop done. RC=0
Ok, all resources are now released. Now we enter the undefined area of
haresources parsing, the second node configuration, which is actually
the same node, but with new resources, such as:
ldirectord is stopped for /etc/ha.d/ldirectord.cf
heartbeat: 2006/11/12_01:21:35 debug: Starting /etc/init.d/ldirectord start
Starting ldirectord [ OK ]
heartbeat: 2006/11/12_01:21:36 debug: /etc/init.d/ldirectord start done.
RC=0
Which works fine.
heartbeat: 2006/11/12_01:21:36 debug: Starting
/etc/ha.d/resource.d/LVSSyncDaemonSwap master start
heartbeat: 2006/11/12_01:21:36 debug: /etc/ha.d/resource.d/LVSSyncDaemonSwap
master start done. RC=0
Which works fine also.
heartbeat: 2006/11/12_01:21:36 debug: Starting /etc/ha.d/resource.d/IPaddr2
192.168.0.3/24/eth1/192.168.0.255 start
Which works fine and is also the last resource to be started in the
second mimir node configuration. The node mimir has now successfully
released its resources and the node mimir (same machine) has taken over
with its resources.
heartbeat: 2006/11/12_01:22:17 debug: Received standby message done from
mimir.yggdrasil in state 0
heartbeat: 2006/11/12_01:22:17 debug: RscMgmtProc 'go_standby' exited code 0
You've just shown a startup, shutdown, standby and resource acquisition
performed on a single node. I've never tried that before but I doubt
it's what you intended to do.
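A quick way to spot which resource scripts returned a non-zero code in a log like the one above is to filter on the "done. RC=" lines. This is a rough sketch: failed_resource_calls is a made-up name, and it assumes each entry sits on a single line in the real log file (unlike the wrapped lines quoted above).

```shell
#!/bin/sh
# Hypothetical helper: count resource-script calls that ended with a non-zero
# return code in an ha-debug style log.
# Assumption: one entry per line, ending in "done. RC=<number>".
failed_resource_calls() {
    grep -E 'done\. RC=[1-9][0-9]*$' "$1" |
        sed -e 's/^.* debug: //' |
        sort |
        uniq -c
}
```

Called as `failed_resource_calls /var/log/ha-debug`, it lists each failing call once, prefixed with how often it occurred, so a script that is retried over and over stands out.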
Can you show your ha.cf? What are the names of your two nodes? Right now
they are mimir.yggdrasil and mimir.yggdrasil.
Best regards,
Roberto Nibali, ratz
--
echo
'[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc