I know this has been discussed before, but has anybody done this much?
The setup I have is as follows:
Clients (4):
  PIII 933
  3x 100 Mbit Ethernet, bonded
Director (1):
  PIII 933
  1x 100 Mbit Ethernet
Realservers (2):
  PIII 933
  1x 100 Mbit Ethernet
  1 Gb Fibre Channel
  GFS shared-filesystem cluster running on a hardware RAID array
The primary protocol I am interested in here is NFS. I have the director
set up with DR and LC scheduling, no persistence, and UDP connections timing
out after 5 seconds. My reasoning is that the only time a client really needs
to stay on the same realserver is while it is reading a file, so that the two
GFS nodes are not both in contention for the same file, which seems to cost
performance in GFS. Those reads come as a quick series of accesses, so there
is not much need to keep the traffic on the same host beyond 5 seconds.
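In case it helps, here is a rough sketch of what I mean on the director (the
IP addresses are just placeholders, not my real ones):

   # UDP virtual service for NFS (port 2049), least-connection scheduling,
   # no persistence
   ipvsadm -A -u 10.0.0.100:2049 -s lc
   # both realservers added with direct routing (-g)
   ipvsadm -a -u 10.0.0.100:2049 -r 10.0.0.1 -g
   ipvsadm -a -u 10.0.0.100:2049 -r 10.0.0.2 -g
   # leave the tcp/tcpfin timeouts alone (0 = unchanged), drop udp to 5s
   ipvsadm --set 0 0 5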
Now, initially it was fine. Because both realservers have the same view of
the filesystem on the same device, NFS is perfectly happy being served from
the two hosts even though each client only mounted it once.
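The NFS side is nothing exotic; something along these lines on both
realservers, with the path and options here being illustrative rather than
exact:

   # /etc/exports -- identical on both realservers, exporting the shared
   # GFS mount point so either one can answer for the same files
   /gfs    *(rw,no_root_squash)

The point is simply that both servers export the same underlying device with
the same options.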
However, I am starting to see a few funny things which I don't know if they
are caused by GFS, NFS, or the fact that I am load balancing both. The first
problem showed up when I started a stress test to see how the
performance was. As expected, performance running through LVS matched what I
got when I manually split the traffic between the two NFS servers. However, I was
also writing the log file from this across LVS onto the GFS filesystem. And
it was in this log file that I started to see problems. As the file was
being written, occasionally a chunk of data would be replaced with all zeros.
Now my initial thought was that it was data going to different GFS hosts,
and that one did not have the previously written data yet. But AFAIK GFS is
pretty good about synchronizing data. And in computer terms, 5 seconds is a
long time; with my UDP timeout, traffic only moves to the other server after
5 idle seconds, so the data would have to stay out of sync at least that long
for the next write to land on a realserver that had not yet seen it.
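One way to sanity-check that theory would be to watch the director's
connection table while the log is being written, and see whether the writing
client's UDP entry really expires and the next write lands on the other
realserver; roughly:

   # on the director: list the current connection entries (numeric),
   # refreshing every second to watch the UDP entries expire
   watch -n 1 'ipvsadm -L -c -n'

That should at least show whether the writes were actually flipping between
the two servers within the 5 second window.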
So I stopped that test and eliminated NFS and LVS from the picture, instead
running the stress test coordinator on one of the realservers, writing the
log directly to the GFS filesystem, to a different file. However, now the server I *had*
been using to write the log is reporting a stale NFS handle on the file it
had been writing. All other hosts see it fine.
I'm tempted to think this is a GFS bug, but I am not ruling anything out at
the moment.
On another note, has anybody else had success with NFS under LVS? Any config
recommendations? Or perhaps an alternative to GFS that would hold up under
the kind of load I am putting on it? (We are probably going to try CODA
next.)