
Re: Failover Between 2 Datacenters

To: "LinuxVirtualServer.org users mailing list." <lvs-users@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: Failover Between 2 Datacenters
From: nick garratt <nick-lvs@xxxxxxxxxxxxxx>
Date: Fri, 2 May 2003 09:19:31 +0200
Yup, the DNS cutover mechanism does seem like the best alternative in this case. A 30s TTL is extreme; considering this is failover for a catastrophe occurring once or twice a year at most, I think 900s should be fine.
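
To put rough numbers on the 30s versus 900s trade-off, here is a back-of-the-envelope sketch. The detection and publish delays are assumed values chosen for illustration, not measurements, and the model only covers resolvers that honour the TTL.

```python
def worst_case_failover_seconds(ttl, detect=60, publish=30):
    """Upper bound on how long a TTL-honouring resolver keeps pointing at the dead site.

    ttl     -- TTL on the A record, in seconds
    detect  -- assumed time to notice the outage (hypothetical value)
    publish -- assumed time to get the changed record onto the
               authoritative servers (hypothetical value)
    """
    return detect + publish + ttl


def refetches_per_day_per_resolver(ttl):
    """A busy caching resolver re-fetches the record at most once per TTL."""
    return 86400 // ttl


for ttl in (30, 900):
    print(ttl, worst_case_failover_seconds(ttl), refetches_per_day_per_resolver(ttl))
```

Under these assumptions a 900s TTL bounds the cutover at 990s (about 16.5 minutes) for 96 re-fetches per resolver per day, while a 30s TTL bounds it at 120s for 2880 re-fetches.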

Consider this however:
DNS is essentially a distributed DB, but all modifications are propagated from the master. Secondaries keep answering queries if the master fails, but they cannot originate zone changes. The master DNS server therefore still represents a single point of failure: should your master fall off the map (datacentre/network outage), how will you originate your zone change? In that case you will need the cooperation of your registrar to change the IP of your primary DNS server.

I acknowledge the merit of this discussion and the potential need for failover between datacentres; however, it seems that some of the problems you seek to mitigate are issues that should be covered by SLAs with your providers. In this paradigm we must accept that there will always be variables beyond our control: script kiddies could DoS the root nameservers, a long-haul carrier could go bust, a worm could generate massive outages.

Your provider has an AS assigned to it; they should be able to route any IP range within it over their backbone infrastructure nationwide as they please. It would seem that some of these issues could be dealt with by hosting in two separate (regional) datacentres with the same provider, with the agreement that under such a failure condition they can reallocate and reroute your IP block to the alternative datacentre. This is entirely feasible for them to do, provided the will is there.


Nick


I know my California example is a bit extreme, but I
wanted to be sure we were talking about a complete
datacenter outage.  If I had a dedicated cabinet with
one CAT5 cable running to it, and some 3rd-shift
network engineer with too much coffee in their blood
knocks my feed from the datacenter switch, I consider
that a "data center outage" for our discussion
purposes.  Or what if a core router starts spewing out
faulty route broadcasts that quickly spread and
corrupt the routes of member routers ... effectively
crippling a network (or internet backbone) for hours.
Human error is very real.  I'm sure hundreds of
*realistic* scenarios could be thought of to justify
off-site redundancy, so let's just move on.

I mentioned VRRP to stress the desire for
near-instantaneous failover ... minimizing the amount
of downtime a client accessing your site may
experience. It should all be transparent as far as they're
concerned.  Obviously, without considerable expense,
this is not achievable.  It looks like the best
solution for this scenario is one using DNS with low
(30 sec?) TTL values.  It won't immediately fail over
your services, but it may reduce your downtime from
hours to [several] minutes should some major outage
occur at your primary datacenter.  Nick, I agree with
you that DNS solutions are less than ideal,
considering there are so many factors out of your
control like caching DNS servers that ignore your TTL
values, but it seems to be the only solution for
cost-conscious companies forced to provide three or
more 9's of service to their clients.
Those of you out there who have ever supplied HA
services (SOAP Web Services in our case) to
Fortune 500/100-level companies know the
importance of redundant facilities in your service
offering or RFP replies.  You won't make it through
their due-diligence process without them.
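
A minimal sketch of the DNS-based cutover described above, under stated assumptions: the primary is probed with a plain TCP connect, and after a few consecutive failures the record is pointed at the standby site. The `publish` callback is hypothetical and stands in for whatever mechanism actually updates the A record; the class name, port, threshold and addresses are illustrative only.

```python
import socket


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class DnsFailover:
    """Switch the published address after `threshold` consecutive failed probes."""

    def __init__(self, primary, standby, publish, probe=tcp_probe, port=80, threshold=3):
        self.primary, self.standby = primary, standby
        self.publish = publish      # hypothetical callback that updates the A record
        self.probe = probe
        self.port = port
        self.threshold = threshold
        self.failures = 0
        self.active = primary

    def check_once(self):
        """Probe the primary once; fail over if it has been down long enough."""
        if self.probe(self.primary, self.port):
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold and self.active != self.standby:
                self.active = self.standby
                self.publish(self.standby)
        return self.active
```

The sketch deliberately never fails back on its own, so a flapping primary cannot bounce the record back and forth. Calling `check_once()` from a timer gives a detection delay of roughly the probe interval times `threshold`, which is then added to the TTL before clients actually move.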

I want to thank all who have contributed to this
thread (on and off list) and acted as sounding boards
for my discussion.  I felt a "Global" or "Datacenter-
level" failover solution hadn't been discussed in
enough detail in any online forum I'd found, and the
LVS group seemed to be the perfect one.

Thanks!
-Ken

--- nick garratt <nick-lvs@xxxxxxxxxxxxxx> wrote:
 Well, a State falling off the map is hardly a failure
 situation for which it makes sense to build in a
 cutover with 60s minimum latency. What if the
 United States fell off the map? What if the map
 ceased to exist?

 Keeping your DNS TTLs really low can help you
 somewhat in this situation, although they certainly
 cannot be set to not cache at all. I have also
 encountered DNS servers that do not correctly
 observe these settings. You have no control over all
 the intermediate name servers that might be caching
 your DNS records, and thus DNS is not suited to
 low-latency failover.

 VRRP is basically IP failover through election and
 is not relevant to
 this discussion.

 It seems to me the only satisfactory solution is for
 you to apply for your own autonomous system (at
 considerable cost), which will allow you full control
 of your BGP data. With cooperation from other AS
 admins, it will be possible to ensure substantial
 route redundancy and rapid cutovers should you lose a
 datacentre/state/continent :)




_______________________________________________
LinuxVirtualServer.org mailing list - lvs-users@xxxxxxxxxxxxxxxxxxxxxx
Send requests to lvs-users-request@xxxxxxxxxxxxxxxxxxxxxx
or go to http://www.in-addr.de/mailman/listinfo/lvs-users
