
Designing Geographically Distributed Web Hosting Solutions

(An article I originally wrote for Windows 2000 Experts Journal near the end of 2000)

Who will benefit from this article: NT administrators, Web managers looking to achieve the ultimate in reliability.

What you’ll learn: How to keep your site running even in the event of a natural disaster.

Bottom Line: Your Web presence is a key component of your business.  You can’t afford to have it offline, ever.  You’ve provided for redundancy at every level, but there is still one glaring single point of failure—your Web farm. 

Every time Amazon.com is offline, even for fifteen minutes, it warrants a headline on CNN.  Hundreds of thousands of users spend money at Amazon’s site, and if it happens to be offline when they want to purchase something, those users are more likely to visit a competing site than wait for Amazon to come back online.

Amazon understands the relationship between downtime and lost revenue.  They’ve taken many steps towards protecting against downtime.  Nonetheless, even the most resilient sites fail from time to time.  If you’re designing a Web site with similar uptime requirements, or if you just want to have a way to let your users know that you’ll be back online in a few minutes, you need a geographically distributed hosting solution.

In last month’s article, I discussed improving a Microsoft technologies-based Web site’s reliability by designing for system redundancy.  In this article, I’ll expand on that same idea and discuss creating Web solutions that span multiple data centers.  After reading it, you’ll know exactly how to create a Web site that never goes offline, even if a tornado destroys your primary Web farm.  I’ll describe the various technologies you’ll need to implement, and give you links to vendors that provide the best-of-breed implementations.

Something for Everyone

You get what you pay for, but any commercial site can afford some level of geographic distribution.  In fact, most can’t afford not to have it.  In this section, I’ll describe the different levels of protection and give you an idea of what each costs.  We’ll leave the technical nitty-gritty for the next section.

User Notification

For as little as a couple hundred dollars a month, you can host a page at a secondary Web farm that notifies your customers of the outage and offers to send them an e-mail when the site is back online.  This architecture is illustrated in Figure 1.  Sure, you’d rather have your visitors browsing your fully functional site… but it’s much better than showing them a “Cannot find server” error, isn’t it?  After all, without any information, your users may just assume you’ve become the next Pets.com!

 

Figure 1: Ensuring continued business presence requires little more than a single Web server in a remote location.

The price of a single Web server is pretty low to give you the peace of mind that you will always have some form of a Web presence.  Depending on your security requirements, a firewall in front of the backup Web server may not even be necessary.  And, once you’ve got the infrastructure of a Web server in a secondary data center, there’s nothing to stop you from adding more backup content than a simple downtime notification.  If your Web site is entirely static, your task is as simple as configuring file replication.  If your Web site supports online transactions, message boards, or personalization, you’ve got a much more complicated (and expensive!) task before you.

Active/Passive Fail-Over

So, it’s not enough that your users are notified of downtime.  You need them to have access to your entire site, even if the primary Web farm has failed.  This is doable, but it’ll cost you a little more—in the form of hardware, development time, and ongoing maintenance. 

First, a single Web server isn’t going to be sufficient.  You need to plan for the worst-case scenario—the primary site will fail at your busiest moment.  For your backup site to handle that load, you’ll need to have the same processing capability as the primary site.  If your primary site is a single two-processor Web server, you’ll need to have a similar two-processor server in your backup site.  If your primary site consists of four Web servers and an eight-processor database server, you’ll need to replicate that hardware.  You can reduce the costs of the backup hardware by mirroring the primary site, but not providing for redundancy.  After all, the entire secondary Web farm is redundant—so unless you’re a belt-and-suspenders type person, save yourself some cash and buy the minimum hardware required for your backup Web farm.  This configuration is illustrated in Figure 2.

Figure 2: For only a little less cash than what your primary site cost, you can configure a backup Web farm to keep your business running at all times.

The cost of maintaining a fully-functional backup Web farm comes in several forms, and hardware cost is the least of them.  For two data centers to contain identical content, you need to set up several forms of replication.  DFS file replication takes care of your HTML, ASP and images.  Microsoft SQL Server 2000’s database replication moves your data… But how will you replicate application updates, configuration details and security restrictions?  We’ll cover replication technologies in greater detail later in this article.

Once you get all of your content replicating, you’ll need to create a plan for what will happen when the primary data center does fail.  Your fail-over procedures must include steps for detecting the failure, bringing the backup site online and redirecting user traffic.  If you want to minimize downtime, you’ll need to automate these procedures, too.  Also think about what you will do when the primary site returns online.  The fail-back to a primary site is generally even more complex than the fail-over was!  Creating and testing these processes will consume several weeks of your administrators’ and developers’ time, so be sure to plan for that additional cost.  I’ll go into more detail about this later in the article.

Active/Active Load-Balancing

You love the idea of active/passive fail-over between distant cities, but hate the idea of having all that backup hardware sitting idle 99.9% of the time.  In a perfect world, both sites would always be running simultaneously and incoming user requests would be automatically forwarded to the closest Web farm.  This is attainable, but it’s more complex than you may realize.

Active/active configurations have all the complexity of the active/passive configuration, plus a few more headaches.  For example, if you sell PlayStation 2s on your site, you need to ensure that the very last system in your inventory isn’t sold to two different users logging into your east and west coast Web farms.  Users will be actively surfing both sites simultaneously, so replication needs to be bi-directional.  Depending on your needs, the replication may also need to be transactional.  I’ll discuss this type of replication later in the article, but it essentially means that all database updates need to occur in both Web farms simultaneously.

Fail-over and fail-back procedures are more complicated when the site is active/active.  Your plans must accommodate either site failing, and this is significantly more challenging.  Ultimately, the reason active/active implementations are so rare is that they require you to design your application around the concept.  For example, your database will need to behave differently depending on whether one or both sites are currently online, because it will have to be intelligent enough to detect the failure of the mirror site and begin storing updates instead of replicating them.
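To make that last point concrete, here is a minimal Python sketch of a data store that replicates each write to its mirror while the mirror is reachable and stores updates for later when it is not.  Everything in it (the ReplicatedStore class, the peer_online flag, the in-memory dictionaries) is a hypothetical stand-in for illustration, not a feature of any particular database product.

```python
class ReplicatedStore:
    """Toy key/value store that mirrors writes to one peer site."""

    def __init__(self):
        self.data = {}
        self.peer = None          # the mirror site's store (set after construction)
        self.peer_online = True   # assumed to be maintained by a separate health check
        self.pending = []         # updates stored while the peer is unreachable

    def write(self, key, value):
        # Always commit locally first.
        self.data[key] = value
        if self.peer_online:
            # Normal operation: replicate the update to the mirror site.
            self.peer.data[key] = value
        else:
            # Stand-alone operation: store the update instead of replicating it.
            self.pending.append((key, value))

    def peer_recovered(self):
        # Replay stored updates in their original order, then resume replication.
        for key, value in self.pending:
            self.peer.data[key] = value
        self.pending.clear()
        self.peer_online = True
```

A real application would also have to resolve conflicting updates made at both sites during the outage, which this sketch ignores.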

Technology Overview

To this point in the article I’ve been describing geographically distributed hosting from a conceptual level.  That’s fine for high-level managers, but if you’re involved in the actual implementation of this type of solution, you’ll need to understand each of the different technologies involved.  Fortunately for you, very little needs to be home-grown.  Major vendors like Cisco and Microsoft provide the tools you need to build this type of solution.  The bad news is that no single vendor can meet all of your needs.  You’ll need to understand the different layers of technology involved, and identify solutions for each.

Geographic Load Balancing

Regardless of the type of solution you’ve decided to implement, you’ll need some method for directing users to a backup site when the primary site fails.  The most common method is to implement an intelligent DNS server.  This DNS server will recognize requests for your Web site, determine which of your Web farms is better able to handle the request, and respond to the user with the appropriate IP address.  In this section, I’ll examine a handful of products that provide this type of functionality.
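The decision an intelligent DNS server makes can be sketched in a few lines of Python.  This is only an illustration of the selection logic under simple assumptions: the WebFarm record, its healthy and load fields, and the example addresses are all hypothetical, and commercial products weigh many more factors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WebFarm:
    name: str
    ip: str         # address handed back in the DNS answer
    healthy: bool   # result of the most recent health check (assumed to exist)
    load: float     # fraction of capacity in use, 0.0 to 1.0


def choose_farm_ip(farms: list[WebFarm]) -> Optional[str]:
    """Return the IP of the least-loaded healthy farm, or None if all are down."""
    candidates = [farm for farm in farms if farm.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda farm: farm.load).ip


# Example addresses come from the ranges reserved for documentation.
farms = [
    WebFarm("east", "192.0.2.10", healthy=True, load=0.7),
    WebFarm("west", "198.51.100.10", healthy=True, load=0.3),
]
answer = choose_farm_ip(farms)   # the west farm, because it is less loaded
```

If the west farm's health check fails, the same call returns the east farm's address, which is the fail-over behavior described above.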

Manually Updating DNS

Sometimes the simplest solution is the best solution.  The simplest method of redirecting traffic to a backup Web farm is to manually update the DNS entry (or entries) for your Web site. 

This is simple, but slow.  At best, it will take about ten minutes to notice that the primary site has failed, open your DNS management software, make the modification, and push the change to your authoritative DNS servers.  This is the best-case scenario—ten minutes of your users receiving “Server not found” errors.  The worst-case scenario is bad—what if no knowledgeable administrators are available to make the switch?

If you don’t have staff available 24 hours a day, manually changing the DNS in the event of a Web farm failure is out of the question.  If you do have the staff, it’s an option worth considering.  Its primary advantage is that it allows for the most intelligent switching mechanism available: human beings.  For example, if you know that your primary Web farm is only going to be offline for a few minutes, it may be better not to make the switch at all.

Because human intervention is required, you have the option of manually initiating fail-over procedures.  For example, if database replication is involved, you need to break the replication subscription on your backup database server before bringing that site online.  If you choose to use an automated switching technology, you will have to build scripts that perform these tasks automatically in the event of a failure. 
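If you do script the detection step, one common precaution is to require several consecutive failed health checks before switching, so that a brief blip doesn't trigger a fail-over.  The Python sketch below assumes a hypothetical FailureDetector fed by some external health check; the threshold of three is an arbitrary example, and the detector deliberately stays tripped until an operator resets it, because fail-back is a separate, planned procedure.

```python
class FailureDetector:
    """Trips after N consecutive failed checks and stays tripped until reset."""

    def __init__(self, failures_to_trip: int = 3):
        self.failures_to_trip = failures_to_trip
        self.consecutive_failures = 0
        self.failed_over = False

    def record(self, check_ok: bool) -> bool:
        """Feed in one health-check result; return True if traffic belongs at the backup."""
        if check_ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_to_trip:
            self.failed_over = True
        # A recovering primary does NOT clear the flag; that takes a manual reset.
        return self.failed_over

    def reset(self) -> None:
        """Called by an operator once the fail-back procedure is complete."""
        self.consecutive_failures = 0
        self.failed_over = False
```

Keeping the switch back to the primary site manual also protects you from a flapping primary site that passes one check and fails the next.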

Finally, no switching technology is needed if you manually update your DNS.  Therefore, it reduces the overall cost and complexity of your distributed solution.  It’s quick-and-dirty, but manually updating DNS to provide fail-over between Web farms provides for the fastest deployment.  For that reason, it’s commonly used as a stepping-stone to automated switching solutions.

Automatic Switching Technologies

There are several products available to direct Web requests between different data centers.  At a high level, these technologies intercept browser requests sent to your Web site.  They then direct the user to the best available Web farm.  These products can redirect users through either DNS or HTTP 302 Redirects.  I won’t go into any more detail here, because this discussion is geared around systems engineering and Microsoft technologies.  If you’d like to look into these products further, you can find more information on the most popular products here:

Routing

It’s definitely the least common method described here, but several sites have been known to implement geographic load-balancing using intelligent routing protocols like OSPF (Open Shortest Path First).  Routing protocols reside on the routers that make up your network infrastructure and are responsible for determining the path packets take towards their destination.  For each incoming IP packet, the router must examine the destination IP address, and forward the packet either to another router or to the final destination.

Here’s the trick: configure your Web site in two locations using a single IP address, as shown in Figure 3.  You’ll need a solid understanding of routing and a very flexible, tier 1 ISP.  It’s not the most popular method for a good reason: it’s very complicated.  Nonetheless, I feel confident that all load distribution will eventually be moved into the network—so expect to see this become common in the next five years.

Figure 3: It's possible to rely on the Internet routing infrastructure to distribute requests.
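The effect of advertising one address from two locations can be illustrated with a small shortest-path calculation in Python.  The network below is entirely hypothetical (the router names and link costs are made up), and the function is a plain Dijkstra search rather than an implementation of OSPF itself; it only shows that each user's traffic lands at whichever farm is cheapest to reach, and shifts to the other farm when the first disappears.

```python
import heapq


def nearest_farm(graph, source, farms):
    """Return (farm, total_cost) for the lowest-cost reachable farm, or None."""
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        if node in farms:
            return node, cost  # the first farm popped is the closest one
        for neighbor, link_cost in graph.get(node, {}).items():
            new_cost = cost + link_cost
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return None


# Hypothetical topology: both farms answer for the same IP address.
network = {
    "boston_user": {"east_router": 1},
    "east_router": {"east_farm": 1, "west_router": 10},
    "west_router": {"west_farm": 1, "east_router": 10},
}
farms = {"east_farm", "west_farm"}
normal = nearest_farm(network, "boston_user", farms)

# Simulate the east farm failing by withdrawing its link.
degraded = dict(network, east_router={"west_router": 10})
after_failure = nearest_farm(degraded, "boston_user", farms)
```

In this example the Boston user reaches the east farm at a cost of 2 under normal conditions and the west farm at a cost of 12 after the east farm is withdrawn.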

File Replication

Microsoft has finally given us a working file replication solution.  You’ll have to look carefully, though, because it’s hidden within the Distributed File System (DFS).  If your Web servers use Windows 2000 Server as their operating system, you can configure DFS so that a single file share exists on all of your Web servers.  Windows takes care of the rest.  If you update a file on one server, that change is automatically pushed to every other server participating in the DFS replication. 

DFS does not provide a complete content management solution, however.  It doesn’t have half the functionality of the content replication services built into Site Server 3.0—except that DFS replication actually works reliably.  For most Web sites, DFS replication will be adequate for ensuring content remains identical between multiple servers.  It even handles fail-over and fail-back situations gracefully.  So, if file updates occur when one of your Web farms is offline, those changes will be automatically applied once the site is back online.

If the included DFS replication doesn’t suit your needs, check out these third-party file replication technologies:

Database Replication

Database replication makes file replication seem simple.  Typically, files are updated only when changes are made to the Web site content.  If it takes a few seconds for these updates to replicate to all ten of your site’s Web servers, you’ll probably never know the difference.

Databases, on the other hand, are updated constantly.  If your site uses personalization services, the database may be updated every time a user clicks a link.  If you’re using an active/passive solution, you need these updates to be constantly available at your backup Web farm.  It’s okay if the update is replicated a second or two after it’s written to the primary site, so lazy replication is sufficient.  It’s possible to use lazy replication in active/active architectures, but you must carefully consider your application’s functionality in order to understand the implications.

If you’re using an active/active solution, many (if not all) of your database changes must be replicated the instant they occur.  In fact, to be sure that both Web farms are synchronized, the transactions must be replicated before they are committed.  In this article I’ll call this transactional replication; it is a synchronous form of replication, typically built on a two-phase commit.  With transactional replication, you can be confident that a committed update was applied to all replication partners.

To understand when to use transactional replication, consider the example I used earlier in an active/active architecture.  Two people, one connecting to your west coast Web farm, the other accessing your east coast presence, each want to buy the last PlayStation 2.  If your inventory tables are configured for lazy replication between the two Web farms, it’s possible that both users would be allowed to purchase the last PlayStation.  After each transaction was committed and the user notified of their purchase, the databases would each attempt to replicate the transaction, only to generate a consistency error when they attempted to modify the same record in the inventory table.  You can deal with the consistency error either manually or programmatically, but one of your users is going to be very disappointed.  This problem wouldn’t have occurred if transactional replication had been used on the inventory table, because the first transaction would have caused the inventory count on both databases to be decremented to zero before the user was notified of their purchase.
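The difference between the two approaches can be shown with a toy Python simulation of that last-item scenario.  The Inventory class and both sell functions are hypothetical simplifications: a real database would use locking and a two-phase commit across the network, not two in-memory counters.

```python
class Inventory:
    """One Web farm's copy of the stock count for a single product."""

    def __init__(self, stock: int):
        self.stock = stock


def sell_lazy(local: Inventory) -> bool:
    """Commit against the local copy only; replication happens some time later."""
    if local.stock > 0:
        local.stock -= 1
        return True
    return False


def sell_synchronous(local: Inventory, remote: Inventory) -> bool:
    """Decrement both copies before the customer is told the sale succeeded."""
    if local.stock > 0 and remote.stock > 0:
        local.stock -= 1
        remote.stock -= 1
        return True
    return False


# Lazy replication: each farm still sees one unit, so both sales succeed.
east, west = Inventory(1), Inventory(1)
lazy_sales = [sell_lazy(east), sell_lazy(west)]

# Synchronous replication: the first sale empties both copies.
east, west = Inventory(1), Inventory(1)
sync_sales = [sell_synchronous(east, west), sell_synchronous(west, east)]
```

With lazy replication both sales are accepted for a single unit of stock; with the synchronous version the second sale is refused.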

The drawback to transactional replication is the greatly increased latency.  Updates must be sent from one Web farm across the Internet to the second Web farm.  The second database server must then commit the change and send an acknowledgement back to the first database server.  Therefore, updates to database tables that use transactional replication will take longer to commit—as long as it takes traffic to traverse the network between your Web farms.  If you must use transactional replication, be sure that the backbone between your Web farms is fast enough to provide your users with an update in a reasonable amount of time. 
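A rough estimate of that cost is simple arithmetic: the time to commit locally, plus one network round trip, plus the time for the remote server to commit.  The Python sketch below uses made-up figures (a 5 ms commit at each site and a 70 ms round trip between farms) purely to illustrate the calculation; measure your own network before relying on any number.

```python
def lazy_commit_ms(local_commit_ms: int) -> int:
    """With lazy replication the user waits only for the local commit."""
    return local_commit_ms


def synchronous_commit_ms(local_commit_ms: int,
                          round_trip_ms: int,
                          remote_commit_ms: int) -> int:
    """The user waits for the update to cross the network, commit remotely,
    be acknowledged, and then commit locally."""
    return local_commit_ms + round_trip_ms + remote_commit_ms


# Hypothetical figures, in milliseconds.
lazy = lazy_commit_ms(5)
synchronous = synchronous_commit_ms(5, 70, 5)
slowdown = synchronous // lazy
```

Under these assumed numbers, every update to a table using transactional replication takes 80 ms instead of 5 ms, which is sixteen times longer.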

Another complexity involved with using transactional replication in real-time distributed hosting environments is the possibility that one of the Web farms will fail.  I mentioned earlier that fail-over and fail-back procedures were more complicated for active/active architectures, and transactional replication is largely to blame.  Since both databases have to wait for a response from the other before committing a transaction, what happens when one Web farm fails?   Until the transactional replication relationship is broken, no updates will take place.  Your application must be intelligent enough to handle the inevitable event of a Web farm failure, abandon hope of replicating updates, and convert itself into a stand-alone Web farm. 

Replicating Everything Else

I warned you that distributed hosting solutions were hard, right?  As if file and database replication weren’t difficult enough, it gets worse.  Microsoft provides no way to automatically replicate system, application and security configurations.  If you tweak the TCP window size on one of your Web servers, you’ll have to manually edit the registry on every one of your servers, in all of your Web farms.  If you decide to restrict anonymous access to a DLL in your system directory, you’ll have to manually edit the ACLs on all systems.  Every time you update one of your Web application’s COM objects—you guessed it—make that change on each of your servers.

With so much manual labor required, the chances of human error being introduced are nearly 100%.  After you manage a distributed Web site for a few months you’ll start to discover inconsistencies between servers that are supposed to be identical.  I can’t suggest any technology that solves this problem for you, because none exists yet.  The best I can suggest is to keep very careful records of updates, and to perform audits on a regular basis.
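A regular audit can be as simple as collecting the same list of settings from every server and reporting any setting whose value differs.  The Python sketch below assumes you have already gathered each server's settings into a dictionary by some means of your own; the server names, setting names, and values shown are illustrative only.

```python
def find_drift(configs: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return {setting: {server: value}} for every setting that is not
    identical on all servers.  A setting absent from a server is reported
    with the placeholder value "<missing>"."""
    all_settings = set()
    for settings in configs.values():
        all_settings.update(settings)

    drift = {}
    for setting in sorted(all_settings):
        values = {server: settings.get(setting, "<missing>")
                  for server, settings in configs.items()}
        if len(set(values.values())) > 1:
            drift[setting] = values
    return drift


# Illustrative data: one server missed a registry tweak and a component update.
collected = {
    "east-web1": {"TcpWindowSize": "64240", "OrderComponentVersion": "1.4"},
    "east-web2": {"TcpWindowSize": "64240", "OrderComponentVersion": "1.4"},
    "west-web1": {"TcpWindowSize": "17520", "OrderComponentVersion": "1.3"},
}
report = find_drift(collected)
```

Here the report flags both settings and shows that west-web1 is the odd one out, which tells you which server to fix and which update record to check.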

There is hope on the horizon, and Microsoft has named it Application Center Server.  This product is designed to handle the replication of many aspects of your application configuration, both between servers in a single Web farm and between geographically distant Web farms.  As I write this, it’s still in beta, but it may be released by the time this is printed.

Find more information on Application Center Server at http://www.microsoft.com/applicationcenter/

Disaster Strikes: Failing Over and Failing Back

After all the effort you’ve put into developing an active/passive, geographically redundant Web site, you may feel thrilled when something terrible does happen!  Maybe that long-awaited earthquake finally strikes California, and your Silicon Valley Web farm can’t be reached from the surviving 49 states.  If you’ve automated your fail-over procedure, just kick back and watch as your users’ requests are directed to your secondary Web farm.  If you’re relying on manual fail-over procedures, you’d better act fast.  Here’s a sample high-level fail-over procedure to give you an idea of what you need to keep in mind:

Sample Fail-Over Procedure

  1. Ensure no updates can be applied against primary site.  If the primary site is flapping (appearing and disappearing), disable it completely.

  2. Break file and database replication relationships on the secondary site.

  3. Set secondary databases to allow updates.

  4. Redirect HTTP requests to the secondary site.

  5. Verify functionality and performance of the secondary site.
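If you automate the procedure above, the script needs to run the steps in order and stop the moment one fails, so that traffic is never redirected to a half-prepared site.  The following Python sketch shows only that skeleton; each step is a hypothetical placeholder that simply reports success, and you would replace them with your own commands.

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]


def run_procedure(steps: list[Step]) -> list[str]:
    """Run each step in order and stop at the first failure.
    Returns a log with one line per step attempted."""
    log = []
    for description, action in steps:
        if action():
            log.append(f"OK: {description}")
        else:
            log.append(f"FAILED: {description}")
            break  # never continue past a failed step
    return log


# Placeholder actions: each lambda stands in for a real script or command.
FAIL_OVER_STEPS: list[Step] = [
    ("Block updates to the primary site", lambda: True),
    ("Break file and database replication on the secondary site", lambda: True),
    ("Set secondary databases to allow updates", lambda: True),
    ("Redirect HTTP requests to the secondary site", lambda: True),
    ("Verify the secondary site", lambda: True),
]

fail_over_log = run_procedure(FAIL_OVER_STEPS)
```

Keeping the log gives you a record of exactly how far the procedure got, which matters when you later plan the fail-back.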

Your fail-over procedures went off without a hitch, and you were able to take your time repairing your primary Web farm.  You’re confident the site is stable, and would like to send requests to the primary site once again.  This is actually more difficult than the fail-over procedure, but you have the luxury of scheduling the change to your Web site’s least busy hours.  Read through the sample fail-back procedure to get an idea of the tasks you need to handle when taking your secondary site back offline.

Sample Fail-Back Procedure

  1. While keeping the failed primary site offline, verify that it is behaving reliably.

  2. Freeze secondary site from changes to non-dynamic data (everything except your database).

  3. Replicate files, configuration information, and anything else that has been updated on your secondary site since the primary site went offline.

  4. During a period of planned downtime (e.g., at 3 a.m.), take the secondary site completely offline.  Optionally, direct users to a static page describing the outage.  Your site must be taken offline such that no updates can occur at the secondary site while the primary site is being re-synchronized.

  5. Replicate (or manually copy) database updates from the secondary site back to the primary site.  Carefully verify that all updates applied against the secondary site while it was online have been copied to the primary site.

  6. Reconstruct replication subscriptions such that the secondary site is once again receiving updates from the primary site.

  7. Verify that the primary site is functioning correctly, including accepting updates.

  8. Redirect HTTP requests to the primary site.
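The verification in step 5 is the one most worth scripting.  One simple approach, sketched below in Python, is to compute a checksum of each table at both sites and list the tables that don't match.  The sketch assumes you can export each table as a list of rows; the table names and rows shown are hypothetical, and a large production database would need something more efficient than hashing every row.

```python
import hashlib


def table_checksum(rows: list[tuple]) -> str:
    """Checksum of a table's contents, independent of row order."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def unsynchronized_tables(primary: dict[str, list[tuple]],
                          secondary: dict[str, list[tuple]]) -> list[str]:
    """Names of tables whose contents at the primary differ from the secondary."""
    return sorted(
        name for name, rows in secondary.items()
        if table_checksum(primary.get(name, [])) != table_checksum(rows)
    )


# Hypothetical exports taken after copying updates back to the primary site.
secondary_export = {
    "orders": [(1, "widget"), (2, "gadget")],
    "inventory": [("widget", 9), ("gadget", 4)],
}
primary_export = {
    "orders": [(2, "gadget"), (1, "widget")],   # same rows, different order
    "inventory": [("widget", 9), ("gadget", 5)],  # one row was not copied
}
still_out_of_sync = unsynchronized_tables(primary_export, secondary_export)
```

In this example only the inventory table is reported, so you would re-copy it and check again before redirecting requests to the primary site.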

Summary

The vast majority of the Web sites on the Internet will completely disappear if there is a network failure at their data center.  Even if your hosting provider assures you of complete network redundancy, failures will happen.  If your Web site is too important to accept this downtime, you need a solution that spans the globe.  You need a geographically distributed Web site to keep your site online even during the most heinous of Internet outages and natural disasters.

But there’s a reason most sites settle for residing within a single Web farm: it’s very costly to distribute your complete application.  If cost is a concern, consider placing a stand-alone Web server in a remote data center simply to notify users of a failure at the primary site—it’s better than nothing.  For many commercial sites, the cost of being offline is far greater than the cost of building a distributed application.  If that describes your Web site, you’ll need to distribute it between at least two different locations.

With the technologies outlined in this article, building a distributed hosting solution is straightforward.  However, designing such a solution is exceptionally complicated, so I suggest finding a tier 1 hosting provider who has multiple data centers for Web hosting, owns their own Internet backbone, and has experience implementing distributed solutions.  If you’re implementing an active/active solution, you’ll also need programmers skilled in building distributed applications.  It’s a long path, but it’s the next logical step towards bringing serious business to the Web.
