Basic blueprints for file servers, Web servers, and DNS servers
You'll find a glut of articles that discuss high-availability concepts and strategies, as well as a plethora of articles that cover the engineering details of high-availability solutions' components. You're probably ready for an article that shows you how to put those components to use in your IT environment. Perhaps you've promised a high service level agreement (SLA) to your customers, and now you need to know how you're going to keep that promise. If you need to configure a high-availability Windows 2000 file server, Web server, or DNS server, you'll find this article's collection of basic blueprints extremely valuable.
High-Availability Management Phases
Application availability is inversely proportional to the total application downtime in a given time period (typically a month), and the total downtime is simply the sum of the duration of each outage. To increase a system's availability, you need to decrease the duration of outages, decrease the frequency of outages, or both. Before I discuss useful technologies, you need to understand the phases of a postoutage restoration.
In the event of a serious outage, you would need to build a new server from scratch and restore all the data and services in the time available to you. Suppose you've promised an SLA of 99.5 percent, and you're counting on only one outage per month. (For information about calculating availability percentages, see the sidebar "Measuring High Availability.") Within 3 hours and 43 minutes of the start of an outage, you would need to work through the following five restoration phases:
- Diagnostic phaseDiagnose the problem and determine an appropriate course of action.
- Procurement phaseIdentify, locate, transport, and physically assemble replacement hardware, software, and backup media.
- Base Provisioning phaseConfigure the system hardware and install a base OS.
- Restoration phaseRestore the entire system from media, including the system files and user data.
- Verification phaseVerify the functionality of the entire system and the integrity of user data.
Regardless of your SLA, you need to know how long the phases take. Each phase can introduce unexpected and unwelcome delays. For example, an unconstrained diagnostic phase can take up the lion's share of your available time. To limit how much time your support engineers spend diagnosing the problem, set up a decision tree in which the engineers proceed to the procurement phase if they don't find what's wrong within 15 minutes. The procurement phase can also be time-consuming if you keep the backup media offsite and have to wait for it to be delivered. I once experienced a situation in which the truck delivering an offsite backup tape crashed en route to our data center. You might think 3 hours and 43 minutes is a short period of time in which to restore service, but in reality, you might have only about 2 hours to complete the actual restoration phase.
Blueprint for High-Availability File Servers
A file server doesn't require much CPU capacity or memory. To support 500 users and 200GB of data, you might use one small server, such as a Compaq ProLiant DL380 with two Pentium III processors and 512MB of RAM. With minimal accessories, such a setup costs about $9700 retail. If you use a DLT drive that provides an average transfer rate of 5MBps, physically restoring 200GB of data will take 666 minutes, or 11 hours and 6 minutes. Add an hour for the diagnostic, procurement, base provisioning, and verification phases, and you're looking at 726 minutes to recover from an outage. If you assume one outage per 31-day month (44,640 minutes), then $9700 buys you a file server with an SLA of 98.37 percent.
To increase the availability of file servers, you can use standard strategies: Reduce the time required to restore the file share and data during an outage and reduce the frequency of outages. Many technologies address each of these strategies for file servers. As a starting point, let's look at basic implementations that use the following techniques: data partitioning, snapshot backup-and-restore technologies, and fault-tolerant systems.
Data partitioning. In the configuration that Figure 1 shows, FileServer2 contains product data, FileServer3 contains images, and so on. To make this partitioning transparent to the user, you can implement a technology such as Microsoft Dfs, which lets you create a virtual file system from the physical nodes across the network. A user who connects to \fileserver1\share would see a directory structure that appears to show all data as if it were residing on FileServer1, even though some of the data physically resides on FileServer2.
Table 1 shows the availability you can achieve through data partitioning. (This table uses a typical SLA formula and assumes that you need to restore only one server during the outage.) The cost per server goes down as you accumulate servers and as the number and size of the servers' disks decrease. Obviously, the partitioning option is costly in terms of server hardware, so you need to decide whether you can live with an average of 12 hours 6 minutes (726 minutes) unscheduled downtime per month. Perhaps spending an extra $23,800 ($33,500 minus $9700) to reduce that time to 3 hours 13 minutes (193 minutes) makes sense for you. Data partitioning is particularly cost-prohibitive if you have huge quantities of data and need dozens of servers or more.
Snapshot backup and restore. An alternative to data partitioning is to implement faster technology. Faster tape drives won't necessarily provide a quantum leap in performance, so you'll need to use snapshot backup-and-restore technology, which is typically available in conjunction with Independent Hardware Vendors (IHVse.g., EMC, Compaq) of enterprise storage systems. Upcoming software snapshot products might change this equation, but for now, you need to address the enterprise storage vendors.
Prev. page  
[1]
2
3
4
5
next page