
Real-world implementation and high-availability design guidelines

Today, systems administrators face the challenge of making Windows 2000 available more than 99.9 percent of the time. To address this challenge, Microsoft has partnered with several top-tier OEMs to deliver and support Win2K Datacenter Server. The result of this collaboration is the Windows Datacenter Program, which provides customers with a list of certified configurations that Microsoft has thoroughly tested for reliability. Hewlett-Packard (HP), an OEM involved in the Windows Datacenter Program, has been working through the challenges and pitfalls of Datacenter implementations. Learning from these experiences, HP engineers and consultants have developed a valuable list of best practices to share with Datacenter customers around the world. With these best practices in mind, you can more easily decide whether Datacenter makes sense for you and see what you must do to create your own high-availability infrastructure.

For more information about the Windows Datacenter Program, see Greg Todd, "Win2K Datacenter Server," December 2000, and the Microsoft article "The Datacenter Program and Windows 2000 Datacenter Server Product" (http://support.microsoft.com/support/kb/articles/q265/1/73.asp). You can also visit Microsoft's Datacenter Web page at http://www.microsoft.com/windows2000/datacenter.

High Availability 101
Does your environment need a high-availability solution? To determine which high-availability technologies are relevant to your environment, you need to understand your availability requirements. Only then can you begin to design an infrastructure that meets your needs.

You also need to understand the difference between fault resilience and fault tolerance. Fault-resilient systems consist of clusters that achieve high availability through failover. Microsoft Cluster service is a clustering solution that makes Datacenter and Win2K Advanced Server fault-resilient. Cluster nodes have independent system images, and failover can take from a few seconds to several minutes. (A system image, which completely describes the point-in-time status of a particular system, is unique to each computer system and changes rapidly. This image includes such information as memory, CPU registers, disk and memory buffers, and message queues.)

Applications on fault-resilient systems use checkpoint files to recover application data. A checkpoint file is a log file, such as a database transaction log, that lets an application recover its state—the processing stage of the application at a certain point in time—after a power failure or hardware failure. Following a failure, the application first looks at checkpoint files stored on the disk to either roll forward or roll back transactions that were incomplete at the time of failure. Fault-resilient systems recover only to the most recent checkpoint. Information not saved to some form of checkpoint file (i.e., residing only in memory) will be lost on failover.
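
To make the roll-forward and roll-back idea concrete, here is a minimal sketch of checkpoint-style recovery. It is a toy model and not how Cluster service or any particular database implements its log: the record layout, the `recover` function, and the sample log are all hypothetical names invented for illustration.

```python
def recover(log):
    """Rebuild application state from a toy transaction log.

    Each record is a tuple, either ("set", txn_id, key, value)
    or ("commit", txn_id). This layout is an assumption made
    for this example only.
    """
    # Transactions whose commit record reached the disk before the failure.
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    state = {}
    for rec in log:
        if rec[0] == "set" and rec[1] in committed:
            _, _, key, value = rec
            state[key] = value  # roll forward: reapply committed work
        # Writes from uncommitted transactions are skipped, which has
        # the effect of rolling them back.
    return state


# Hypothetical log as it might look on disk after a failure.
log = [
    ("set", 1, "balance", 100),
    ("commit", 1),
    ("set", 2, "balance", 250),  # transaction 2 never committed
]
state = recover(log)
print(state)
```

In this sketch, the update from transaction 2 is lost because its commit record never reached the log, which mirrors the point above: anything that wasn't saved to a checkpoint file doesn't survive failover.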

Fault-tolerant systems, which have tighter coupling of resources, keep applications available by protecting one system image. When constituent components fail, redundant components take over so that the system image runs uninterrupted. Applications that run on a fault-tolerant system don't require checkpoint files; they simply depend on the underlying fault-tolerant platform to keep the system running. Proprietary and highly customized hardware and software characterize fault-tolerant systems, so they are typically more expensive than their fault-resilient counterparts. Most high-availability computing uses fault-resilient systems, which don't require the same level of expensive custom hardware or software. However, fault-tolerant systems can more commonly achieve 99.999 percent planned availability.

In terms of high availability, a key difference between fault-tolerant and fault-resilient systems is recovery time. Fault-tolerant systems boast recovery times that approach zero. Fault-resilient systems (i.e., Cluster service clusters) have recovery times that range from a few seconds to several minutes because of the time necessary for failover.

By the Numbers
Availability is the ratio of the amount of time that a system is available to the amount of time the system should be available. Industry convention is to express availability as a percentage. The mythical perfect system would be available 100 percent of the time. Real systems, of course, post lower percentages.

You can use the simple calculation

A = MTBF/(MTBF+MTTR)

where A is availability, MTBF is mean time between failures, and MTTR is mean time to repair (or recover), to find a system's availability. "Three nines" conveys that availability is 99.9 percent, "four nines" conveys that availability is 99.99 percent, and so on. If you use 20 minutes as the MTTR value (Microsoft claims 20 minutes is the average time necessary to restore a Win2K or Windows NT system) and 0.999 as the A value, solving the equation for MTBF gives 19,980 minutes, or approximately 14 days. (Not coincidentally, 14 days is the duration of the Microsoft stress test for Datacenter hardware and kernel-mode drivers.) The primary high-availability design goal is to increase A by increasing MTBF and decreasing MTTR.
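
The arithmetic behind that 14-day figure can be checked with a short sketch. It rearranges the availability equation to MTBF = A × MTTR/(1 - A). The function and variable names are illustrative choices, and the only inputs are the 20-minute MTTR and the 0.999 availability quoted above.

```python
MINUTES_PER_DAY = 24 * 60


def availability(mtbf, mttr):
    """A = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    return mtbf / (mtbf + mttr)


def mtbf_for(target_availability, mttr):
    """Solve the availability equation for MTBF: A * MTTR / (1 - A)."""
    return target_availability * mttr / (1.0 - target_availability)


mttr_minutes = 20.0  # average restore time quoted in the text
mtbf_minutes = mtbf_for(0.999, mttr_minutes)
mtbf_days = mtbf_minutes / MINUTES_PER_DAY

print(f"MTBF: {mtbf_minutes:.0f} minutes, about {mtbf_days:.1f} days")
```

Run with these inputs, the sketch reports an MTBF of roughly 19,980 minutes, or just under 14 days. Halving the MTTR halves the MTBF you need for the same availability, which is why shortening recovery time matters as much as preventing failures.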

Table 1 gives an overview of availability in terms of nines. The table's downtime numbers are measurements of unplanned downtime. (In today's world of high availability, techniques such as online backup and rolling upgrades for system maintenance or hardware updates keep planned downtime close to zero.) Do you need three or more nines? Costs can increase 10-fold for each nine that you add. Take a close look at your business. What does downtime cost you? To justify a high-availability solution, you need to start by calculating the cost of an unavailable system. Table 2 shows sample downtime costs per hour from various industries. Table 3 shows causes of downtime as evenly divided among planned outages, software, and physical factors (i.e., people, hardware, and environment).
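
To get a feel for what each additional nine means, you can convert an availability percentage into downtime per year and then into cost. The sketch below assumes a 365-day year of round-the-clock service. Its hourly cost is a made-up placeholder, not a figure from Table 2, and it doesn't claim to reproduce the values in Table 1.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 24 x 7 service


def downtime_hours_per_year(availability_percent):
    """Hours per year that a system at this availability is down."""
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


def annual_downtime_cost(availability_percent, cost_per_hour):
    """Yearly downtime cost for a given hourly cost of an outage."""
    return downtime_hours_per_year(availability_percent) * cost_per_hour


HYPOTHETICAL_COST_PER_HOUR = 10_000  # placeholder; substitute your own figure

for pct in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    cost = annual_downtime_cost(pct, HYPOTHETICAL_COST_PER_HOUR)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) "
          f"per year, cost {cost:,.0f}")
```

Each added nine cuts yearly downtime by a factor of 10. Because costs can rise 10-fold for each nine, the cost of an outage hour in your own business is what determines whether the next nine pays for itself.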

Glancing at this data, you can easily understand the importance of people and processes to achieving high availability. In a recent white paper, "Increasing System Reliability and Availability with Windows 2000," Microsoft refers to industry studies showing that 80 percent of system failures are the result of human error or flawed processes.


Reader Comments

Oh no! HP doesn't do datacenter. If you think installing DC on an 8 way server just to satisfy the microsoft's dc programme rules, yes, HP does this very well. The truth is; Unisys ES7000 is the FIRST and ONLY intel based machine that can run W2K DC on 32 CPUs and 64Gb RAM (the figures that you love to mention about).

Yavuz Guceri