Figure 2 shows a Storage Area Network (SAN) environment. You connect the SAN hardware to the server through a Fibre Channel connection (preferably redundant) and access the file system as if it were local. You can use a snapshot utility on the SAN to perform a quick backup (typically measured in seconds), then restore the data from a backup disk almost as quickly. As long as you create snapshots relatively frequently, you can restore data within the confines of even the most stringent SLA. Snapshot functionality might even be irrelevant: At my company, for the second half of 2001, our EMC SAN experienced 100 percent uptime, our Brocade switches boasted 99.9999861 percent availability, and we never experienced disk problems that required us to restore from snapshots. If you're concerned that the SAN might fail, you could implement redundant SANs with failover technology.
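To put an availability percentage such as 99.9999861 into perspective, it helps to convert it into downtime. The following Python snippet is a minimal sketch of that arithmetic. It assumes the figure applies continuously to the whole second half of 2001 (July 1 through December 31); the function name `downtime_seconds` is illustrative only.

```python
from datetime import date


def downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Return the downtime implied by an availability percentage over a period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_seconds


# Assumption: the measurement window is July 1 through December 31, 2001.
period_days = (date(2002, 1, 1) - date(2001, 7, 1)).days
period_seconds = period_days * 24 * 60 * 60

switch_downtime = downtime_seconds(99.9999861, period_seconds)
print(f"{period_days} days, about {switch_downtime:.1f} seconds of downtime")
```

Under these assumptions, the switch figure works out to roughly 2 seconds of downtime over six months. You can run the same calculation against the availability target in your own SLA to see how much downtime it actually permits.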
Unfortunately, unless a SAN replaces hundreds of small file servers, many hardware snapshot products and SAN technologies are prohibitively expensive. Even a small 400GB EMC and Brocade SAN infrastructure can cost $300,000. When you compare that price with the price of two servers, each with 200GB of local storage ($19,400 for the pair), you begin to understand the cost of very high availability. However, cost-effective SAN and Network Attached Storage (NAS) vendors do exist, so be sure to shop around carefully before you make any final judgments about SLA costs.
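The price gap is easier to judge as a cost per gigabyte. The snippet below is a rough sketch that uses only the figures quoted above: $300,000 for a 400GB SAN and $19,400 for two servers with 200GB of local storage each. The helper name `cost_per_gb` is hypothetical, and the comparison ignores operating costs, which the article doesn't quantify.

```python
def cost_per_gb(total_cost_usd: float, capacity_gb: float) -> float:
    """Return the purchase cost per gigabyte of storage."""
    return total_cost_usd / capacity_gb


san_cost, san_capacity = 300_000, 400            # small EMC and Brocade SAN
server_cost, server_capacity = 19_400, 2 * 200   # two servers, 200GB each

san_per_gb = cost_per_gb(san_cost, san_capacity)
local_per_gb = cost_per_gb(server_cost, server_capacity)
premium = san_cost / server_cost

print(f"SAN:   ${san_per_gb:.2f} per GB")
print(f"Local: ${local_per_gb:.2f} per GB")
print(f"The SAN costs about {premium:.1f} times as much for the same capacity")
```

With these numbers, the SAN costs $750 per gigabyte versus $48.50 per gigabyte for local storage, or about 15 times as much.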
Fault-tolerant systems. The two previous high-availability strategies focus on reducing the time necessary to restore the server and data. The third strategy involves implementing redundant systems that continue serving the client indefinitely if one system fails. You can make many components redundant: servers, disks, NICs, UPSs, switches, and so on. Some of these components are easy to add and relatively inexpensive. For example, if you add redundant NICs, power supplies, and disk controllers to the aforementioned ProLiant DL380 system, the cost rises from $9700 to about $11,600. However, ask yourself whether you need to spend that money. At my company, we're experiencing a failure rate of less than 0.025 percent on those components. (Probably the most crucial, and by far the cheapest, item you need is a UPS. If you don't have a UPS, put this article down and deploy one before reading any further.)
Let's look at three technologies for implementing redundant data and redundant servers: Dfs, RAID, and server clusters. The system in Figure 1 distributes user data across several physical locations and uses Dfs to present a simple, logical view of the data. If you have an exact replica of any of the file directories (for example, the \products directory), you can mount both the original and the replica at one point in the Dfs namespace. When users traverse a directory tree in Windows Explorer to reach the \products directory, they might be viewing data residing at the original or the replica. Dfs doesn't require that data in the various shares mounted at one point be identical, but you can configure Dfs so that it replicates the data on a schedule. If the server holding the replica crashes, users can still open files in the original location, which makes for a somewhat fault-tolerant scenario. If you configured multiple replicas, you would have even more redundancy. (If you want to have redundancy for the top-level share as well as the replicas of the data, you need to create a fault-tolerant Dfs root. For more information, see the Dfs documentation.) Any data not written to disk on the crashed server obviously would be lost, as would any nonreplicated data. The Dfs replication process isn't well suited for highly dynamic data, so you need to evaluate this technology carefully to determine whether it's appropriate for your SLA strategy.
Whereas Dfs can replicate data between servers, RAID addresses the distribution and replication of data among one server's disks. On paper, computer disks boast extremely high reliability. For example, Seagate Technology claims that its Cheetah 36GB Ultra 160 SCSI disk provides 1,200,000 hours mean time between failures (MTBF), which means you can expect a failure approximately every 137 years. However, this MTBF value is deceiving. Hard disk failures are common and have many external causes. At my company, hard disks are our top-ranking hardware service item; we repair or replace an average of 66 disks per year in one data center that holds approximately 8000 disks of various ages. Executive Software's study "Survey.com Hard Drive Issues Survey" (http://www.execsoft.com/diskalert/reviews/hard-drive-survey.asp) uncovered similarly frightening statistics: 62 percent of data-center IT administrators rank disk failures as their top disk problem and estimate the average life of a SCSI disk at 3 or 4 years. The trick is to implement fault-tolerant RAID technology so that disk failures don't result in downtime.
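One way to see why the 137-year figure is deceiving is to treat MTBF as a statistic about a large population of disks rather than the lifetime of any one disk. The Python sketch below converts the quoted MTBF into an annualized failure rate and an expected number of failures for a fleet of 8000 disks. It assumes a constant failure rate (an exponential model), which real disks don't strictly follow, and the function names are illustrative only.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours of continuous operation


def mtbf_years(mtbf_hours: float) -> float:
    """Express an MTBF figure in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_rate(mtbf_hours: float) -> float:
    """Probability that one disk fails within a year (constant-rate assumption)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(disk_count: int, mtbf_hours: float) -> float:
    """Expected yearly failures in a fleet where failed disks are replaced."""
    return disk_count * HOURS_PER_YEAR / mtbf_hours


quoted_mtbf = 1_200_000  # hours, the vendor figure cited in the text
fleet_size = 8000        # disks in the data center described in the text

print(f"MTBF in years: {mtbf_years(quoted_mtbf):.0f}")
print(f"Annual failure rate per disk: {annual_failure_rate(quoted_mtbf):.2%}")
print(f"Expected failures per year: {expected_failures_per_year(fleet_size, quoted_mtbf):.1f}")
print(f"Observed failure rate: {66 / fleet_size:.2%}")
```

Under this model, even the vendor's own figure implies dozens of failed disks per year in a fleet of that size, which is in the same range as the 66 per year reported above. A long MTBF therefore doesn't mean you can ignore disk failures. It means you should plan for a steady trickle of them.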
Figure 3 shows a simple fault-tolerant RAID configuration. RAID 1 technology mirrors the disks in real time: If one disk crashes or becomes corrupted, the other disk continues to operate as usual and the system sees no performance degradation, although it's now operating without redundancy. The system that Figure 3 shows uses RAID 1 mirroring for the OS and the swap files. If any disk fails, you can remove it and replace it with a healthy disk, without turning off the computer. The RAID 1 SCSI controller creates a copy of the OS or swap file on the new disk, then reestablishes fault tolerance. No downtime occurs as a result of a disk failure, albeit at the expense of doubling the number of drives in the system. (For further information about RAID technology, go to the Advanced Computer & Network Web site at http://www.acnc.com/index.html.)
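The mirroring behavior described above can be illustrated with a toy model. The Python sketch below is not how a RAID controller is implemented. It's a simplified illustration in which each disk is a dictionary of blocks, and the class name `MirroredVolume` and its methods are hypothetical. It shows the three stages in the text: writes go to both disks, reads survive a single disk failure, and a replacement disk is rebuilt from the survivor.

```python
class MirroredVolume:
    """Toy model of a two-disk RAID 1 mirror (illustrative only)."""

    def __init__(self):
        # Each disk maps block number -> data; None marks a failed disk.
        self.disks = [{}, {}]

    def _healthy(self):
        return [disk for disk in self.disks if disk is not None]

    def write(self, block, data):
        """Write the same data to every healthy disk."""
        if not self._healthy():
            raise RuntimeError("no healthy disk left")
        for disk in self._healthy():
            disk[block] = data

    def read(self, block):
        """Read from the first healthy disk."""
        if not self._healthy():
            raise RuntimeError("no healthy disk left")
        return self._healthy()[0][block]

    def fail(self, index):
        """Simulate a disk crash."""
        self.disks[index] = None

    def is_redundant(self):
        return len(self._healthy()) == len(self.disks)

    def replace(self, index):
        """Hot-swap a failed disk and rebuild it from the surviving disk."""
        if not self._healthy():
            raise RuntimeError("nothing to rebuild from")
        self.disks[index] = dict(self._healthy()[0])


volume = MirroredVolume()
volume.write(0, "os image")
volume.fail(0)                       # one disk crashes
data_after_failure = volume.read(0)  # still served by the surviving disk
degraded = not volume.is_redundant()
volume.replace(0)                    # swap in a healthy disk and rebuild
print(data_after_failure, degraded, volume.is_redundant())
```

The model makes the trade-off visible: the data stays available through the failure, but until the replacement disk is rebuilt, a second failure would lose everything.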