SideBar    Disaster Recovery Means Availability, Too

Failover Clustering
The primary high-availability technology in SQL Server 2000 is failover clustering. In "Clustering SQL Server," page 27, Brian Knight walks you through the details of setting up a two-node SQL Server 2000 cluster on Windows 2000 Enterprise Edition. Before SQL Server 2000, failover clustering wasn't pretty and rarely worked well. However, Microsoft completely rewrote failover clustering in SQL Server 2000, and the technology became easy to implement because of SQL Server 2000's many new capabilities. You manage the cluster the same way you manage a single-machine SQL Server installation. Unless you're directly interacting with the desktop and Cluster Administrator (the utility that lets you set up clustering), SQL Server doesn't expose the fact that it's running on a cluster. This transparency makes administration of a cluster simple.

Although administrative procedures don't change when you go from a single-machine configuration to a multimachine configuration, you gain an important capability. In the event of hardware failure, the cluster automatically moves the SQL Server 2000 application to another available piece of hardware in the cluster and brings it online. If you've used certain development practices (which I describe later), your end users will never know that you had a hardware failure. This seamlessness is important because the availability rate at the server level is a relatively useless number. Only the availability at the data-consumer's desktop is important.

When the piece of hardware (called a node) that SQL Server is running on fails, SQL Server shuts down. The Microsoft Cluster service moves the ownership of the disk space to another node and starts up SQL Server. The failover of resources in a cluster occurs within about 15 seconds under normal circumstances. However, many people don't realize the implications of SQL Server shutting down and restarting. When SQL Server starts, it runs through a specific, sequential process called restart recovery. The crucial part of restart recovery is the undo-redo phase for a database. The redo phase is always short; you control it through the recovery- interval configuration option. In the undo phase, you're at the mercy of how an application was coded. Long-running transactions can severely degrade the cluster's failover time.

For example, say a user initiates an operation in the database that takes 6 hours to complete. At 5:59:00 into the process, the node running SQL Server fails. The Cluster service takes over, moves resources, and brings SQL Server online within 15 seconds. Although the cluster failed over in 15 seconds, users now have to wait nearly 6 hours before their data becomes available again because of the undo phase of recovery. Had the developer written this application so that it ran in recoverable chunks with a maximum duration of 3 to 5 seconds, the cluster would have failed over and data would have become available to users within 20 seconds of the primary node failure.

The other area where close developer support can produce an environment that gives significantly higher end-user availability is in the connection and transaction handling inside an application. Regardless of the hardware a SQL Server is running on, it will always have the same name. If a user is connected to a clustered SQL Server that suddenly fails over, that user is disconnected. To continue processing, all the user has to do is reestablish the connection to SQL Server. Developers can use this opportunity to make cluster failovers transparent to end users by programming the application to detect a disconnection from SQL Server, wait a few seconds, then reconnect. Once reconnected, the application should be able to reissue any transaction that hadn't completed before the failover. Leveraging the application this way allows for a resilient system architecture that ensures that no user data is lost even during a system failure.

Now, let me dispel some common misconceptions about clusters. A SQL Server cluster is a hardware-protection mechanism only. A SQL Server cluster won't protect you from data failures, application-logic failures, user errors, or data corruption. To protect against these errors, you need to leverage database backups. A SQL Server cluster also doesn't increase performance or scalability. You might have multiple pieces of hardware acting as one resource, but SQL Server can't access resources on another piece of hardware, and an application can't scale any further on a cluster than it can on one node.

Log Shipping
Because failover clustering only protects against hardware failures without any ability to offload processing or produce scalability, it isn't all you need for high availability. Log shipping, another SQL Server high-availability technology, is a fancy name for automating a database backup and restoring to a different server. Several articles by Ron Talmage (see the Related Reading box) cover log shipping in detail, so I won't examine the internal mechanisms of the technology here.

Log shipping provides a resilient option for protecting against hardware failure. Log shipping works by copying a primary server's backup files to a secondary server, then restoring the files on the secondary server. The basic design of log shipping provides for a primary server and a secondary server that are independent, not sharing any hardware components. This configuration removes the single point of hardware failure present in a cluster that uses a shared drive array.

You can use log shipping to provide significantly higher data redundancy with many more environments than is possible with failover clustering. For example, Windows Server 2003 Datacenter Edition supports a maximum of eight nodes in a cluster. But the only thing that limits the number of secondary servers you can have with log shipping is how much infrastructure you can manage; you can create as many copies of the backup as you want. Log shipping can also operate across much longer distances than failover clustering. Most failover clusters are limited to approximately 80 to 100 miles. With log shipping, only the length of the cable between the primary and secondary servers limits the geographic dispersion of your secondary servers.

Prev. page     1 [2] 3     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

Thank you, the article was very helpful.

ITS

Article Rating 5 out of 5

 
 

ADS BY GOOGLE