January 14, 2010 12:00 AM

Disaster Recovery Plan Testing 101

Don’t let a disaster be your first test of your recovery plan
Windows IT Pro
InstantDoc ID #103400

Failover and failback. If you’re running in a redundant or high-availability environment, you should regularly test that capability by initiating a failover operation. Make sure that not only does your system fail over to its backup but that you can successfully fail back to your main production system when the disaster is over.

Test of power backup (generator/UPS). All the backups and redundant servers in the world won’t do you any good if your computer center doesn’t have power. Most organizations have a good uninterruptible power supply (UPS) system as the first line of defense and a generator for long-term power backup when grid power is out.

You should test your UPS systems for their ability to carry load. The batteries in these units typically don’t last more than a few years, and you are probably adding equipment to your racks all the while, so capacity that was adequate at installation might not be adequate now.

Make sure that your UPS systems keep up. They should hold your entire computer center long enough for your generators to kick in or for you to safely power down your computers if you don’t have a generator.
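Before you pull any plugs, a back-of-the-envelope estimate can tell you whether the UPS is even in the right range. The sketch below is only that: the function names, the 0.9 inverter efficiency, and the 2x safety margin are illustrative assumptions, not vendor figures. Real runtime depends on battery age and the manufacturer's runtime curves, so a measured load test remains the authority.

```python
def ups_runtime_minutes(battery_wh: float, load_w: float,
                        efficiency: float = 0.9) -> float:
    """Rough runtime estimate: usable stored energy divided by the load.

    battery_wh  -- rated battery energy in watt-hours (assumed figure)
    load_w      -- measured load on the UPS in watts
    efficiency  -- assumed inverter efficiency, not a vendor number
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return battery_wh * efficiency / load_w * 60


def covers_gap(runtime_min: float, required_min: float,
               margin: float = 2.0) -> bool:
    """True if the estimated runtime beats the required time by a safety margin."""
    return runtime_min >= required_min * margin


if __name__ == "__main__":
    # Hypothetical numbers: a 2,000 Wh battery string carrying a 3,000 W load.
    runtime = ups_runtime_minutes(battery_wh=2000, load_w=3000)
    print(f"Estimated runtime: {runtime:.0f} min")
    print("Covers a 5-minute generator start:", covers_gap(runtime, required_min=5))
```

Rerun the estimate whenever you add equipment to the racks, using the load the UPS actually reports rather than nameplate wattage.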

The easiest way to test this in a smaller computer center is to pull the plug and see what happens (make sure you’re ready for downtime first). In larger environments, monitoring and testing software can assist with this.

Generators should be regularly started up and serviced. Again, some of the larger units can be programmed to do this automatically. But you might want to force the issue and cut the building power and see how fast the generator kicks on.

You should probably run the generator for an extended period (say, a day or more) to monitor fuel usage and heat and exhaust dissipation, to make sure it will run for the long term. Many companies in Houston were running on generator power for weeks after Hurricane Ike. Finally, make sure you have sufficient fuel on hand. If you are counting on a delivery and your fuel vendor fails to show up, you are out of luck.
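The burn rate you measure during an extended run turns directly into a fuel plan. Here is a minimal sketch of that arithmetic; the function names, the 25 percent reserve, and the sample figures are assumptions for illustration, so substitute the numbers from your own test.

```python
def generator_runtime_hours(fuel_on_hand: float, burn_rate_per_hour: float) -> float:
    """Hours of runtime from the fuel in the tank at the measured burn rate."""
    if burn_rate_per_hour <= 0:
        raise ValueError("burn_rate_per_hour must be positive")
    return fuel_on_hand / burn_rate_per_hour


def fuel_needed(target_hours: float, burn_rate_per_hour: float,
                reserve_fraction: float = 0.25) -> float:
    """Fuel required to run for target_hours, plus an assumed reserve."""
    return target_hours * burn_rate_per_hour * (1 + reserve_fraction)


if __name__ == "__main__":
    # Hypothetical figures: 500 gallons on site, 8 gallons per hour under load.
    hours = generator_runtime_hours(fuel_on_hand=500, burn_rate_per_hour=8)
    print(f"Tank lasts about {hours:.1f} hours")
    two_weeks = fuel_needed(target_hours=14 * 24, burn_rate_per_hour=8)
    print(f"Two weeks on generator needs about {two_weeks:.0f} gallons")
```

The gap between what the tank holds and what a multi-week outage needs is the amount you are trusting a vendor to deliver.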

Hot/warm site. If you have contracted with an outside company or plan to use your own facilities for a hot or warm site recovery, you should test these capabilities on a regular basis. Most companies that offer such services will allow you to do this and should be able to accommodate you, though there may be charges for such a service. If they don’t, you should question their ability to provide the service.

A standard test of the service involves cutting over to the recovery site and having staff on hand to process a set of sample transactions. The closer you can get to regular work volume, the more you will get out of your test.

Don’t forget the staffing element of all these plans. Your test should be done with personnel in place. Make sure everyone knows what to do and that different team members are sufficiently cross-trained.

Operations testing. The best way to test your disaster recovery plan technically is with real users doing real production. If possible, non-IT people should be part of your technical tests. Just bringing the server up and being able to log into it as an administrator isn’t a true test: Put real users on it and make sure it can handle them with no hidden complications.

Common issues include bandwidth and processor capabilities on backup servers, authentication and user rights, and outside connectivity. If you’re testing as a single administrator, you aren’t going to see some of these errors. Being local to the server (on the console) will make you miss most connectivity issues.

Some environments might not allow this kind of risk of downtime, so you might have to assemble a group of test users and a test data set. Make sure these test users are sufficiently heterogeneous (sales, accounting, field service), from different parts of the network, and using a diverse set of features.
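One way to keep a test group from being, say, five people from the same sales office is to pick it round-robin across department and network location. The sketch below assumes a hypothetical user list of (name, department, network segment) tuples; the function name and the data are made up for illustration.

```python
from collections import defaultdict
from itertools import zip_longest


def pick_test_group(users, size):
    """Pick up to `size` users, spreading picks across department/segment pairs.

    users -- iterable of (name, department, network_segment) tuples
    """
    buckets = defaultdict(list)
    for name, department, segment in users:
        buckets[(department, segment)].append(name)

    picked = []
    # Each pass takes one user from every bucket, so every
    # department/segment pair is represented before any pair repeats.
    for one_pass in zip_longest(*buckets.values()):
        for name in one_pass:
            if name is not None and len(picked) < size:
                picked.append(name)
    return picked


if __name__ == "__main__":
    staff = [
        ("ann", "sales", "hq"),
        ("bob", "sales", "hq"),
        ("cat", "accounting", "hq"),
        ("dan", "field service", "vpn"),
    ]
    print(pick_test_group(staff, 3))
```

This only covers who is in the group. Making sure they exercise a diverse set of features still needs a test script for each role.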

Testing Methodology
After you figure out what kind of test you want to do and when, it’s time to plan your test. Badly planned disaster recovery tests can turn into real disasters in a hurry when a downed system fails to come back. You will get a lot more out of your test if you follow these steps for successful disaster recovery testing:

Plan. Think about what could go wrong. In other words, have a disaster recovery plan for testing your disaster recovery plan, especially for mission-critical apps.

Write down your test plan. Account for any possibilities where it could go wrong, and note what you expect to get out of the test. Specify what counts as a successful test or an unsuccessful test. Usually there are multiple categories, and a test is not a simple failure or success.

Also, make sure you have a plan for how to return to normal operations. If you failed over to a backup server, how and when do you fail back to the production unit? The devil is in the details, and the more details you have covered, the more likely your test will run safely and successfully.
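Writing the plan down in a structured form forces you to state the success criteria and the fail-back steps before the test starts. The following is one possible shape for such a plan, not a standard format: the class names, the outcome categories, and the sample criteria are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Criterion:
    description: str
    critical: bool = False         # a failed critical criterion fails the test
    passed: Optional[bool] = None  # None until the test has been run


@dataclass
class DrillPlan:
    name: str
    rollback_steps: List[str]      # how you return to normal operations
    criteria: List[Criterion] = field(default_factory=list)

    def outcome(self) -> str:
        """Grade the test in more than two categories."""
        if not self.criteria or any(c.passed is None for c in self.criteria):
            return "incomplete"
        if any(c.critical and not c.passed for c in self.criteria):
            return "failed"
        if all(c.passed for c in self.criteria):
            return "passed"
        return "passed with issues"


if __name__ == "__main__":
    plan = DrillPlan(
        name="File server failover",
        rollback_steps=["Resync data to production", "Fail back", "Verify logons"],
        criteria=[
            Criterion("Backup server takes over", critical=True, passed=True),
            Criterion("Remote users can authenticate", passed=False),
        ],
    )
    print(plan.outcome())
```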

Notify. Make sure that you notify all potentially affected users of the system that you intend to test it and how it might affect their productivity. Most tests are done on weekends or late at night to minimize the impact of any downtime. Another benefit of this approach is that if something does go wrong, you’ve got some time to make it right before the mass of employees shows up at work.

Of course, in some environments, such as a hospital, there really is no “good” time for downtime. Also, some situations might call for doing midweek tests when vendor representatives are available or tech support can be expected. If your test goes longer than you expect or you encounter problems that will affect users, make sure you update them with progress reports and the expected return to normal operations.

Execute. Execute your test according to your plan and use your written disaster recovery plan to recover. Does the recovery track according to plan? Are steps left out or not well documented? Here is where institutional knowledge can be developed and put to paper so all can benefit. And that leads to the next step.

Record. Make sure someone is assigned to record the test and its results. If possible, have a report format to capture the results so that you won’t be dealing with someone’s unintelligible notes after the fact.

Documenting disaster recovery tests is one of the areas where most companies fall down. If you don’t have a record of what happened, how can you expect to learn from it afterwards? And that leads to the next step: the “lessons learned” meeting.
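A report format doesn't have to be elaborate. As a minimal sketch, the record below captures what was expected and what actually happened at each step and saves it as JSON; the field names and class names are assumptions, so adapt them to whatever your organization already tracks.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class StepResult:
    step: str
    expected: str
    actual: str
    ok: bool
    notes: str = ""


@dataclass
class DrillReport:
    test_name: str
    recorded_by: str
    started_at: str                      # e.g. "2010-01-16T22:00"
    steps: List[StepResult] = field(default_factory=list)

    def record(self, step, expected, actual, ok, notes=""):
        """Log one step as it happens, not from memory afterwards."""
        self.steps.append(StepResult(step, expected, actual, ok, notes))

    def problems(self):
        """Steps that did not go to plan: the agenda for the review meeting."""
        return [s for s in self.steps if not s.ok]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    report = DrillReport("UPS pull-the-plug test", "J. Smith", "2010-01-16T22:00")
    report.record("Cut utility power", "UPS carries load", "UPS carried load", True)
    report.record("Generator start", "Online within 60 s", "Online after 140 s",
                  False, notes="Starter battery weak")
    print(report.to_json())
    print("Items for review:", [s.step for s in report.problems()])
```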

Review and improve. Now you dissect the test, see what went right, what went wrong, and how to do better next time. Closing the loop on your test in this manner is the best way to get future benefit from your tests.

Make sure you assign specific action items, then review those items to make sure they got taken care of. Hold the review fairly soon after the test so details are still fresh in everyone’s mind. This cycle of review and improvement is the final step in making sure your disaster recovery plan evolves into the future.

Testing, Testing
Obviously your testing can involve endless variations with different configurations, systems, applications and organization types. But in the end, the concept is the same: Test, test, then test some more.

The more you test, the more likely you will be ready when the real disaster occurs. And as we all know, it’s not a matter of if, but when that hurricane hits or that system crashes or whatever disaster Fate has in store for your organization.


