
Select Your Administration Tools
Although systems administration tools aren't a component of the Windows Datacenter Program, they're essential to a high-availability architecture. These tools will help you manage all your infrastructure's components. Such tools typically use SNMP and agents to monitor the condition of your site, trap errors, generate alerts, carry out preprogrammed responses to specific conditions, identify dependencies between components, and perform root-cause analysis of dangerous trends. HP's OpenView and Computer Associates' (CA's) Unicenter TNG are popular examples of systems administration tools.

Large servers are often shipped with preinstalled management utilities (e.g., HP's Toptools, Compaq's Compaq Insight Manager) that let you perform a detailed investigation of your hardware while remaining fully online. On a remote server, remote control cards can recycle power, monitor a boot sequence, and provide remote keyboard/video/mouse (KVM) capabilities.

Organize Support
After you design your high-availability infrastructure, you need to align internal and third-party support with your availability requirements. One function of the Windows Datacenter Program is to formalize such support. To do so, Microsoft, the OEM partners, and certified application developers perform extensive testing and change management. OEMs must offer SLAs that outline time-to-repair commitments. (Depending on the contract, OEMs can be available to answer support questions within 30 minutes and can be on site within 6 hours.) You must establish and enforce SLAs with your external and internal support teams.

The Joint Support Queue, staffed by both Microsoft and OEM personnel, provides a well-defined support-escalation path. First, the customer calls first-tier OEM support. The first tier can call the second tier, which can escalate to the Joint Support Queue. The Joint Support Queue determines whether the problem is related to hardware, the application, or the OS. If the problem is hardware-related, the call goes to the OEM's hardware support team. If the problem is application-related, the Joint Support Queue contacts the certified application developer's Help desk. If the problem is OS-related, the call goes to Microsoft Critical Problem Resolution teams, then to Microsoft Quick Fix Engineering. Of course, the problem might be resolved at any point along this path. Some OEMs (e.g., HP, IBM, Compaq) offer consulting and support beyond the Joint Support Queue and Windows Datacenter Program's minimum requirements.

Standards groups in the UK started codifying best practices for systems management in the IT Infrastructure Library (ITIL) in the late 1980s. (For more information about ITIL, go to http://www.itil.co.uk/index.htm.) Most successful high-availability sites today use some or all of these practices. Several training programs and publications are available to introduce you and your staff to the world of high-availability computing and IT Service Management (ITSM). (For more information, as well as a glimpse at the world of enterprise and high-availability computing from the perspective of big-systems management, go to http://www.itsmf.net.)

The Microsoft Operations Framework (MOF) builds upon the ITIL and ITSM and—according to Microsoft—is better suited to the rapidly changing needs of Windows environments. The MOF emphasizes iterative processes for risk assessment, configuration management, and adoption. For more information about the MOF, see the "MOF Executive Overview" white paper at http://www.microsoft.com/trainingandservices/default.asp?PageID=enterprise&Subsite=whitepapers&PageCall=mof#MOFoverview.

Monitor System State
To monitor your high-availability infrastructure, you can leverage best practices from ITSM and MOF and use your systems management tools. Create a configuration management database (CMDB) that includes information about configuration items (CIs). A CI is any element of your infrastructure to which you can apply (or from which you can derive) configuration or status information, or whose failure can make the system unavailable. The CMDB also needs to record dependencies between CIs: a problem's root cause isn't always obvious, and dependency trees can help you trace a failure back to its source. A comprehensive CMDB contains information about every component involved in keeping your infrastructure available, including configuration settings, firmware versions, and build or service pack numbers. Create this record during the infrastructure's initial installation, then update it after every change. The CMDB will help you troubleshoot your system and bring failed systems back into service rapidly (thereby decreasing MTTR).
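The idea of CIs with recorded dependencies can be sketched in a few lines of Python. This is only an illustration, not a real CMDB product: the `ConfigItem` class, the CI names, and the attribute values are all hypothetical, and the sketch assumes the dependency graph has no cycles. Given a set of CIs that monitoring reports as failed, it keeps only those that have no failed dependency beneath them, which are the likeliest places to start a root-cause investigation.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigItem:
    """One CI: a name, its recorded attributes, and the CIs it depends on."""
    name: str
    attributes: dict = field(default_factory=dict)
    depends_on: list = field(default_factory=list)


def all_dependencies(cmdb, name, seen=None):
    """Collect every CI that `name` depends on, directly or indirectly."""
    seen = set() if seen is None else seen
    for dep in cmdb[name].depends_on:
        if dep not in seen:
            seen.add(dep)
            all_dependencies(cmdb, dep, seen)
    return seen


def root_cause_candidates(cmdb, failed):
    """Return failed CIs that have no failed CI anywhere beneath them."""
    failed = set(failed)
    return sorted(ci for ci in failed
                  if not (all_dependencies(cmdb, ci) & failed))


# Hypothetical inventory: names and values are made up for illustration.
cmdb = {
    "web-app": ConfigItem("web-app", {"build": "1.4"},
                          ["sql-cluster", "load-balancer"]),
    "sql-cluster": ConfigItem("sql-cluster", {"service_pack": "SP1"},
                              ["san-array"]),
    "san-array": ConfigItem("san-array", {"firmware": "2.10"}),
    "load-balancer": ConfigItem("load-balancer", {"firmware": "5.2"}),
}

# Three CIs report failures, but only the storage array has no failed
# dependency of its own, so it is the first place to look.
print(root_cause_candidates(cmdb, ["web-app", "sql-cluster", "san-array"]))
```

A flat dictionary keyed by CI name keeps the sketch short; a production CMDB would also need change history and ownership data, which are omitted here.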

Microsoft and third parties offer tools to help you create your CMDB. To get baseline data about hardware, software, and detailed system configuration, use your systems management tools and the utilities in the Microsoft Windows 2000 Server Resource Kit. To detect changes in a Datacenter configuration, run the Datacenter Config Comparison utility (cfgcmp.exe)—a command-line tool in the Datacenter CD-ROM's \Support\Tools directory. For stateful clusters (i.e., Cluster service with two, three, or four nodes), create an initial cluster log, which documents whether the cluster starts and runs correctly. (You and your vendor must debug any errors in the initial cluster log before accepting the installation as complete.) Run Network Diagnostics (netdiag.exe)—a resource kit tool—to ensure that no network problems exist. Also, to ensure that no errors or warnings are occurring on boot, be sure to save and review your event logs.
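To make the baseline-and-compare idea concrete, here is a minimal Python sketch that diffs two configuration snapshots. It does not reproduce the behavior or output format of cfgcmp.exe or any resource kit tool; the `compare_snapshots` function, the setting names, and the values are assumptions chosen for illustration, and each snapshot is assumed to be a simple mapping of setting name to value.

```python
def compare_snapshots(baseline, current):
    """Compare two {setting: value} snapshots and report the differences."""
    added = {k: current[k] for k in current.keys() - baseline.keys()}
    removed = {k: baseline[k] for k in baseline.keys() - current.keys()}
    changed = {
        k: (baseline[k], current[k])
        for k in baseline.keys() & current.keys()
        if baseline[k] != current[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots: one taken at installation, one taken later.
baseline = {"service_pack": "SP1", "nic_driver": "4.1", "bios": "A05"}
current = {"service_pack": "SP2", "nic_driver": "4.1", "hba_firmware": "1.3"}

report = compare_snapshots(baseline, current)
for kind, entries in report.items():
    print(kind, entries)
```

Any entry in the report that has no matching change record is a candidate for investigation, because it means the system drifted from its documented state.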

You also need to define the change-management roles within your enterprise. Outline the method by which specific teams create and submit requests for change (RFCs). You might assign teams to ITSM functions such as Cost Management, Build and Test, Customer Management, and Change Management. Representatives from these groups can submit an RFC. The other teams would then assess the RFC to determine its effect on each ITSM function. Create a Change Advisory Board (CAB) whose responsibility is to determine how any RFC will affect availability, capacity, and adherence to SLAs. Make sure that you've established formal procedures for implementing changes and recording new or changed CIs in your CMDB. Also, ensure that the CMDB is available for root-cause analysis of any failures. As appropriate, create new RFCs to address design changes following a failure.

Troubleshooting is a vital step on the path to recovery. The quicker you can detect and fix a problem (i.e., the lower your MTTR), the higher your availability will be. To provide redundant "safe boot" on failure, place parallel Win2K installations on all servers. (Although you can boot Win2K in safe mode, a parallel installation of Win2K provides another way to repair a system and return it to service.) Create, document, and practice procedures for handling a blue screen, a Dr. Watson message, a hung server, and hung processes. Ensure that support personnel are familiar with these procedures.
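The link between MTTR and availability can be shown with the commonly used steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The article itself doesn't give this formula or any failure figures, so the formula is introduced here as an assumption and the numbers below are hypothetical.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system that fails about once every 1,000 hours.
# Shortening the repair time raises availability without changing
# how often the system fails.
for mttr in (8, 4, 1):
    print(f"MTTR {mttr} h -> availability {availability(1000, mttr):.4%}")
```

The sketch shows why practiced recovery procedures matter: with the failure rate held constant, every reduction in repair time translates directly into higher availability.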

Call to Action
Do you require Windows applications that are scalable and available 99.9 percent of the time? If so, Datacenter might be the solution for you. Remember, however, that high availability isn't a product feature. You'll achieve three or more nines only as a result of meticulous design choices and strict adherence to processes that keep systems up.
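To put the "nines" in perspective, the following Python sketch converts an availability percentage into an annual downtime budget. It assumes a 365-day year (8,760 hours) and counts all downtime, planned or unplanned, against the budget; the function name is hypothetical. At 99.9 percent, the budget works out to about 8.76 hours per year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def annual_downtime_hours(availability_percent):
    """Hours per year a system may be down at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each additional nine cuts the allowable downtime by a factor of ten, which is why the design and process discipline described in this article becomes more demanding at every step.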

The Windows Datacenter Program lets you put some necessary high-availability tasks—such as Change and Configuration Management (CCM) and testing of certified hardware, OSs, and applications—into the hands of Microsoft and its OEM partners. But you'll still need to understand the myriad other aspects of designing, implementing, and supporting a high-availability infrastructure. The Windows Datacenter Program and high-availability computing will represent a substantial departure from the way you've managed Microsoft systems in the past. Despite the challenges, the rewards will be tangible and immediate.






Reader Comments

Oh no! HP doesn't do datacenter. If you think installing DC on an 8 way server just to satisfy the microsoft's dc programme rules, yes, HP does this very well. The truth is; Unisys ES7000 is the FIRST and ONLY intel based machine that can run W2K DC on 32 CPUs and 64Gb RAM (the figures that you love to mention about).

Yavuz Guceri

 
 
