8 real-world examples that satisfy availability and scalability needs
Several years ago, I was an administrator of a large, 24 × 7 IBM shop. I was the guy everybody called in the middle of the night when the system went down. I lost many nights of sleep while trying to diagnose a problem for a wide-awake systems operator. I needed a fail-safe solution that would let me sleep through the night and fix the problem when I got to work the next morning.
Sleeping better is a major reason many of the administrators I interviewed for this article love NT clusters: These administrators can rest peacefully knowing clusters are protecting their systems from failure. The administrators I interviewed said their solution worked as advertised: They can recover their systems with minimal intervention, and most end users are unaware of any failure.
The good news is that NT clustering solutions are available today. I'm not talking about theory; I have case studies to share. The bad news is that some companies implement clustering because NT alone isn't stable enough for their environments. In some cases, clustering is like blue block--maximum blue screen of death protection. Smear on some General Protection Fault (GPF) 30 to protect yourself from those nasty blue rays.
This article looks at how several organizations use NT-based clustering to satisfy availability and scalability needs. The administrators I interviewed for this article used different criteria to evaluate clustering solutions for their unique situation. Based on these interviews, I've concluded that no single solution solves every availability and scalability problem--the market demands a variety of solutions. I hope I've included enough real-life scenarios so you can see an NT clustering solution that might meet your organization's needs.
BlueCross/BlueShield of Oregon
As part of a large, nationwide insurance provider, BlueCross/BlueShield of Oregon combines Citrix WinFrame software with Cubix RemoteServ/IS hardware to create a clustering solution that supports its remote users. BlueCross/BlueShield supports its healthcare facilities, employees, and partners through this connection so users can access centralized billing and patient information. In this configuration, WinFrame provides remote access and Cubix adds availability and load balancing.
Cubix provides clustering within one cabinet, which reduces the need for computer room floor space. BlueCross/BlueShield configured each of two Cubix cabinets with two dual-processor Pentium Pro systems and one single-processor Pentium Pro system. The Cubix hardware is currently configured to let as many as 15 users dial in simultaneously. However, BlueCross/BlueShield can expand the Cubix system well beyond this configuration. BlueCross/BlueShield plans to replicate this solution as the need arises. The Cubix hardware keeps the cluster load-balanced, and in the event of a failure, the system redirects a user to an available node.
"The management software is really slick. You can instantly see errors reported to the administrator's desktop," said systems administrator David Blackledge. "The Cubix hardware is really easy to maintain, and administrators can support the system from their desk."
Although Blackledge recommends this solution to anyone looking for solid remote access, he would like to see a more flexible licensing model. In a WinFrame environment, the licenses are tied to the processors. If one CPU dies, your licenses might not transfer to the surviving node. Certain licenses can float between processors; however, you must have a minimum of five licenses per motherboard.
Books.com
Books.com claims to be the first Web-based bookstore to offer online purchasing of books, videos, and music. The company went live in 1992, and Books.com now serves more than 60,000 user sessions per day from its clustered Web site.
To update information for its online store, Books.com developers make changes to an NT file server at one location. The company uses a T1 connection and Octopus DataStar software to replicate changes to three separate nodes in another location. To cluster and load-balance the three nodes, Books.com uses Convoy Cluster Software on HP NetServers (for information about Convoy Cluster Software, see Jonathan L. Cragle, "Load Balancing Web Servers," page 68).
Figure 1 shows the Books.com cluster network model. This configuration lets each node handle one-third of the user load. When a user visits the Web site, the system combines files from the NT file server with data in Sybase and Oracle databases to dynamically generate the information the user's Web browser displays.
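A three-way split like this one can be pictured as a hash over client addresses. The sketch below is an assumption for illustration only, not Convoy Cluster Software's actual algorithm; the node names and the `pick_node` and `load_per_node` helpers are hypothetical. It shows each live node receiving roughly a third of the clients, and the surviving nodes absorbing the load when one node drops out.

```python
import hashlib
from collections import Counter
from typing import Sequence


def pick_node(client_ip: str, live_nodes: Sequence[str]) -> str:
    """Choose a node for a client by hashing the client's address.

    Hypothetical helper: illustrates hash-based load spreading in general,
    not the behavior of any specific clustering product.
    """
    if not live_nodes:
        raise RuntimeError("no live nodes left in the cluster")
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(live_nodes)
    # Sort so every caller agrees on the ordering of the node list.
    return sorted(live_nodes)[index]


def load_per_node(clients: Sequence[str], live_nodes: Sequence[str]) -> Counter:
    """Count how many clients land on each live node."""
    return Counter(pick_node(ip, live_nodes) for ip in clients)


if __name__ == "__main__":
    clients = [f"10.0.{i // 256}.{i % 256}" for i in range(3000)]
    print(load_per_node(clients, ["node-a", "node-b", "node-c"]))
    # Simulate node-b failing: the remaining two nodes share all clients.
    print(load_per_node(clients, ["node-a", "node-c"]))
```

A plain modulo hash reassigns many clients whenever membership changes, which is acceptable for a sketch but is one reason real products use more careful schemes.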
"The most common problem is blue screens on NT Server," said administrator Dennis Anderson. When a Convoy node fails, the other nodes pick up the load, and end users are unaware of any disruption in service. Fortunately, Anderson can Telnet into the Sentry Ambassador remote power-up box and restart the system if a node fails during the night. After the system reboots, the recovered node can rejoin the cluster. Octopus DataStar then uses its journal of changes to synchronize the node. Most of the objects the system replicates to the nodes are small HTML files, so the recovered node usually resynchronizes within 30 seconds after rebooting. The Convoy node then rejoins the cluster about 10 seconds later.
Books.com required load balancing and failover in its clustering solution, so it had to eliminate all but a few solutions from consideration. The company downloaded a demonstration of Convoy Cluster Software from Valence Research's Web site. The demonstration helped seal Books.com's decision. "Convoy Cluster Software performs really well," said Anderson. "You don't really notice it, but it works."
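The journal-based catch-up that brings a rebooted Books.com node back in step can be pictured as replaying numbered changes. The sketch below is only an assumption about how such a replay might look; the `Change` record, the `resync` function, and the idea of a per-replica sequence number are hypothetical and are not taken from Octopus DataStar.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Change:
    seq: int                  # position in the journal (assumed to increase over time)
    path: str                 # file the change applies to
    content: Optional[bytes]  # new file content, or None for a deletion


def resync(replica: Dict[str, bytes], last_applied: int,
           journal: Iterable[Change]) -> int:
    """Apply every journal entry newer than last_applied to the replica.

    Returns the sequence number of the last change applied, so the caller
    can record how far the replica has caught up.
    """
    for change in sorted(journal, key=lambda c: c.seq):
        if change.seq <= last_applied:
            continue  # the replica already has this change
        if change.content is None:
            replica.pop(change.path, None)
        else:
            replica[change.path] = change.content
        last_applied = change.seq
    return last_applied
```

Because only the entries written while the node was down are replayed, a journal of small HTML files would be quick to apply, which is consistent with the short resynchronization times Anderson reports.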
Despite this solution's success, the company has discovered one annoying problem: Convoy can't detect when Internet Information Server (IIS) fails. As a result, when IIS fails, the entire cluster fails. Anderson uses Ipswitch's WhatsUp to work around this problem. Now if IIS fails, WhatsUp stops that node, and Convoy removes it from the cluster and alerts Anderson via pager. Anderson hopes Convoy will detect this type of problem in future versions.
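The workaround amounts to an application-level health check: probe the Web server, and pull the node once the probe keeps failing. The sketch below is a minimal illustration of that idea, not WhatsUp's implementation; the `http_probe` and `check_node` names and the threshold of three consecutive failures are assumptions.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Tuple


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the Web server answers the URL with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, or a garbled reply.
        return False


def check_node(probe: Callable[[], bool], failures: int,
               threshold: int = 3) -> Tuple[int, bool]:
    """Run one probe and update the consecutive-failure count.

    Returns (new failure count, whether the node should be pulled).
    The threshold is an assumed value that avoids reacting to one blip.
    """
    healthy = probe()
    failures = 0 if healthy else failures + 1
    return failures, failures >= threshold
```

A monitoring loop would call `check_node(lambda: http_probe(url), failures)` on a timer and, when the second value comes back true, stop the node and page the administrator, leaving the cluster software to drop the node from rotation.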
"NT is not a very robust Web serving platform," said Anderson. "NT has a lot of maturing to do." Specifically, Anderson would like Microsoft to focus on reliability.
Celanese
Celanese runs a continuous flow (24 × 7) process to manufacture 1000-pound to 1200-pound (4' × 4' × 4') bales of cellulose acetate tow for producing cigarette filters and suit liners. If the process stops for 1 minute, the bales harden and require a massive cleanup and restart process that can take days.
In the past, Celanese employees had to continually measure the manufacturing equipment (e.g., programmable logic controllers, scales, presses, extrusion devices, sensors, dryers) to determine whether individual bales met the company's strict quality standards. Now, the company has automated the process using Gensym's G2, an NT-based software solution. G2 continually receives measurements from the manufacturing equipment and uses its built-in expert system software to determine the quality of the bales. G2 records quality measurements into its SQL Server database and adjusts equipment as necessary. At specified intervals, the SAP production planning system queries the database for acceptable bales and records them into an Oracle NT database.
So why did Celanese decide to use an NT cluster solution? "We felt like that's where everything was headed. Doing the same thing with UNIX would cost $500,000," said administrator Jim Fraser. "The advantages of having a common platform for our business and manufacturing users are too numerous for the accountants to ignore. I'm not afraid to use NT. I supported HP-UX for 7 years, and NT is just as stable as HP."
When Celanese automated its manufacturing process, the company had only one requirement: absolutely no downtime. This requirement let Celanese narrow its search for a clustering solution to one vendor--Marathon Technologies. Marathon's Endurance 4000 software and hardware solution is truly fault tolerant. Both the data and compute nodes are completely redundant. As a result, client machines don't need to restart following a system failure, and the G2 software can't fail. Future versions of the product will support symmetric multiprocessing (SMP) nodes. Endurance 4000 ties four systems (Celanese uses four 200MHz servers) together to create one cluster. Figure 2, page 127, shows the Celanese cluster network model.
Celanese selected Marathon's Endurance 4000 because it's the only solution available with sub-second failover time. In fact, the system doesn't really fail over; it just disconnects the failed redundant node.
Celanese has experienced two hardware failures, and the Marathon cluster worked both times without a hitch. In less than 5 milliseconds, the surviving node took over the load. "Marathon Endurance 4000 is a wonderful solution," said administrator Jim Fraser. "Marathon works hard for its customers."
First Union Capital Markets Group
Corporate email is the mission-critical application of the '90s. Take your mail server offline for a few minutes, and watch your Help desk light up like a Christmas tree. First Union Capital Markets Group in North Carolina uses Microsoft Cluster Server (MSCS) to keep its Exchange and file and print servers running 24 × 7. Previously, the company had to use twice as many clusters to do the same amount of work it does today. "In the old days, I had Compaq standby clusters. Now I use active/active clusters, and both nodes are working," said Sid Vyas, First Union CIO. "I'm saving a huge amount of money on the hardware."
Vyas recommends a single-vendor clustering solution. During the testing phase, First Union unsuccessfully tried to mix and match hardware. Vyas also recommends a fibre channel connection over a SCSI-switching solution for increased disk throughput, a 50 percent faster failover time, and a longer cable run between nodes (500 meters vs. 25 feet).
Vyas chose Compaq ProLiant servers to run MSCS because Compaq was the only company to certify a fibre channel connection. This configuration lets First Union place its servers and storage in separate buildings and keep nodes in separate data centers on different floors of the building. Distributing the computing resources increases the fault tolerance in case of a disaster.