• subscribe
November 26, 2002 12:00 AM

Divide and Conquer

Architectural redesign + transactional replication = smashing performance
SQL Server Pro
InstantDoc ID #27123

DBAs know that the first step in determining the long-term viability and capacity of any system is to implement performance-monitoring metrics. Mindful of that imperative, several years ago we devised a program to keep tabs on performance changes in a call center's processing system that runs Windows NT and SQL Server 7.0. Initially, we found that the Performance Monitor showed us current performance metrics but didn't give us insight into long-term trends and capacity changes. So to collect data over time, we set up long-term monitoring by taking snapshots every half hour of performance metrics such as CPU and I/O utilization, numbers of users and processes, order processes, and transactions. Then we stored the metrics in a SQL Server database. The year we began our monitoring program, we collected data for only a few consecutive months. This exercise showed us that data gleaned from such a short time period produced only short-term information—too few data points to determine a trend. However, the following year, when we monitored for longer time frames, the data we gathered revealed significant trends. Over a year's time, we observed an alarming elevation in CPU utilization that we attributed to an increase in the center's custom application features and the addition of many system users.

First Attempts
To stem the rise in CPU utilization, we repeatedly upgraded the hardware until SQL Server 7.0 was running on the largest and fastest CPU available. Next, we turned to a comprehensive program of tuning the database, hardware, and SQL statements. Although performance tuning became an ongoing process, demands on the CPU ultimately outpaced our optimization efforts. In July of 2000, we determined that the current database system would likely reach full performance capacity by year's end. Figure 1 shows actual CPU utilization from July through December 2000. Rewriting the applications would take much longer than a year, and we weren't certain that modified applications would solve the problem even then. We had to take more drastic measures.

Off-Loading Reporting Functions
Transactional replication seemed to be the answer because our goal was to reduce the load on the publisher and because the reporting queries didn't perform database updates. However, we decided that replicating the entire database wasn't feasible. Because the database exceeded 40GB, the amount of time that the Snapshot Agent would need to create a snapshot was prohibitive and would require more hardware on the distributor and subscribers. In addition, because all tables participating in transactional replication must contain a primary key, we'd need to modify the database.

The final solution hinged on the application's custom stored procedures. During our performance tuning, we painstakingly captured and analyzed SQL traces and metric data to identify stored procedures and ad hoc SQL code in the call center's application that were taking the longest to run and using the most CPU capacity. We assigned the stored procedures and SQL statements to two categories that each required a different optimization technique: short-running online transaction processing (OLTP) queries and long-running, read-only reporting queries. The read-only queries were perfect for off-loading (and therefore reducing the drain on the CPU). We considered diverting the read-only queries to a data warehouse or data mart but rejected that option because they were an integral part of the application and needed to run in near realtime. However, we realized that we could replicate the portions of the database that these queries used and divert them to secondary servers. This solution not only increased the available system capacity in the short term by letting us off-load work to subscribers, but it also proved to be an innovative and scalable solution for the future. In the latter half of 2000, we began to off-load some of the reporting functions to replicated servers.

Project Steps
Once we determined the solution, designing the project became our first task. The project unfolded in the following five stages.

Redesigning the architecture. During the initial system redesign, we identified three important criteria: the modified architecture should be dynamic, fault-tolerant, and scalable. To make the call center's custom application more dynamic, at first we programmed it to determine at startup time which server would run a particular reporting function. However, in that scenario, if a reporting server failed after application startup, the reporting queries needed to be rerouted so that they would still run. And if the server location wasn't set during startup, the only way to correctly determine where to reroute the reporting functions was to restart the application. That approach wasn't dynamic enough, so we placed a lookup table on the main server. This table contains a list of reporting queries and the name of the server on which these reporting queries run. Now, when a user submits a report, the application queries the lookup table and runs the report on the server that the entry in the lookup table names. Although accessing a lookup table costs overhead, the performance gains more than offset the drain.

Scalability was the next criterion we addressed. We divided the reporting functions into three categories according to their business purposes, then assigned each category to a corresponding server. The categories specify how current the data must be—realtime, near realtime, or day-old. Realtime reporting queries must run on the main server because the nature of the queries depends on the instantaneous state of the database. Near realtime functions can tolerate the 8-second delay that replication requires, so we assigned this group to a second server, which is about 20 percent of the main database size. Day-old reporting functions involve long-term data, so we decided to keep an image of yesterday's data on a third server. This architectural redesign reduced the amount of data that we needed to replicate and provided lots of scalability by letting us spread out functions among multiple systems.

One advantage of the redesign is that all reports that access the near realtime server use the same data. Therefore, we can spread multiple reports across multiple subscribers. Although the main server is a Microsoft Cluster Server (MSCS), we haven't needed to cluster the reporting servers. If a reporting server fails or a replication error occurs, we can quickly and easily modify the application lookup table to switch subscribers or point back to the main server for a short time. Before the redesign, we had to change subscribers or send the reports to the main server manually. Using multiple subscribers for the same reports gives us the added benefit of fault tolerance. Because transactional replication incurs the same overhead on the publisher regardless of the number of subscribers, we can implement two reporting servers for much less overhead and cost than one clustered reporting server. The application can continue to run, even when one reporting server fails. Figure 2 shows the architectural redesign.

Modifying the application. Because we developed and maintained the application in-house, modifying it was feasible. However, modification was still an enormous task because we had to analyze 180 reports to determine which reporting category they belonged in: realtime, near realtime, or day-old. The application had to be able to determine the server on which each of the replicated functions should be run by reading the lookup table, connecting to the appropriate server, moving the data to that server, then disconnecting when the reporting query had finished. As we continue to enhance the application, we modify most reporting queries to use this lookup table, even though they might initially still point to the main server. Using this lookup table gives us greater flexibility in routing the queries, the ability to move them at a later time, and scalability for future reporting requirements.



ARTICLE TOOLS

Comments
    There are no comments to display. Be the first one!
You must log on before posting a comment.

Are you a new visitor? Register Here
  • SP1?
    I know there is a SP1 for SQL 2008 R2 available....and there is a "feature pack" as well... ...
  • SQL database mirroring
    I have SQL Server 2008 R2 Enterprise 64bit on Windows 2008 R2 Enterprise 64bit.  Each SQL Server has...
  • Dell Compellent Disk Drive
    Does anybody has experience with Dell Compellent Disk Drive? Basically, this system manages all disk...
  • Sql server performance tuning
    I need to find a tool that help me to optimize sql server,queries,improve the performance and solve ...