• subscribe
November 26, 2002 12:00 AM

Divide and Conquer

Architectural redesign + transactional replication = smashing performance
SQL Server Pro
InstantDoc ID #27123

Implementing transactional replication. We chose transactional replication for this project for several reasons. Because the reporting queries were read-only, we didn't need to implement bidirectional replication. In addition, transactional replication creates much less overhead on the publisher than merge replication (because merge replication is trigger-based, and triggers claim significant overhead). Also, we didn't want the added DBA workload that merge replication generates. In implementing transactional replication, we took great care to coordinate the replication with the application modifications. We especially needed to make sure that all the necessary tables and stored procedures were replicated. This effort required a lot of coordination and good communication between the development and DBA groups.

Initially, we implemented the replication before making the application modifications so that we could test the replication's performance and configuration. During testing, we determined that the replication's performance was unacceptable. The default value of the polling interval for both the Log Reader Agent and the Distribution Agent was 10 seconds, so a transaction could take as long as 20 seconds to move from publisher to subscriber. That replication delay was unacceptable because the users were accustomed to realtime or near realtime reports. By modifying the polling intervals for both the Log Reader Agent and the Distribution Agent to 2 seconds, we shortened that time to a maximum of 4 seconds, an acceptable performance level.

We discovered one important architectural design problem while we were incorporating transactional replication into the change-control process. By default, we put all the replicated objects into one publication. Although this process made administration easier, it also made change control difficult. Call-center business rules (and hence the reports) often change, so we had to alter the corresponding stored procedures and schemas. Each time we made a change that affected a replicated object, we had to break and recreate the publication. The solution was to make each object a separate publication, so that now when we change an object, all we have to do is modify a specific publication and approximately 100 subscriptions. We don't have to create a snapshot for the entire publication set, only the affected objects. Granted, we have more publications to manage, but administration is much easier in the long run.

Testing and tuning the solution. Every time we moved a new reporting query to the replicated servers, we tested it thoroughly. Because a report might access 10 to 20 tables and call several stored procedures, you might easily move a report that's missing an underlying table or stored procedure. To avoid this possibility, we completed a full application regression test (by testing all available reporting options) every time we made a change to the replication model. In addition, because we were running only reporting queries on these servers, we tuned the indexes only for reporting queries and not OLTP. Using this process, we could remove some indexes and modify others for better performance. We still use this process. To keep the data and indexes in an optimum state, we also run DBCC CHECKDB and DBCC REINDEX every week.

Maintaining the distributed system. Maintaining the system became slightly more difficult as we added servers. We needed to administer a distributor and two or more subscribers in addition to the main server. To properly and efficiently handle change requests and monitor the overall health and availability of the replication publications, we found we needed to allocate at least 30 percent of two senior DBAs' time to administering the replication architecture. Part of their time is devoted to monitoring discrepancy thresholds. At the time we implemented the redesign, we established discrepancy thresholds for each table publication. Now, SQL Server compares subscriber table row counts against the row counts in the main server, and when the established discrepancy threshold is exceeded, SQL Server sends an urgent notification to the database team.

During the testing process, we decided that because of business rules, we needed to replicate user accounts and passwords to the subscribers. Transactional replication can't replicate this user information, so we had to develop our own method of replicating it by using Data Transformation Services (DTS) packages. Through the SQL Server Agent, SQL Server executes the DTS packages to update the login and user information every 5 minutes. The most typical changes to user information are changing passwords, changing roles (an application-security feature to group users for different types of data access), and adding or deactivating users.

Early on, we decided that because the reports running on the day-old server required the entire 40GB of the call center's data, running snapshot replication on that server wasn't feasible because of the time that snapshot replication must hold a table lock on the replicated table. In addition, because of primary key problems, transactional replication also wasn't feasible. Instead, we decided to maintain the day-old server through a process of backup and restore. This process provides several advantages to the data center such as higher performance than snapshot replication, and we would have performed a backup anyway. An additional advantage of this process is that we validate the backups immediately after they're created by restoring them.

Maintaining the day-old server is a delicate balance of automated and manual tasks. The most crucial objective is to ensure that the day-old server is current up to midnight of the previous day. The secondary objective is to complete the backup-and-restore process in time to put the system back online before the next business day. The process, which requires night operators and on-call DBAs to ensure successful completion, follows:

  1. The system operators reload the databases on the day-old server each night from the most current production backup.
  2. The system operators restore production transaction-log backups on the day-old server after the database restores are completed to bring the database as up-to-date as possible.
  3. If the day-old server isn't online before the start of the next business day, the DBAs point the day-old reports back to the main server until the day-old server is again online.
  4. If the day-old server can't be brought online within a reasonable amount of time (approximately 2 hours) after business hours begin, the day-old reports remain pointed to the main server until the day-old server can be refreshed during the normal schedule that evening.

Results
After implementing the replicated systems and slowly migrating more reporting queries off the main server, we've seen a dramatic decrease in CPU utilization as well as an increase in performance. Figure 3 shows the drop in CPU utilization compared to the spike of the 6 previous months. (We implemented the replication in the latter part of 2000.) Although SQL Server replicates a large amount of data very quickly, the CPU resources used on the distributor are fairly low because transactional replication is highly efficient. The reduction in CPU utilization occurs even as we continually add users to the server. Figure 4 displays user counts.

The resources that the reporting servers use are also within acceptable limits. We constantly monitor these resources and set an emergency plan in motion if they begin to rise to unacceptable levels. We simply add more subscribers and update the pointer table to point reports to those servers, then distribute the load further without the users ever knowing this activity is occurring—and without making additional changes to the applications. Another way of measuring our success is in the response time of certain queries. Figure 5 shows an example of one of those queries. As you can see, this query posted a 500 percent performance gain after we implemented replication.

Although, in general, this solution is successful, it carries a few limitations. Some problems with replication require rebuilding a publication and resyncing the data using a fresh snapshot. Also, because of locking problems, we can't always create new snapshots during normal business hours while the system is busy.

Innovative and Scalable
A successful project is always the result of several contributing factors. We can't emphasize enough the importance of good coordination between the development and DBA teams. You also need excellent system-performance metrics, long-term system-resource monitoring, and an application equipped for response-time measurements. Defining metrics early in the process is crucial so that you can assess your degree of success. Most importantly for our project, redirecting the data flow solved our major problem and opened extra avenues of improvement. Implementing transactional replication to off-load reporting queries achieved two goals. It reduced the CPU utilization on our main server, thereby increasing CPU performance, and it left us free to add more reporting servers as we added new users. This scalable, robust solution should last for years to come.



ARTICLE TOOLS

Comments
    There are no comments to display. Be the first one!
You must log on before posting a comment.

Are you a new visitor? Register Here