Clear the way for large-scale snapshot replication

Editor's Note: This article is the first of a three-part series showing how to optimize snapshot replication. It shows how to evaluate snapshot replication's performance and lays out the steps for optimizing the performance. The second and third parts of the series will detail the full implementation of these optimization steps.

If you're an application or database developer, you're probably familiar with the excessive length of time that utilities such as bulk copy program (bcp), Data Transformation Services (DTS), and snapshot replication need to copy a massive amount of data to multiple locations. For example, a snapshot of a database that holds 500MB of data defined with key constraints and indexes might take 2 hours during the production day or 1.5 hours at night to transfer between two 4-way, high-end database servers with 1GB of RAM and RAID 5. This performance is unacceptable in a 24*7 load-balanced server farm or a distributed enterprise data warehouse whose available off-peak time is too short to accommodate the transfer of multiple massive data sets to all servers. So how can you speed up your data distribution?

I choose to optimize the replication of a large snapshot of data. Snapshot replication copies a set of data out of a source SQL Server at the moment when the replication is invoked, even while other queries or updates exist on the same set of data. The snapshot replication then copies this set of data, immediately or at a later time, into one or multiple target SQL Servers. The target servers see the data as it existed in the source server at the moment of copy until the next snapshot replication refreshes the entire set of data. DTS copies a set of data out of an ODBC (or OLE DB) source and into another ODBC (or OLE DB) target immediately. The source and the target can be two completely different ODBC data providers. Bcp copies one SQL Server table or view out to a native- or character-formatted file, or it copies one native- or character-formatted file into a SQL Server table or view.

Snapshot replication actually uses bcp and takes place in two stages. The first stage copies (bcps out) a snapshot of a selected data source at a scheduled time. The second stage distributes (bcps in) the snapshot to one or multiple target servers on separate schedules. Although the DTS Copy SQL Server Objects task also uses bcp by supporting the BULK INSERT task, this task can't schedule copy out and copy in separately, and it lacks the efficiency of sharing one snapshot for multiple targets. A DTS job can copy to only one target. However, bcp is a raw command-line utility. It lacks a user-friendly interface and the capability of copying from one SQL Server to another SQL Server directly. Each bcp execution can copy only one table at a time. Snapshot replication and DTS are user-friendly and functionally powerful alternatives to the bcp utility. (To find out why snapshot replication might be more appropriate than transactional replication for your project, see the sidebar "Why Not Transactional Replication?" page 46.

How Snapshot Replication Works
To lay the groundwork for the optimization process, let's briefly review the basics of snapshot replication, which replicates data in two stages. In the first stage, the Snapshot Agent copies out selected tables (called published articles) from the source database (the publisher) to a distribution area (the distributor) as a set of .bcp files by running the snapshot.exe command-line utility. Note that in SQL Server 2000, you can also define an article based on a view, stored procedure, or user-defined function (UDF).

In the second stage, the Distribution Agent copies the set of .bcp files into the target databases (the subscribers) by executing the distrib.exe command-line utility. The snapshot.exe and distrib.exe utilities use bcp to copy the articles one at a time. The Snapshot Agent copies the data out to .bcp files by executing a simple SELECT statement that might contain a column filter and a WHERE clause (to filter rows). Then, the Snapshot Agent writes the result set to .bcp files. When the Distribution Agent copies (inserts) .bcp files into database tables, it has to delete the existing rows, possibly log the deletions and insertions, and maintain the table indexes.

Before you begin this two-stage process for replicating your data, you need to establish a framework for it. Let's look at the default steps for setting up snapshot replication.

Defining and configuring the publications and subscriptions. You can easily define and configure snapshot replication by using Enterprise Manager or system stored procedures. Typically, you create a new publication by first selecting the articles you want to include, then setting properties for the publication and the articles. Next, you use Enterprise Manager to add subscriptions, one at a time, that subscribe to all articles of the publication. Alternatively, you can use the sp_addsubscription stored procedure to select only those publication articles you want to subscribe to or all the articles. You call the procedure by setting the @article parameter to each article name, one at a time, or by specifying all for all articles. For example, in a Query Analyzer window, you issue the call

EXEC sp_addsubscription @publication = N'pubs',
@article = N'authors', @subscriber = N'CGIS1',
@destination_db = N'pubs', @sync_type = 
N'automatic', @update_mode = N'read only',
@offloadagent = 0, @dts_package_location = N'distributor'
GO

to subscribe only to the authors article for subscriber CGIS1. If you set @article = N'all', you subscribe to all articles. The subscription can be a push or a pull subscription. A push subscription is created at the publisher and is suitable for a tightly integrated, connected, and centrally administrated environment, such as a development server (the publisher) connected to a production server farm (the subscribers) in a data center. (The Distribution Agent in this case runs at the distributor.) A pull subscription is created at the subscriber and is best for a loosely coupled, connectionless, more autonomous relationship, such as that between a traveling salesperson's laptop and the office server. (In this case, you off-load the Distribution Agent to run at the subscriber.) The optimization technique I describe in this article applies to both pull and push subscriptions.

   Prev. page   [1] 2 3     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.