Which SQL Server data pump gets data in and out faster?
Every SQL Server application manager needs to move a large amount of data into or out of a SQL Server database at least once, so SQL Server has several tools for bulk data-transfer operations. In this article, I compare two familiar SQL Server utility programsbulk copy program (bcp) and Data Transformation Services (DTS)in a common scenario: moving data to and from an ASCII text file. I compare the performance and capabilities of these utilities to the T-SQL BULK INSERT statement, which, according to Microsoft, is the fastest bulk copy method (see "Data Import/Export Architecture" in SQL Server 2000 Books Online BOL for more information).
Bulk Copy Architecture
SQL Server has several underlying APIs. Some of themthe original DB-Library API, OLE DB Provider for SQL Server, and the ODBC driverinclude a set of bulk copy functions. In SQL Server 6.5 and earlier, the bcp utility uses the DB-Library API. In SQL Server 7.0, Microsoft introduced a new version of bcp that uses the ODBC bulk copy API instead to support new data types, which DB-Library doesn't support. (Microsoft has stabilized DB-Library at its current level of functionality, so the company won't be adding any new features.) Ultimately, SQL Server's storage engine handles calls from each of these APIs. Figure 1, page 38, shows an overview of SQL Server 2000's bulk copy architecture.
Within the storage engine, the T-SQL BULK INSERT statement, bcp, and the DTS Bulk Insert task all execute the same code. The T-SQL BULK INSERT statement can potentially perform faster than the others because BULK INSERT executes completely within the SQL Server process. Bcp and DTS, utility programs that execute in their own processes, must bear the overhead of interprocess communications to pass data from the source text file to SQL Server. When you run bcp or DTS on a client computer rather than on the SQL Server system, the need to copy the data across a network connection adds significant overhead and further slows the bulk copy operation.
In my test results, the potential advantage of the T-SQL BULK INSERT statement didn't always translate into faster loading. According to Microsoft Product Support Services (PSS), the advantage is significant with very large tables (Microsoft routinely tests with 50 million-row tables), when you load multiple tables into a database in parallel, and when you use file groups to segregate onto different disk volumes the tables that you plan to load in parallel. Other factors also influence the overall performance of a bulk copy operation: Floating-point fields are faster than character fields, and files created to store data in SQL Server's native, internal, binary data format import more quickly than files in external ASCII character format.
Creating Test Files
I created the database and defined all the objects for my test by using a T-SQL script in SQL Server Query Analyzer. Then, I backed up this empty database to create a reproducible starting point for my tests. Using a test data generator for a table with 608-byte rows (which consist of a combination of varchar and integer data types), I generated 5000-row, 100,000-row, and 1 million-row tables. I used the DTS Import/Export Wizard to export each table to text files in Comma Separated Values (CSV) format and fixed-field format. These text files became my data standard for DTS testing.
Because bcp doesn't directly create text files in the same CSV format that DTS can generate (i.e., bcp doesn't offer a simple way to automatically create text data files with comma-delimited fields and quoted text fields), I created a second set of text files for bcp and BULK INSERT testing. I used DTS to load the CSV-format data files and bcp to export them to both bcp's variable-width-field format and a fixed-field format. I used the variable-width-field format files for bcp and BULK INSERT testing.
My test system consisted of SQL Server 2000 Enterprise Edition running on Windows 2000 Advanced Server with Service Pack 1 (SP1). My test platform was a Compaq ProLiant 7000 with two 500MHz Intel Pentium II processors and 512MB of DRAM. My system included a Compaq Smart Array 3100ES Controller. I put the SQL Server data on a 12-drive RAID 0 array, and I put log files on a separate 3-drive RAID 0 array.
Using DTS
DTS and its GUI made bulk copying between SQL Server and a text file easy. SQL Server's DTS wizard uses the Transform Data task to simplify moving data. The DTS Package Designer, which provides DTS features that the wizard doesn't expose, lets you create complex procedures.
The DTS Transform Data task lets you modify the data-field format during the import or export process. DTS also exposes the functionality of the more efficient T-SQL BULK INSERT statement in the DTS Bulk Insert task. I used the DTS wizard for most of my DTS testing, engaging the Bulk Insert task only to verify that its performance was equivalent to using the T-SQL BULK INSERT statement in Query Analyzer. The DTS wizard makes exporting a table easy; you simply select the source database and table, click Text File output, provide a filename, and choose whether you want a fixed-field or comma-delimited output file.
Importing the data from the comma-delimited file was just as easy as exporting because the text-file format matched the table format. When the first row in the text file contains the column names of the destination table (a check-box selection), the column order in the text file becomes unimportant. Under those conditions, the import will tolerate missing columns that allow NULLs and columns defined with a default constraint. Importing data from a fixed-field format file, which you create by using DTS to export the table in a fixed-field format, was just slightly more complex. DTS detects column positions that contain data preceded by a blank and places default column breaks in those positions. My data contained some such column positions that weren't the beginning of new data columns and, thus, by default, defined more data columns in the input file than the output table had. I double-clicked the extraneous column breaks to remove them, and the import proceeded without error.
The DTS Import/Export Wizard enables the Use fast load option by default. This option, available only when your system uses the SQL Server OLE DB Provider (SQLOLEDB), calls that provider's IRowsetFastLoad API to more efficiently handle simple bulk copy operations.
DTS has many functions that bcp doesn't provide, including facilities for moving data between SQL Server databases, between SQL Server and non-SQL Server databases, and between two ODBC data sources. In addition, DTS contains functions for modifying and transforming data fields during the bulk copy operation.
Using bcp
You can choose between two modes of bcp operation: interactive and noninteractive. In interactive mode, a series of prompts asks you to describe the text file's format to bcp. The bcp operation can save this information in a text-based format file for future use. In addition, bcp supports several standard file formats: standard ASCII character data files, Unicode character data files, SQL Server native data formats, and a combination of the last two. When using one of these default data types, bcp doesn't prompt for information; it uses a set of default values and optionally creates a corresponding format file. Similarly, when you specify a format file that describes the text data file's layout, you can run bcp in noninteractive mode.
The format file describes the layout of the data file you want to import or export and defines the correspondence between data file fields and SQL Server table columns. The format file gives you much power and flexibility in the data file's physical format and collation. (For more information about the data file's layout, see the sidebar "Format File Overview," page 44.)
After you have a working format file, bcp is easy to use. I created a series of simple .bat files, one for each of my tests, because bcp runs as a command-line utility. On the command line, I specified the database and table names, as well as IN or OUT to designate the copy direction. I also supplied the format-file name, the data-file name, and a user ID and password to authenticate access to the SQL Server table. The bcp operation has options that I didn't use, including the ability to specify SQL Server bulk copy hints.
Using T-SQL BULK INSERT
The BULK INSERT statement exposes bcp's functionality in a T-SQL statement. BULK INSERT moves data from a file to a SQL Server table. However, BULK INSERT can't export data from a SQL Server table to a file. The T-SQL statement relies on a bcp-style format file to determine the data file's structure. In my BULK INSERT testing, I used the same data and format files that I employed to test bcp. With these inputs, I found that using the BULK INSERT statement from Query Analyzer was no more difficult than using bcp. Then, I created a T-SQL script to execute the BULK INSERT statement and calculate how long it took to run.
T-SQL BULK INSERT supports most of the same options and the format file that bcp does. BULK INSERT lacks bcp's ability to create an error file containing the rows that failed to insert into the table. The T-SQL statement isn't compatible with data files that earlier versions of bcp created; these data files might contain field formats that SQL Server 2000 doesn't support. BULK INSERT also lacks bcp's ability to use regional formats when bulk copying currency, date, and time data. Finally, BULK INSERT lacks bcp's ability to set the network packet size for bulk copy operations to remote SQL Server systems.
Test Results
Because the length of time any particular import or export operation takes to complete depends so much on SQL Server's hardware configuration, I've chosen to represent the times in the test result graphs as a percentage of the longest import or export time (i.e., import test results are relative to the length of time DTS took to import 1 million ANSI-padded records, and export test results are relative to the length of time DTS took to export 1 million ANSI-padded records). I recorded the relative time the various bulk copy operations required to export 5000, 100,000, and 1 million records; Graph 1, page 42, shows the export comparison for 1 million records. I also recorded the relative time the various bulk copy operations required to import 5000, 100,000, and 1 million records; Graph 2, page 42, shows the import comparison for 1 million records. (The graphs shown are representative of all the results.) Because BOL states that T-SQL BULK INSERT is the fastest of the bulk copy methods, I expected BULK INSERT to win the race in my tests. But bcp and BULK INSERT finished in a virtual dead heatand DTS wasn't far behind.
Some result trends are obvious (e.g., the more bytes SQL Server must process, the longer it takes). What surprised me was how much longer SQL Server took to import the fixed-length data filein which each varchar field is padded with blanks to the full 50-byte column lengththan it took to import the variable-length data file. Although the number of records, columns, and keys SQL Server needed to insert remained constant, the number of bytes it processed correlates strongly with the length of time SQL Server required to complete the task.
For example, using the 1 million-row database, the DTS fixed-format data file was 2.7 times the size of the CSV file and took 2.5 times as long to import. By averaging the bcp and T-SQL BULK INSERT import times together, I found that the fixed-format data file was 3 times the size of the variable-length file and took 2.7 times as long to import.