DOWNLOAD THE CODE:
Download the Code 41681.zip

Different industries load different kinds of data. Telecommunications companies load high volumes of Call Detail Registration (CDR) records, financial trading systems load trades or stock-ticker information, and Web hosters load IIS log files to support click-stream information. All real-life projects include some form of high-volume data loading. In my work with the Microsoft SQL Server Development Customer Advisory Team, I use some basic best practices to help customers in all kinds of industries implement high-volume data loading.

Maybe you need initial data from an outside system such as a mainframe or UNIX text files. Or maybe you must deal with a continuous stream of data—such as a stock ticker, news updates, or airline arrival and departure information—or a queue that receives messages that you need to load into a database table. Let's review a few basics of data loading and look at how to choose the best data-loading option. Then, let's see how to optimize performance during bulk loading in SQL Server 2000.

Data-Loading Options
When you need to load a large amount of data into SQL Server, you must first consider what kind of loading tool to use. When you load data into SQL Server from an external data source, you can choose between using INSERT statements or using a bulk loader such as BULK INSERT or bulk copy program (BCP). Alternatively, you can use Data Transformation Services (DTS) or XML Bulk Load, but these options, which use either the ODBC BCP API or the OLE DB IRowsetFastLoad interface, are subject to the same rules as BULK INSERT and BCP. I focus on BULK INSERT and BCP in this article because they both come with every edition of SQL Server, including Microsoft SQL Server Desktop Engine (MSDE).

What's Wrong with INSERT? Nothing, but...
In most cases, the preferred mechanism for adding new data into a table is the INSERT statement. However, high-volume data loading presents a performance challenge that you can best address by using streaming techniques rather than the traditional method of sending separate statements for each row or set of rows.

When you send an INSERT statement, you effectively send over the wire metadata and data values for each statement or batch of statements. In streaming, you send the metadata once, followed by a continuous flow (stream) of data values. With streaming, the server can set up the receiving internal structures (such as buffers) one time, according to the data types for each column, and continue to accept values in the structures while the stream transfers rows (data values). With an INSERT statement, the server must repeat the initialization process for each statement. Although INSERT statements can reuse execution plans to lower the cost of setting up structures inside the server, a streaming architecture always outperforms a statement-based data-loading architecture.

BULK INSERT and BCP both offer the ability to stream data into SQL Server, so which bulk loader should you use? Your choice depends on two factors: the number of concurrent sessions you need and whether you must perform remote bulk loads (i.e., loads made from a remote machine over the network into a SQL Server database). The main task of a bulk loader is to turn a data source (e.g., a flat file) into a rowset so that the bulk loader can stream the data into SQL Server. The file parser reads the input file in 64KB chunks by using an asynchronous I/O read pattern. Then, the file parser hands the data to the rowset producer, which creates the rowset and streams it out.

At this point, BULK INSERT and BCP differ. BULK INSERT is an internal T-SQL command that runs inside the SQL Server process and address space. BCP doesn't run inside the SQL Server address space or process; it's a separate process that talks to SQL Server through an interprocess communications (IPC) mechanism. Therefore, BCP can run anywhere on the server or the client. As Figure 1, page 37, shows, BULK INSERT sends the rowset directly to the query processor. BCP, however, uses the Tabular Data Stream (TDS) application-level protocol to send the rowset over the wire. SQL Server receives the TDS rowset and turns it into an OLE DB rowset for the query processor.

You can use the SQL Trace tool to confirm this rowset conversion. For example, in Query Analyzer or through OSQL, run the following T-SQL BULK INSERT statement:

BULK INSERT [sales] FROM 
  'f:\data\sales.char'
WITH (batchsize = 10000, 
  datafiletype = 'char', tablock)
   Prev. page   [1] 2 3     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

Confused! Point 4 says to ensure the table has NO indexes to enable minimal logging, yet then this sentence appears: "When you're loading into a table that contains a clustered index, determining whether you're using minimal logging is more difficult." If point 4 is right, it can't use minimal logging. But the results seem to imply minimal logging is possible with a clustered index, thus proving point 4 wrong.

This is a VITAL point to clarify. I'm currently dropping clustered indexes to get minimal logging, but I think it's hurting performance to do so. Should I leave the clustered indexes in place?

Vince Iacoboni

It's a shame that you dont show the charts that appear in the print article.

Peter Ellis

Article Rating 4 out of 5

The BCP command in the article omits the table lock load hint. It should read as : BCP.EXE testdb.dbo.sales IN f:\data\sales.char -S (local)\dev -T -c -b 10000 -h "tablock"

Peter Ellis

Article Rating 4 out of 5