| Executive Summary: If you have a lot of data to import into or export out of a SQL Server table, here are some tips that can save you a lot of time and aggravation. |
So here I was, writing some SQL Server 2005 Reporting Services (SSRS) reports for a user when a coworker asked if I could process some data for him. I’m the SQL Server guy at work, so I immediately volunteered to help.
The data arrived several days later on a 1TB external hard drive that contained 20,000 flat files taking up 400GB of space. The data consisted of latitude and longitude coordinates and three extra informational fields. I had no idea how many records the raw data represented because not all the latitude and longitude coordinates were to be included. For latitude and longitude coordinates to be included, their extra fields had to match certain criteria.
The first step was to reduce the record size to the absolute bare minimum. I discovered that the latitude and longitude coordinates were in 1/120-degree increments, so I rounded those floating-point values to small integers by multiplying them by 120 and truncating the remainder. The remaining fields were also rounded after multiplication, thus reducing my final record size to 9 bytes.
Unfortunately, there was no way of directly importing the data into SQL Server because the flat files were in a proprietary data format. So, I wrote a Visual Basic .NET 2.0 application that iterated through each file and created a pipe-delimited output file that I could subsequently import into SQL Server using the BULK INSERT statement. I chose to bulk insert the records because submitting them through ADO.NET would be much slower.
The final piece of the puzzle was to define a clustered index on the table consisting of the latitude and longitude fields. I hoped this would give me instant access to all related data once lookups began. I knew that inserting unsorted records into a table with a defined clustered index would be slow, but I knew of no way to presort the records, so I went for it.
The import operation started off well. A half million records were being imported in a single shot every 10 seconds. So, I let the import operation run a few hours while I attended to some other tasks. When I came back, I discovered that my upload speed had drastically decreased. As my table of imported values grew, the cost of inserting more records into the clustered index became prohibitive to the point where my ADO.NET connection was timing out after 10 minutes—a far cry from the 10-second imports at the beginning. Clearly, I couldn’t import the data into a table with a defined clustered index.
Plan B was simple: I would remove the clustered index and run the import operation again. My thinking was that applying the index after importing the data would take much less time. After three days of importing, my data had been migrated from thousands of flat files into a single SQL Server table. It consisted of 16 billion rows and took up 250GB of space. The final step was to apply the clustered index. I wrote a script and set it running with fingers crossed.
About 28 hours later, SQL Server had sucked up the 750GB of disk space left and began rolling back the index change. After another 24 hours, the rollback had completed and I was back to square one. As it turns out, applying a clustered index takes a lot of disk space. Plus, the entire operation is logged. Not only did my database balloon to 500GB but so did my log file.
I considered several options at this point. The first was to define a partitioned table that would make subsequent indexing easier. Another was to physically split the database between multiple tables. Then, while doing some research on BULK INSERT, I came across an interesting detail: When bulk inserting data into a table with a defined clustered index, you can greatly speed things up by supplying already sorted data. All you have to do is use BULK INSERT’s ORDER parameter to specify an order hint telling SQL Server that the incoming data is ordered like the index. So, I came up with a new plan: Export the entire table of 16 billion rows to disk, sort the data, and use BULK INSERT with the ORDER parameter to re-import the sorted data into a table whose clustered index is already defined.
To export the data from SQL Server to a flat file, I tried using BCP because it’s fast. Unfortunately, BCP maintains a count of the records exported, which it prints in the DOS shell. I say “unfortunately” because BCP’s counter variable is a 4-byte integer that stops on row 2,147,483,647. With only an eighth of my database exported, I had to look for other options.
I ended up writing a SQL CLR procedure in C# that exported my table, row by row, to a flat file on the server. I defined my counter variable as a LONG data type to avoid the problem I encountered with BCP. I ran this procedure, and 19 hours later I had a 141GB flat file in binary format ready to be sorted.
On the Internet, I discovered Ordinal Technologies’ NSort (www.ordinal.com), which I could use for sorting. It promises to sort a terabyte of data in 33 minutes, given sufficient God-like hardware. NSort was a bit finicky to set up, but with help from its developer, I managed to sort my data file in 3.5 hours. The developer later told me that had I used a larger memory parameter, it could have taken only 2.5 hours.