• subscribe
August 22, 2007 12:00 AM

Debunking the Myths of Temporary Objects, Part 2

Concrete examples help clarify the distinction between object types
SQL Server Pro
InstantDoc ID #96479
Downloads
96479.zip

Last month, in "Debunking the Myths of Temporary Objects, Part 1" (InstantDoc ID 96288), I began a discussion about temporary objects, which include temporary tables, table variables, and table expressions. I gave you an overview of each object type, described its characteristics, and provided recommendations for when to use each. This month, in the interest of making last month's discussion more clear, I dig deeper by giving some concrete examples that will clarify that discussion. Before you read this month's article, be sure that you've read last month's article in preparation: The information in that article is an important prerequisite for understanding this article's examples.

Relying on Distribution Statistics
In this article's examples, I'll use a table called Orders with 1,000,000 rows. To create the Orders table in the tempdb database (you can also create it in a user database of your choice), run the code in Web Listing 1 (http://www.sqlmag.com, InstantDoc ID 96478), and populate it with sample data. The fact that I'm creating the table in tempdb has nothing to do with temporary objects. I'm using tempdb simply because I want the table to be cleared upon SQL Server restart. I often use tempdb for test purposes.

My first example involves the need to aggregate order data by the hour, calculating the total quantity and number of orders per hour, and to submit multiple queries against the aggregated data. This scenario precludes table expressions, leaving you with the choice of whether to use a temporary table or a table variable.

Temporary table approach. Run the code in Web Listing 2 to create and populate a temporary table called #OrderTotalsByHour with the hourly aggregates. With four years worth of data, you should have more than 35,000 rows in the temporary table. Web Listing 3 shows the set of queries that you need to submit against #OrderTotalsByHour. Note that in queries 3 and 4, I specify a constant in the filter that represents the highest qty column value from the temporary table (in my case, 2880). The reason I use that value is that there should be a very small number of rows with an equal-to or greater-than value (i.e., highly selective), and a very large number of rows with smaller values in the qty column (i.e., low selectivity). I've used random values to populate the Orders table, so before you run these queries and analyze their plans, first obtain the maximum hourly quantity value from the temporary table and replace the constant 2880 in queries 3 and 4 with the value you get from the following query:

SELECT MAX(qty) FROM
  #OrderTotalsByHour;

Web Figure 1 shows the execution plans I got for the four queries in Web Listing 3. If you want to analyze the execution plans, enable the Include Actual Execution Plan option in SQL Server Management Studio, and run the code in Web Listing 3.

Notice that all queries get ideal plans. Observe the estimated number of rows and the actual number of rows reported in the plans. The selectivity estimations that the optimizer makes are very close to reality. Queries 1 and 2 have trivial plans because in both cases there's a clustered index on the filtered column. In this case, distribution statistics aren't very important because the optimizer would figure out the ideal plan without the need to estimate selectivity anyway. However, with queries 3 and 4, selectivity estimations are very important. I can see that when the filter was highly selective (query 3), the optimizer chose to use the nonclustered index created on the qty column. When the filter had low selectivity (query 4), the optimizer instead opted for a full clustered index scan. The reason the optimizer made good choices is because it could make reasonable selectivity estimations based on the availability of distribution statistics.

In my case, query 3 ended up doing only five logical reads (i.e., seek in the non-clustered index, plus lookup). Query 4 involved 1006 logical reads— the number of pages at the leaf of the clustered index (all data pages). Had the optimizer chosen to use a full clustered index scan with query 3, it would cost 1006 reads. Had the optimizer chosen to use the nonclustered index with query 4, it would have resulted in more than 35,000 lookup operations (each of which is a seek in the clustered index), which would have amounted to substantially more I/O than performing a full clustered index scan. In short, the optimizer made good choices.

Table variable approach. Next, run the code in Web Listing 4 to test the table variable approach and observe the execution plans that you see in Web Figure 2. (Be sure to replace the constant in queries 3 and 4.) Notice that with all four queries, the estimated number of rows returned from querying the table variable is always 1. The reason should be clear by now: The optimizer doesn't have distribution statistics, so it can't make reasonable selectivity estimation. Queries 1 and 2, as I mentioned earlier, have trivial plans, so you get good plans regardless of whether distribution statistics are available. Queries 3 and 4 have different ideal plans for different selectivity scenarios. The optimizer can't make good estimations, so it uses a conservative approach for a range filter—namely, a full scan of the clustered index. With query 3, the choice of execution plan is unfortunate because you end up paying a lot more I/O than what you potentially could. With query 4, you can consider yourself lucky that the conservative approach the optimizer chose happens to fit this case. Sometimes, guesses end up being right.

Using SELECT INTO
Last month, I mentioned that you can use a SELECT INTO statement to create and populate a temporary table, but not a table variable. SELECT INTO is a bulk operation that is minimally logged when the database recovery model isn't set to FULL. Because the tempdb database's recovery model is SIMPLE, you get minimal logging when you use SELECT INTO. "Minimal logging" means that only the addresses of the extents allocated during the operation are recorded in the log to support a rollback, if necessary. The address of an extent is six bytes in size, as opposed to the actual extent size, which is 64KB. When populating a temporary table with SELECT INTO, you get much less logging compared with using INSERT INTO. With table variables, you don't have the option to use SELECT INTO, so you'll end up with more logging, resulting in slower population of the table variable, compared with populating a temporary table with SELECT INTO. Ready for a tangible example?

The code in Web Listing 1 creates a function called fn_nums, which returns a table result with numbers in the range of 1 through the requested number of rows. I'll use it to populate temporary objects with 1,000,000 rows. To check the amount of logging involved, you can query the fn_dblog() function, as follows. (Make sure you're in tempdb while doing so.)

SELECT COUNT(*) FROM fn_
  dblog(NULL, NULL);

Let's start with a table variable. Since there's no option to use SELECT INTO, the code in Web Listing 5 uses an INSERT statement to populate a table variable with 1,000,000 rows. On my test machine, this code ran for 17 seconds and produced about 1,000,000 log records. Now, let's try a temporary table, with the code in Web Listing 6. On my test machine, this code ran under 1 second and produced a little more than 1,500 log records, which explains the performance difference.



ARTICLE TOOLS

Comments
    There are no comments to display. Be the first one!
You must log on before posting a comment.

Are you a new visitor? Register Here
  • SP1?
    I know there is a SP1 for SQL 2008 R2 available....and there is a "feature pack" as well... ...
  • SQL database mirroring
    I have SQL Server 2008 R2 Enterprise 64bit on Windows 2008 R2 Enterprise 64bit.  Each SQL Server has...
  • Dell Compellent Disk Drive
    Does anybody has experience with Dell Compellent Disk Drive? Basically, this system manages all disk...
  • Sql server performance tuning
    I need to find a tool that help me to optimize sql server,queries,improve the performance and solve ...