Last month, in "Debunking the Myths of Temporary Objects, Part 1" (InstantDoc
ID 96288), I began a discussion about temporary objects, which include temporary
tables, table variables, and table expressions. I gave you an overview of each
object type, described its characteristics, and provided recommendations for
when to use each. This month, in the interest of making last month's discussion
more clear, I dig deeper by giving some concrete examples that will clarify
that discussion. Before you read this month's article, be sure that you've read
last month's article in preparation: The information in that article is an important
prerequisite for understanding this article's examples.
Relying on Distribution Statistics
In this article's examples, I'll use a table called Orders with 1,000,000 rows.
To create the Orders table in the tempdb database (you can also create it in
a user database of your choice), run the code in Web
Listing 1 (http://www.sqlmag.com, InstantDoc ID 96478), and populate it
with sample data. The fact that I'm creating the table in tempdb has nothing
to do with temporary objects. I'm using tempdb simply because I want the table
to be cleared upon SQL Server restart. I often use tempdb for test purposes.
My first example involves the need to aggregate order data by the hour, calculating
the total quantity and number of orders per hour, and to submit multiple queries against
the aggregated data. This scenario precludes table expressions, leaving you with the
choice of whether to use a temporary table or a table variable.
Temporary table approach. Run the code in Web
Listing 2 to create and populate a temporary table called #OrderTotalsByHour
with the hourly aggregates. With four years worth of data, you should have more
than 35,000 rows in the temporary table. Web
Listing 3 shows the set of queries that you need to submit against #OrderTotalsByHour.
Note that in queries 3 and 4, I specify a constant in the filter that represents
the highest qty column value from the temporary table (in my case, 2880). The
reason I use that value is that there should be a very small number of rows
with an equal-to or greater-than value (i.e., highly selective), and a very
large number of rows with smaller values in the qty column (i.e., low selectivity).
I've used random values to populate the Orders table, so before you run these
queries and analyze their plans, first obtain the maximum hourly quantity value
from the temporary table and replace the constant 2880 in queries 3 and 4 with
the value you get from the following query:
SELECT MAX(qty) FROM
#OrderTotalsByHour;
Web Figure 1 shows the execution plans
I got for the four queries in Web Listing
3. If you want to analyze the execution plans, enable the Include Actual
Execution Plan option in SQL Server Management Studio, and run the code in Web
Listing 3.
Notice that all queries get ideal plans.
Observe the estimated number of rows
and the actual number of rows reported
in the plans. The selectivity estimations
that the optimizer makes are very close to
reality. Queries 1 and 2 have trivial plans
because in both cases there's a clustered
index on the filtered column. In this case,
distribution statistics aren't very important
because the optimizer would figure out
the ideal plan without the need to estimate selectivity anyway. However, with
queries 3 and 4, selectivity estimations are
very important. I can see that when the
filter was highly selective (query 3), the
optimizer chose to use the nonclustered
index created on the qty column. When
the filter had low selectivity (query 4), the
optimizer instead opted for a full clustered
index scan. The reason the optimizer
made good choices is because it could
make reasonable selectivity estimations
based on the availability of distribution
statistics.
In my case, query 3 ended up doing only five logical reads (i.e., seek in the
non-clustered index, plus lookup). Query 4 involved 1006 logical reads—
the number of pages at the leaf of the clustered index (all data pages). Had
the optimizer chosen to use a full clustered index scan with query 3, it would
cost 1006 reads. Had the optimizer chosen to use the nonclustered index with
query 4, it would have resulted in more than 35,000 lookup operations (each
of which is a seek in the clustered index), which would have amounted to substantially
more I/O than performing a full clustered index scan. In short, the optimizer
made good choices.
Table variable approach. Next, run the code in Web
Listing 4 to test the table variable approach and observe the execution
plans that you see in Web Figure 2. (Be
sure to replace the constant in queries 3 and 4.) Notice that with all four
queries, the estimated number of rows returned from querying the table variable
is always 1. The reason should be clear by now: The optimizer doesn't have distribution
statistics, so it can't make reasonable selectivity estimation. Queries 1 and
2, as I mentioned earlier, have trivial plans, so you get good plans regardless
of whether distribution statistics are available. Queries 3 and 4 have different
ideal plans for different selectivity scenarios. The optimizer can't make good
estimations, so it uses a conservative approach for a range filter—namely,
a full scan of the clustered index. With query 3, the choice of execution plan
is unfortunate because you end up paying a lot more I/O than what you potentially
could. With query 4, you can consider yourself lucky that the conservative approach
the optimizer chose happens to fit this case. Sometimes, guesses end up being
right.
Using SELECT INTO
Last month, I mentioned that you can use a SELECT INTO statement to create and
populate a temporary table, but not a table variable. SELECT INTO is a bulk
operation that is minimally logged when the database recovery model isn't set
to FULL. Because the tempdb database's recovery model is SIMPLE, you get minimal
logging when you use SELECT INTO. "Minimal logging" means that only the addresses
of the extents allocated during the operation are recorded in the log to support
a rollback, if necessary. The address of an extent is six bytes in size, as
opposed to the actual extent size, which is 64KB. When populating a temporary
table with SELECT INTO, you get much less logging compared with using INSERT
INTO. With table variables, you don't have the option to use SELECT INTO, so
you'll end up with more logging, resulting in slower population of the table
variable, compared with populating a temporary table with SELECT INTO. Ready
for a tangible example?
The code in Web Listing 1 creates a function
called fn_nums, which returns a table result with numbers in the range of 1
through the requested number of rows. I'll use it to populate temporary objects
with 1,000,000 rows. To check the amount of logging involved, you can query
the fn_dblog() function, as follows. (Make sure you're in tempdb while doing
so.)
SELECT COUNT(*) FROM fn_
dblog(NULL, NULL);
Let's start with a table variable. Since there's no option to use SELECT INTO,
the code in Web Listing 5 uses
an INSERT statement to populate a table variable with 1,000,000 rows. On my
test machine, this code ran for 17 seconds and produced about 1,000,000 log
records. Now, let's try a temporary table, with the code in Web
Listing 6. On my test machine, this code ran under 1 second and produced
a little more than 1,500 log records, which explains the performance difference.