DOWNLOAD THE CODE:
Download the Code 24024.zip

Just as with merge join, hash join requires only one pass through each table. If you're used to measuring query performance by looking at the logical I/O required, you might be surprised. One pass through each table requires very little I/O, but that doesn't mean queries that use hash join are as fast as queries that use merge join. You need to consider the computational effort of building the hash buckets from the build input as well as the memory required to store the hash buckets. SQL Server generally chooses hash join as the join strategy for queries that have no useful indexes on either join column.

Let's look at an example of when SQL Server chooses hashing. The code in Listing 1 makes a copy of the Orders and Order Details tables from the Northwind database but doesn't build any indexes on the tables. The plan for the SELECT query in Listing 2 shows that SQL Server performs a table scan on each table and uses a hash join to join the tables together. If you turn on STATISTICS IO and execute Listing 2's query, you'll see that the number of reads isn't high. The output shows only 21 reads for the o2 table and 10 reads for the od2 table, which correspond to the number of pages in each table. SQL Server reads each page only once, and for large tables, the number of reads might be very large. However, most of the work for processing a hash join occurs in building the hash buckets, and the cost of executing the query is high because of memory resources required to store the hash buckets.

Although using hash joins is a big improvement over using nested-loop joins without any good indexes, Microsoft intended hash joins primarily for use in ad hoc queries submitted while an end user is running an application. If you're testing an application and you find that certain queries always use hash joins, take that as a hint that you need to do more tuning. You might consider building indexes on the join columns in one or both of the tables, for example. Hash joins are good for times when you can't tune adequately because the actual queries are unknown during application development.

Finding the Best Indexes
Now that we've looked at the different join algorithms and discussed how indexes can help the performance of JOIN queries, you might wonder whether a particular method exists for figuring out which indexes to put on your tables for optimum join performance. Unfortunately, I haven't found one. Knowing how SQL Server processes joins internally can help you get started choosing acceptable indexes, or you can use the Index Tuning Wizard, but there's no substitute for testing. Try out as many different combinations of possibly useful indexes on the tables being joined as you can.

For example, you can start with an index on your primary keys (which you should already have if you defined your primary keys as a constraint on the table). You'll have to decide whether to make the index on the primary key column (or columns) clustered or nonclustered. I won't discuss that topic in detail now, but you can try each one and compare performance. Sometimes a clustered index performs better, and sometimes a nonclustered index does. Before you make the final decision about whether the primary key should have a clustered or a nonclustered index, you need to test all the possible queries that will use the table, including all your data-modification operations. You'll probably also want to consider indexes on your foreign key columns and columns used for search arguments, grouping, and sorting.

As you can see, configuring indexes to get the best performance is a complex task. Next month, I'll look at some specific examples that show how slight modifications in the choice of indexes can make a big difference in the performance of JOIN queries.

End of Article

Prev. page     1 [2]     next page -->



You must log on before posting a comment.

If you don't have a username & password, please register now.

 
 

ADS BY GOOGLE