DOWNLOAD THE CODE:
Download the Code 24024.zip

A little organization goes a long way

In "Indexes and Joins," March 2002, InstantDoc ID 23731, I explained two of the techniques that SQL Server can use to process join operations. Nested-loop joins, probably the most common type of join, use the algorithm that most people think about when writing join operations. That is, for each qualifying row in one table, SQL Server will find all the rows in another table that match that row on the join column (or columns). Merge joins are a particularly efficient join method that requires only one pass through each input, but SQL Server can use it only when both data inputs are already sorted by the join column.

Now let's look at the third algorithm—hashing—that SQL Server can use to process your join operations. Remember that in most cases, letting the SQL Server query optimizer determine the type of join to use is the best practice. But when you're familiar with the ways SQL Server can process joins, you can understand what you're seeing when you examine a query plan.

Hash Joins
Hash joins, like merge joins, were introduced in SQL Server 7.0. SQL Server can use this join technique when no useful index exists on the join column in either table. (This lack of indexes also implies that the data isn't in a useful sorted order, because you need an index on a table to ensure that it's already internally sorted.) Hashing, a well-understood algorithm in computer science, is commonly used for searching algorithms on many kinds of systems. I present a simplified definition of hashing in the following paragraphs.

In "Indexes and Joins," I showed that if the data is already sorted, a merge join is a highly efficient method for finding matching rows. If the data has no predefined order and no set limit on the number of times a value can occur, any search operation has to look at every row—which, for a huge table, can be very time-consuming. I also demonstrated that for a small input, SQL Server might decide to sort the input before the join so that it could then perform a merge operation. But for a large table, what are the options? Sorting millions of rows before joining could end up taking more time than a brute-force nested-loop join on completely unordered rows. Hashing provides a middle ground. It lets SQL Server organize the data into categories to make finding matching rows more efficient, but this organization isn't a complete sort. Hashing lets you determine whether a particular data item matches an existing value by dividing the existing data into groups based on some property. You can put all the data with the same value for that property into a hash bucket, which is like a temporary table. Then, to see if a new value has a match in the data already examined, you simply look in the bucket for the right value.

For example, suppose you need to keep all the wrenches in your workshop organized so that you can find the right size and type when you need it. You can organize them by their size property and put all half-inch wrenches (of any type) together, all 10mm wrenches together, and so on. When you need to find a 15mm box wrench, you simply go to the 15mm drawer to find one (or to see whether the last person who borrowed it has returned it). You don't have to look through all the wrenches, only the 15mm ones, which greatly reduces your searching time. Alternatively, you can organize your wrenches by type and have all the box wrenches in one drawer, all the socket wrenches in another drawer, and so on. So although your wrenches aren't stored in a completely sorted sequence by both type and size, just putting like wrenches in the same drawer (or bucket) lets you limit your search to that one drawer. Some organization is better than none.

Hashing data in a table isn't quite so straightforward. You need a property by which to divide the data into sets of manageable size with a relatively equal membership. For integers, the hashing property might be something as simple as applying the modulo operator to each existing value. If you decide to hash on the operation modulo 13, every value is put in a hash bucket based on its remainder after dividing by 13. You need 13 buckets because the possible remainders are the integers 0 through 12. In real systems, the hash functions are substantially more complex and the number of buckets can be in the tens of thousands or more; coming up with a good hash function for an application is an art as well as a science. When SQL Server uses hashing to join two inputs, SQL Server uses one input—the build input—to build the hash buckets. Then, one row at a time, it inspects the second input—the probe input—and tries to find a match in the appropriate existing bucket. If it finds a match, SQL Server returns the required columns from the probe input row and the matching build input row.

   Prev. page   [1] 2     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.