Enhance query response time by adopting these methods for accessing skewed data
A table's columns usually represent the properties of real-life entities such as employee and product names. When your query contains a predicatea WHERE clause containing search criteriathat includes a column, SQL Server's query optimizer can find the data that satisfies the query in two different ways: by using a table scan or index access. In a table scan, the query optimizer sequentially reads all rows of a table and compares each row with the search criteria in the WHERE clause. SQL Server typically decides to scan the entire table when the query selects a significant number of table rows. The query optimizer considers index access when a column index (clustered or nonclustered) exists. The decision about whether to use the existing index depends on many different factors. (For a description of factors that influence the use of index access, see "Which Is Faster: Index Access or Table Scan?" January 2000.)
If the query optimizer has to decide how to access a column that contains nonunique values, the distribution of distinct values for the column across a table's rows becomes an important consideration. Suppose your organization stores all customer names in the customers table, which includes two columns named gender and country. Also suppose your business is based mainly in North America and that the country column has 10 distinct values. If 70 percent of your customers live in the United States, 10 percent in Canada, 5 percent in Mexico, and 0.05 percent in Germany, the distinct values in the country column aren't uniformly distributed across the rows of the customers table. In this example, the country column's data is skewed because the frequency with which this column's existing values (United States, Canada, Mexico, Germany, and so on) appear in the rows of the customers table differs significantly. In contrast, the gender column lists roughly the same number of males and females, so the gender column's data isn't skewed. Let's look at different SELECT statements to see how the query optimizer decides to select the resultset in different situations in which the skewed-data column is in the WHERE predicate. In preparation, you need to first write a T-SQL statement that creates the customers table, which Listing 1, page 62, shows. You can then use the batch that Listing 2, page 62, shows to load 100,000 rows into this table.
Skewed Data Causes Plenty of Mischief
Skewed data can affect the query execution plan of queries containing a predicate and queries containing a join operation. Let's first explore the impact on queries containing a predicate.
Evaluating the predicate. Suppose you want to find out how many male customers you have in Germany. Because Germany accounts for only 0.05 percent of your customers in the earlier example, the predicate that Listing 3 shows will select at most 50 customers. The relationship between a predicate and the number of rows it chooses exemplifies the concept of selectivity. A predicate's selectivity is the ratio of the number of rows satisfying the WHERE clause to the total number of rows in the table. The selectivity of an index is said to be high, as in this scenario, when the index key value identifies only a few rows. If selectivity is high and if a corresponding index exists, the query optimizer typically uses index access because retrieving a relatively small number of rows is much faster with the sorted row order that the index access uses than sequentially scanning the whole table. Applying the index access strategy, the query optimizer uses the existing index i_country to satisfy the query, as Figure 1 shows. In contrast, Listing 4's predicate, which identifies US customers, selects significantly more rows. In this case, the best access strategy is a table scan, which Figure 2 shows, because reading all data pages sequentially is significantly faster than reading specific pages by using the existing index. So we can see that the query optimizer chooses a table scan if the I/O estimate for index access is greater than the number of pages that the table fills. Here, selectivity is low.
In Listing 3 and Listing 4, the query optimizer chooses the query plan you expect. However, if you try to write a generic SELECT statement that you can use to query any country your customers live in, the query optimizer's choice for data access might surprise you. Following the earlier line of reasoning, you might conclude that the query optimizer would choose the index scan for executing the batch that Listing 5 shows because the @country variable's value is set to Germany. Instead, the query optimizer uses a table scaneven if the corresponding index for the country column exists. The reason for this behavior is simple: The query optimizer can't use the existing statistics as the queries in Listing 3 and Listing 4 did because one of Listing 5's predicates contains a variable. And Listing 5's batch declares the @country variable in the same scope as the query. The query optimizer doesn't know the value of the variable when it chooses the access method because it doesn't perform multiple passes when it optimizes the query. Thus, Figure 3 shows that the query optimizer used the table scan because the Row count parameter has a value of 0. The query optimizer assigns this value when it doesn't know the number of selected rows.
Prev. page  
[1]
2
next page