Join operations. Columns containing skewed data can also influence the performance of join operations. Consider a table called orders, which contains 500,000 rows. Listing 6 shows the statement that creates the orders table, and Listing 7 contains the batch that loads the table's rows. Suppose 20 percent of your business comes from PC sales, and controllers account for only 0.05 percent of total sales. Also suppose that all relevant columns in the query (customers.customerid, orders.customerid, item_name) are indexed. If you ask for the average income of customers who bought PCs, as Listing 8 shows, SQL Server has to read all rows from the customers table and 80,000 rows from the orders table to perform the join. Because both tables contain so many rows, the query optimizer chooses not to use existing indexes and instead executes the join operation by using the hash join method that Figure 4, page 64, shows. A hash join performs best on tables that differ considerably in size, on unsorted rows, and on nonindexed columns. This join method is two-phased. First, SQL Server builds a hash structure consisting of the smaller (outer) table's values, which have been hashed, then probes each row on the larger (inner) table for matches in the hash table. The hash join scans the inner table only once. In contrast, to execute a join of the customers and orders tables, based on controller sales, as Listing 9, page 64, shows, SQL Server has to read only 200 rows from the orders table. For this reason, the query optimizer chooses to execute the join by using the nested loop join that Figure 5, page 64, shows. In a nested loop join, SQL Server accesses the outer table row by row. Then, looking only for matching rows, SQL Server scans the inner table, which must contain an indexed join column. This join method produces I/O for each index scan of the inner table (i.e., I/O for 200 scans).
Curbing Skewed Data's Misdoings
You can use several methods to enhance response time for queries that have to contend with skewed data:
- Use multiple threads to scan one table. If your SQL Server system runs on an SMP computer, the query optimizer applies parallel table scans.
- Divide the table by using local partitioned views.
- Avoid using variables in batches.
You can apply all these methods in situations where the query optimizer uses a table scanthat is, where the selectivity of a column's value is low.
If your computer has several processors, SQL Server might split a table's rows into different logical parts and process these parts on different processors, thereby permitting parallel table scans. If the query optimizer applies a parallel table scan, you can significantly enhance the response time of Listing 3's query, which accesses skewed data in the customers table. By using sp_configure, SQL Server autoconfigures the max degree of parallelism configuration option, which controls the use of parallel queries.
When you apply parallel scans, SQL Server groups the table's rows. Because this grouping is an internal process, a user can't directly influence it. However, if you use distributed partitioned views (available only in SQL Server 2000), you can specify how to partition the table's rows. The main advantage of distributed partitioned views is that the query optimizer knows which values of the partitioning column are stored in which partition and can omit the search in views that don't contain the value from a given predicate. For example, consider the query in Listing 4. If you create two distributed partitioned views, one with females and one with males, the query optimizer will search the rows of the all-male view to satisfy the query. (For more information about distributed partitioned views, see the three-part series by Kalen Delaney and Itzik Ben-Gan: "Distributed Partitioned Views in SQL Server 2000," August 2000, "Querying Distributed Partitioned Views," September 2000, and "Modifying Views with INSTEAD OF Triggers," October 2000.)
As for avoiding using variables in batches, let's look again at Listing 5. To execute the query in Listing 5, the query optimizer can't use the existing statistics because a variable in one of the predicates is joined by the AND operator. For this reason, avoid batches that contain a variable. If you have to use such a query in a batch and if the join column's value has high selectivity, consider using the INDEX query hint to maintain index access.
Accounting for Data Skew During Database Design
Because many columns in relational tables contain skewed data, carefully consider your data distribution when you design your database. First, take skewed data into account when you create indexes for your database tables. The way you use indexes for columns containing skewed data depends on the values of the column that you name in a predicate. Second, be aware that the size of tables containing skewed data directly impacts performance by index access and table scans. Index access becomes faster and table scans slow down as tables become larger, resulting in an increasing performance gap between the two access strategies. Hence, the effect of skewed data is highest in the data-warehousing field.
As you've seen in this article, the query optimizer accesses skewed data effectively. The only case in which the query optimizer doesn't recognize the selectivity of a column's value is when the predicate contains a variable. However, if your query's predicate contains a column that has skewed data, you can enhance your query's response time by using parallel table scans or local partitioned views and by avoiding variables in batches.
End of Article
Prev. page
1
[2]
next page -->