NULLs, UNION ALL, and random sorting
Editor's Note:Send your experts-only T-SQL tips to Itzik Ben-Gan at blackbelt@sqlmag.com. If we use your tip in the magazine, you'll receive $100 and an exclusive T-SQL Black Belt shirt.
Sorting isn't always as simple as it seems. From time to time, T-SQL programmers receive requests for output sorted in a unique or unusual way and must handle both the logical aspects and the performance implications of those requests. For example, whenever you provide a sorting solution that involves NULLs, you must determine how to handle them. The sorting ideas of Alexey Rubana software engineer, developer, and DBA from Ukraineand SQL Server Most Valuable Professionals (MVPs) Fernando G. Guerrero and Tibor Karaszi motivated me to write this article.
Getting NULLs to Sort High
NULLs are special, no doubt. Like some children, they seem to require special care and attention. Sorting output by columns that allow NULLs is no exception. NULL stands for either an unknown or an irrelevant value. So, in general, you can't assume that one NULL is the same as another. For example, to find rows that have NULLs in a specific column, you might issue the following query:
SELECT * FROM T1 WHERE colwithnulls = NULL
However, according to the ANSI standard, you won't get any rows in the output, even if the colwithnulls column contains some NULLs. To get all rows that contain NULLs in a certain column, you must use the IS NULL clause instead of = NULL. Or, you can turn off the ANSI_NULLS session option so that SQL Server treats NULL comparisons that use an equals sign as if you had specified IS NULL.
Regardless of the ANSI_NULLS setting, certain operations (e.g., the UNIQUE constraint in SQL Server, GROUP BY, ORDER BY) treat NULLs as if they were equal to one another. The ANSI standard specifies that NULLs should sort into one group as if they were equal, but it doesn't specify whether NULLs should sort lower or higher than actual values (i.e., whether they should appear at the beginning or the end of the output). Therefore, different ANSI-compliant relational database management systems (RDBMSs) implement NULL sorting differentlysome placing NULLS at the end and some placing them at the beginning.
SQL Server sorts NULLs as if they're the lowest value. For example, let's select all rows from the Publishers table in the Pubs database, ordering by state. Then, use a new SQL Server 2000 optimization technique, which lets you create indexes on computed columns, to put an index on the state column:
CREATE INDEX idx_nci_state ON Publishers(state)
SELECT * FROM Publishers
ORDER BY state
Figure 1 shows this query's execution plan. The query performs an ordered index scan on the idx_nci_state index you created. This scan, which improves the query's performance, prevents an explicit sort operation from taking place because the index is already sorted by the state column. Note that the Publishers table is very small. If the table were bigger, the optimizer might choose not to use an index. Using an index involves visiting a page to look up the data row for each key in the index, and the I/O involved might cost more than simply scanning the table and performing an explicit sort. Nevertheless, for this demonstration, the Publishers table's size is convenient.
Suppose you want to sort NULLs high. You can use a CASE expression in the ORDER BY clause:
SELECT * FROM Publishers
ORDER BY CASE
WHEN state IS NULL THEN 1
ELSE 0
END, state
The first value in the ORDER BY clause will be either 1 (when state is NULL) or 0 (when state has an actual value); this value determines the output's primary order. The values in the state column determine the output's secondary order so that you get the desired order for the rows in the output. Note that the ANSI-92 standard specifies that the ORDER BY clause can use only expressions that appear in the SELECT list, but ANSI-99 removed this limitation. SQL Server has never had that restriction.
Figure 2 shows the execution plan for the above query. Notice that a performance problem exists. The query performed an explicit sort operation based on the 1 and 0 values that the compute scalar operation calculated. You can't use the state column index here because the state column doesn't determine the output's primary order; a computation based on the state column does. If you use SQL Server 2000, you can handle this problem elegantly. Add the following computed column to the Publishers table:
ALTER TABLE Publishers
ADD statenullslast AS CASE
WHEN state IS NULL THEN 1
ELSE 0
END
Create a composite index on the statenullslast and state columns:
CREATE INDEX idx_nci_statenullslast ON Publishers(statenullslast, state)
Now, try the following query:
SELECT pub_id, pub_name, city, state, country
FROM Publishers
ORDER BY statenullslast, state
Figure 3 shows the execution plan for this query. Note that the plan uses the new index instead of performing an explicit sort operation.