T-SQL is a language that lets you formulate requests in a logical manner. By "logical," I mean that when you write a query, you specify what results you want but not how you want to get them. Devising how to process the query is the job of the query optimizer. Every problem you face that requires a T-SQL solution usually has many different solutions that eventually return the same result. In an ideal world, given two different queries that perform the same task, the optimizer would generate the same execution planthe optimal planfor both. The SQL Server development team seems to be moving in this direction. With every release of SQL Server, the optimizer devises more sophisticated and efficient plans, and the chances increase that different queries that perform the same task will use the same plan.
However, in many situations, the way you write a query can still dramatically affect the query's performance. To let the optimizer choose among different plans and find an efficient one, you need to do some preparationnamely, creating indexes. Most people create regular indexes as part of their database tuning, but I haven't seen indexed views implemented widely. Indexed views can dramatically improve query performance, especially when you're aggregating data. In some situations, after creating and indexing views, you don't even need to change your original queries that refer to the base tables because the optimizer is smart enough to use those indexes.
Let's look at a couple of problems in which using indexed views yields improved performance. I'm assuming that you're already familiar with the basics of indexed views. (For details about indexed views, see Kalen Delaney's "Introducing Indexed Views," May 2000, InstantDoc ID 8410.)
The New Customers Problem
The first problem involves a typical marketing requestfor each period (e.g., month), what is the number of new, existing, and total customers based on order activity? A month's new customers are customers who placed an order for the first time. Existing customers are those who placed orders in earlier months. Use the Orders table in the Northwind database as your initial base data. Table 1 shows an abbreviated version of the desired output. After creating a solution and verifying that your code is logically correct, test its performance against a larger Orders table, which you can create in tempdb by running the code that Listing 1, page 18, shows.
Listing 2, page 18, shows my first solution to this problem. The innermost query defining the derived table M returns the distinct month and customer combinations from Orders. The month is expressed as a datetime value that falls on the first day of the month. The query defining the derived table N uses a CASE expression that returns, for each month and customer combination, a 1 if the customer hadn't placed an order before the current month and a 0 otherwise. In other words, the CASE expression returns 1 if the customer is new. The outermost query groups the rows in N by month and calculates the total number of new customers by summarizing all the 1s in the column named new in the N table. The total number of customers is the number of rows in the group (COUNT(*)) because each row in the group represents a different customer. The number of existing customers is the total number of customers minus the number of new ones.
When I wrote this solution, I didn't take performance into consideration; I thought only of solving the query logically. The query incurred more than 137,000 logical reads and ran for 17 seconds on my laptop against the large Orders table I created in tempdb. Regular indexes provided little helpafter I created a composite index on the customerid and orderdate columns, the query ran for 15 seconds. The problem called for a different approach, and this time, I decided to keep performance in mind.
The larger Orders table can have many occurrences of orders for each customer each month. If you had a summarized table that contained only distinct combinations of months and customers, you could write a better-performing query. Summarizing tables is what indexed views are all about. Run the code in Listing 3 to create an indexed view called Vymcusts that contains only distinct combinations of months and customers. Note that you have to include the COUNT_BIG(*) function inside a view that uses aggregations if you want to index the view. When SQL Server knows the number of rows in each group, it can maintain the indexed view efficiently.
Next, I tested several different solutions, all of which performed much better than the first. Let's look at the best-performing solution I came up with. I created another view called Vnewcusts that contains the minimum year/month (ym) value for each customer:
CREATE VIEW Vnewcusts
AS
SELECT customerid, MIN(ym) AS min_ym
FROM Vymcusts (NOEXPAND)
GROUP BY customerid
The minimum ym value for a customer is the month of that customer's first order. The NOEXPAND hint tells the optimizer to treat the indexed view as a table and not attempt to use indexes from the base table Orders. Finally, I left-joined Vymcusts (VY) to Vnewcusts (VN). A row from VY finds a match in VN if they both have the same customer and the current month in VY is the minimum month for the customer (VY.ym = VN.min_ym). The outer join returns NULLs in VN.customerid and in all columns from VN if the month isn't the minimum month for the customer. The query groups the result of the outer join by the VY.ym column. SQL Server calculates the total number of new customers by counting the number of non-NULL values in VN.customerid. Listing 4 shows the final query, which ran for 1 second on my laptop and incurred only 280 logical reads. That performance is satisfactory.