SideBar    Grouping Web Site Hits into Sessions
DOWNLOAD THE CODE:
Download the Code 23713.zip

In a data-mart extract, adding an index to the table slowed my query down. The index led SQL Server to choose a merge join rather than a hash join. You can tell by looking at the graphical query plans in Query Analyzer. A merge join would have been fine, except that the inequality conditions in the JOIN clause forced the merge to be many-to-many (M:N), which is much slower than one-to-many (1:M). In this case, leaving the indexes off made the merge join less attractive to the SQL Server optimizer. SQL Server chose a hash join, and the query ran faster. If you implement this algorithm for your project, you should test and examine the query plans before you add or remove any indexes.

Tweaking the indexes can be helpful, but reducing the quantity of data before you join the tables is better. One approach to reducing the quantity of data to join is to use the start-date results to find end dates more quickly. Depending on the data, the start-date table might be a fraction of the size of the original table. If you can join the start-date table to the original table instead of joining the original table to itself, the query's performance should improve.

You're really looking for two kinds of end dates. One kind of end date falls at the start of a temporary lapse in the subscription: for example, Phil's Car and Driver expiration of October 1, 1991. The other kind of end date is the last one for a particular subscriber and magazine—for example, Phil's final Car and Driver expiration of July 1, 2001, and Andrea's subscription expirations of January 10, 2000, and March 7, 2000. The end dates that you want in your final results fall into one of these two end-date categories.

If you've already generated a table containing the results of Listing 2 (the start-date query), you can use this table to find the first category of end dates. (Callout A in Listing 7, page 6, creates the start-date table, and callout B in Listing 7 populates the table.) The first category of end dates all fall before some start date. In fact, the end dates you want are the "last," "maximum," or "most recent" ones before a given start date. Callout C in Listing 7 shows the query that finds these end dates, those that begin a temporary lapse of subscription.

Finding the second kind of end dates doesn't require a join at all:

SELECT m.subscriber_name, m.magazine_name, MAX(m.subscription_end) AS 
  subscription_end
FROM magazine_subs m
GROUP BY m.subscriber_name, m.magazine_name

A faster alternative for finding end dates is to combine these two queries by using a UNION ALL, as Listing 8 shows. A simple UNION would also work, but it would not perform as well. UNION alone implicitly adds DISTINCT to a query. In our case, the DISTINCT is unnecessary and would force the query engine to do extra work. In real life, the data-mart equivalent of magazine_subs contained 16 million rows. Using the query in Listing 3 to find the end dates took about 2 hours; the query in Listing 8 took 40 minutes.

This algorithm works just as well when you're grouping columns other than subscriber_name and magazine_name. For example, if you're only interested in the periods of time during which a subscriber was receiving any magazine, you can omit magazine_name from Listing 4 to get the code that Listing 9 shows. The result still reflects the gap in Phil's subscriptions, but now Andrea's Poodle Patrol phase is swallowed up in her long Cat Fancy subscription. (To see the periods when anyone was subscribing to any magazine, you would remove both magazine_name and subscriber_name.)

More Than One Application
The continuous time and date problem comes up more often than you might think—for example, collecting Web tracking data (Web site hits by site URL and caller IP address) into sessions. This problem looks quite different from the continuous time and date problem, but it isn't. Rather than focusing on overlapping intervals, this problem needs to group points in time. The sidebar "Grouping Web Site Hits into Sessions," page 4, shows how you could use a variation of the magazine-subscription problem to solve the problem of identifying Web sessions.

This T-SQL solution avoids loops and cursors. Developing such a set-oriented algorithm can be challenging, but the result is more elegant than an iterative approach and, in my test, performed better. The more you look around at various problems you need to solve, the more applications for this algorithm you're likely to find.

End of Article

Prev. page     1 [2]     next page -->



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

good advise

marinadavydova

Article Rating 5 out of 5

 
 

ADS BY GOOGLE