Nöth's Solution
Nöth's solution originally used derived tables, but I use views again to present it in steps. This solution also compares course materials and counts of materials used for both courses. However, Nöth adds another important aspect that makes the solution highly efficient. First, the solution compares several aggregations to find pairs of courses that have a good chance of having the same course materials, then it counts the matching base rows in CourseMaterials. By "good chance," I mean that by comparing only the result of aggregate functions of different sets, you always get matches for sets with the same members, but you might also get matches for sets with different members. For example, the count, sum, minimum, and maximum of the sets {1,3,5,7} and {1,2,6,7} are the same. However, the probability that several different aggregate functions will return the same results for sets with different members is low, so the solution has to compare the actual set materials for a small percentage of the courses.
The first step in this solution is to run the code that Listing 5 shows to create the CMAgg view, which calculates several aggregations for each course. The view returns, for each course, the count of course materials and the minimum, maximum, and sum of materialids. Figure 3 shows the results of a SELECT * query against CMAgg.
The second step is to create the CM view by running the code that Listing 6, page 18, shows. The view's query joins the CMAgg view to itself, returning pairs of courses that return the same results for the different aggregate functions and also explicitly returning the count of their course materials, which is used in a later step. As I mentioned earlier, courses that return the same results for the different aggregate functions are most likely to have the same course materials. Notice that the view's query in Listing 6 looks only for matching courses that have different course IDs and uses the greater-than operator (CM1.courseid > CM2.courseid) to eliminate duplicates (e.g., if courses 3 and 2 have matching aggregates, you get the pair 3, 2 but not 2, 3). Figure 4 shows the results of running a SELECT * query against CM. Note that you get only one row for the sample data I provided in August's Listing 1. For large data sets, the query would probably return more rows, but it always returns only the pairs that have matching aggregations and different course IDs.
The third step is to create the DT view by running the code that Listing 7 shows. The view's query filters for only pairs of courses from CM that have the same count as the number of matching course materials for the pair of courses. In the WHERE clause's subquery, this step calculates the number of matching course materials for the pair of courses. The code groups the result by courseid1 and returns the minimum courseid2 for each unique instance of courseid1. Figure 5 shows the results of running a SELECT * query against DT.
To finish this solution, you simply write a query that returns, for each course in Courses, the minimum courseid in DT or the course ID itself from Courses if a match doesn't exist in DT. Listing 8 shows the complete solution. The left outer join between Courses and DT returns the rows from Courses that have no match in DT. The COALESCE() function returns the desired results.
If you'd rather not use views for reasons such as those I described earlier, you can try the one-step solution that Listing 9 shows. This solution uses only derived tables. Nöth's set-based solution, with or without the derived tables, is amazingly efficient, running in less than 1 second on my laptop against a table of about 55,000 course materials. It also scales very well, running in 2 seconds for 110,000 rows and 3 seconds for 165,000 rows.
Never Give Up
In last month's benchmarking sidebar, my set-based solutions gave poor performance compared to the hybrid set-based and iterative solutions. Tuttle's and Nöth's alternative solutions prove that you should never give up trying to find an efficient set-based solution. Next month, I'll show two readers' solutions to puzzles I presented in January and February 2003.