• subscribe
August 19, 2003 12:00 AM

Try, Try Again

Readers offer alternative solutions to the relational-division puzzle
SQL Server Pro
InstantDoc ID #39604
Downloads
39604.zip

Nöth's Solution
Nöth's solution originally used derived tables, but I use views again to present it in steps. This solution also compares course materials and counts of materials used for both courses. However, Nöth adds another important aspect that makes the solution highly efficient. First, the solution compares several aggregations to find pairs of courses that have a good chance of having the same course materials, then it counts the matching base rows in CourseMaterials. By "good chance," I mean that by comparing only the result of aggregate functions of different sets, you always get matches for sets with the same members, but you might also get matches for sets with different members. For example, the count, sum, minimum, and maximum of the sets {1,3,5,7} and {1,2,6,7} are the same. However, the probability that several different aggregate functions will return the same results for sets with different members is low, so the solution has to compare the actual set materials for a small percentage of the courses.

The first step in this solution is to run the code that Listing 5 shows to create the CMAgg view, which calculates several aggregations for each course. The view returns, for each course, the count of course materials and the minimum, maximum, and sum of materialids. Figure 3 shows the results of a SELECT * query against CMAgg.

The second step is to create the CM view by running the code that Listing 6, page 18, shows. The view's query joins the CMAgg view to itself, returning pairs of courses that return the same results for the different aggregate functions and also explicitly returning the count of their course materials, which is used in a later step. As I mentioned earlier, courses that return the same results for the different aggregate functions are most likely to have the same course materials. Notice that the view's query in Listing 6 looks only for matching courses that have different course IDs and uses the greater-than operator (CM1.courseid > CM2.courseid) to eliminate duplicates (e.g., if courses 3 and 2 have matching aggregates, you get the pair 3, 2 but not 2, 3). Figure 4 shows the results of running a SELECT * query against CM. Note that you get only one row for the sample data I provided in August's Listing 1. For large data sets, the query would probably return more rows, but it always returns only the pairs that have matching aggregations and different course IDs.

The third step is to create the DT view by running the code that Listing 7 shows. The view's query filters for only pairs of courses from CM that have the same count as the number of matching course materials for the pair of courses. In the WHERE clause's subquery, this step calculates the number of matching course materials for the pair of courses. The code groups the result by courseid1 and returns the minimum courseid2 for each unique instance of courseid1. Figure 5 shows the results of running a SELECT * query against DT.

To finish this solution, you simply write a query that returns, for each course in Courses, the minimum courseid in DT or the course ID itself from Courses if a match doesn't exist in DT. Listing 8 shows the complete solution. The left outer join between Courses and DT returns the rows from Courses that have no match in DT. The COALESCE() function returns the desired results.

If you'd rather not use views for reasons such as those I described earlier, you can try the one-step solution that Listing 9 shows. This solution uses only derived tables. Nöth's set-based solution, with or without the derived tables, is amazingly efficient, running in less than 1 second on my laptop against a table of about 55,000 course materials. It also scales very well, running in 2 seconds for 110,000 rows and 3 seconds for 165,000 rows.

Never Give Up
In last month's benchmarking sidebar, my set-based solutions gave poor performance compared to the hybrid set-based and iterative solutions. Tuttle's and Nöth's alternative solutions prove that you should never give up trying to find an efficient set-based solution. Next month, I'll show two readers' solutions to puzzles I presented in January and February 2003.



ARTICLE TOOLS

Comments
  • Itzik Ben-Gan
    9 years ago
    Sep 21, 2003

    Hi, Alejandro,

    It's encouraging to see that people don't give up on pure set-based solutions! Your solution is interesting and runs for a minute on my laptop against ~55,000 rows. I've got a few more pure set-based solutions from other readers, and one of them even performs better than all the solutions that I presented in the September column. Stay tuned until the October column, where I present a couple of those solutions. Thanks for sharing your solution.
    --Itzik

  • Alejandro Mesa
    9 years ago
    Sep 05, 2003

    I really enjoy every single article on this Web site. Here's another try at solving the relational-division puzzle:

    select c.courseid,
    case
    when coalesce(min(a.c2), c.courseid) > c.courseid then c.courseid
    else coalesce(min(a.c2), c.courseid)
    end as min_courseid
    from courses as c
    left join
    (
    select cm1.courseid,
    cm2.courseid,
    count(*)
    from courseMaterials as cm1 inner join courseMaterials as cm2
    on cm1.materialid = cm2.materialid
    where cm1.courseid <> cm2.courseid
    group by cm1.courseid, cm2.courseid
    having 2 * count(*) = (
    select count(*)
    from courseMaterials as cm
    where cm.courseid in (cm1.courseid, cm2.courseid)
    )
    ) as a(c1, c2, mats)
    on c.courseid = a.c1
    group by c.courseid

    I'm not totally proficient in English yet (still learning), but here's my explanation of the solution. Joining the table courseMaterials to itself by materialid and grouping on both course ids (cm1.courseid, cm2.courseid) having its count(*) multiplied by 2 equal to the number of materials used by both course ids will give us the pairs of courses that use the same materials.

    If we left-join the table courses to the previous result and group by courseid, we can have the minimum between the courseid and the ones that match.

    Regards,
    Alejandro Mesa

You must log on before posting a comment.

Are you a new visitor? Register Here