Using arrays and checksum values. The first modification I tried was to use the CHECKSUM_AGG() function, which returns an integer checksum value for a set of values in a group. I created a view that for each course returns an array and a checksum value for the set of materials:
CREATE VIEW VCSArr
AS
SELECT courseid,
CHECKSUM_AGG(materialid) AS CS,
dbo.fn_generatearr(courseid) AS arr
FROM CourseMaterials
GROUP BY courseid
Hoping that the checksum values would help the query optimizer partition the data, I also used both values in the JOIN condition:
SELECT A.courseid,
MIN(B.courseid) AS mincourseid
FROM VCSArr AS A
JOIN VCSArr AS B
ON A.cs = B.cs
AND A.arr = B.arr
GROUP BY A.courseid
When the query processed up to a few thousand courses, the VCSArr-view solution performed in half the time of the VArr-view solution, but when the query attempted to process 10,000 courses or more, both solutions performed almost the same. So, the search for a faster solution continued.
Using arrays and temporary tables. I knew that temporary tables can help store intermediate results of calculations, which you can index and access several times. So, I decided to see how well temporary tables would work with an array. Similar to the VArr view, the following code stores course IDs and arrays of their material IDs in a temporary table called #TArr and creates an index on arr and courseid:
SELECT courseid,
dbo.fn_generatearr(courseid)
AS arr
INTO #TArr
FROM Courses
CREATE UNIQUE CLUSTERED INDEX
idx_uc_arr_courseid
ON #TArr(arr, courseid)
Note that an index row can't exceed 900 bytes, so arr is limited to 896 bytes because the courseid column is also included in the index. This restriction means that you're limited to 249 materials per course. If you need more materials, you can use the result of a CHECKSUM() function on the array in an index instead of using the array itself. As in the VArr-view solution, you use a query to match courses that have the same array, as follows:
SELECT A.courseid,
MIN(B.courseid) AS mincourseid
FROM #TArr AS A
JOIN #TArr AS B
ON A.arr = B.arr
GROUP BY A.courseid
This solution performs well and scales well. In my tests, it ran in less than 6 seconds for up to 10,000 courses and in about a minute for 100,000 courses (about 550,000 rows in CourseMaterials). The following code generates the temporary table, including for each course the array and a checksum of the course materials:
SELECT courseid,
CHECKSUM_AGG(materialid) AS cs,
dbo.fn_generatearr(courseid) AS arr
INTO #TCSArr
FROM CourseMaterials
GROUP BY courseid
If the array length exceeds 892 bytes (900 bytes minus 8 bytes for the checksum and courseid values), exclude it from the index; if the array is less than 892 bytes long, include it:
CREATE UNIQUE CLUSTERED INDEX
idx_uc_cs_arr_courseid
ON #TCSArr(cs, arr, courseid)
Finally, the following query (which is similar to the query against the VCSArr view) returns the desired results:
SELECT A.courseid,
MIN(B.courseid) AS mincourseid
FROM #TCSArr AS A
JOIN #TCSArr AS B
ON A.cs = B.cs
AND A.arr = B.arr
GROUP BY A.courseid
Note that when arr fits in the index, this solution provides no performance improvement over the VArr-view solution.
Array or Not?
Each successive array-based solution to the relational-division puzzle performed better than the previous one. Tuning is an ongoing process: When you're not satisfied with your solution's performance, you can probably find a faster way to achieve the desired results. Pure set-based solutions should be your first choice because they're usually short, portable, and easy to maintain. But hybrid arrays that combine iterative and set-based logic might perform better.