DOWNLOAD THE CODE:
Download the Code 46117.zip

Note: The solutions presented in this article require SQL Server 2005 Beta 1 or higher. They were developed and tested on a pre-release version of SQL Server 2005 Beta 2 by Lubor Kollar and Itzik Ben-Gan.

Imagine that you need to know whether one person is an ancestor of another person. A data structure called transitive closure can help. For example, Fosco Baggins is the father of Drogo, and Drogo is the father of Frodo; therefore, Fosco is an ancestor of Frodo by transitivity. Simply put, transitive closure is a data structure, constructed in this case from family trees, that can help you quickly answer such questions. More formally, transitive closure is an extension or superset of a binary relation such that whenever (a,b) and (b,c) are in the extension, (a,c) is also in the extension. In essence, this data structure lets you efficiently answer "reachability" questions such as, Does a path from point a to point b exist? The input to the process (which in this case is T-SQL code) that generates the transitive closure is a directed graph (pairs of nodes with direction that can be followed in only one way).

Many databases store such graphs—for example, a bill of materials (BOM) with assembly/item pairs, Web links from source page to target page, and employee hierarchies. The output is pairs of nodes that have a path between them. For example, the transitive closure of a BOM input is all pairs of containing/contained items, be they contained directly—e.g., if a cake contains cream, return (cake, cream)—or indirectly—e.g., if a cake contains cream and cream contains sugar, a cake transitively contains sugar; return (cake, cream), (cream, sugar), (cake, sugar).

The transitive closure of a Web links graph is all pairs of source/target pages—again, with direct or indirect links between them. Once the transitive closure of the input graph is established, you can simply determine whether a certain pair exists to answer questions such as, Does item a contain item c? and, Can I get from page x to page y by following Web links? You can find the formal definition of transitive closure and related terms and algorithms at the National Institute of Standards and Technology (NIST) Web site (http://www.nist.gov/dads/HTML/transitiveClosure.html).

You might want to generate transitive closure in a database when your application needs to answer, as quickly as possible, frequent reachability requests against a graph stored in your database. For example, suppose you get many queries against the BOM in your database asking whether one item, such as a chocolate cake, contains another item, sugar. You can use T-SQL to implement an iterative or recursive algorithm that, for each request, traverses the graph starting with the containing item and looks for the contained item.

However, if you're familiar with such iterative solutions, you know that they're usually slow and costly in overhead. Alternatively, you can write T-SQL code that generates a transitive closure of your BOM and materializes it as a table. The table will contain the transitive closure of the BOM—namely, all pairs of items that have a path between them. As soon as your code materializes the transitive closure of your BOM, you can use it to answer all queries about whether one item contains another with a simple query that performs one seek operation within the index you created on the pair of attributes.

A related problem is finding the shortest path between two nodes in a graph. For example, if you have Web links information stored in your SQL Server database, your application might need to answer frequent requests for the minimum number of Web links that would take you from one Web page to another. Robert Floyd's and Stephen Warshall's transitive closure algorithms are commonly used in other computing environments. (For details, search the keywords Floyd and Warshall at http://www.nist.gov/dads.) However, although these algorithms might offer the most convenient programming solutions for the aforementioned problems, they don't work for relational databases because they can't be expressed in SQL queries.

With the introduction of recursive common table expressions (CTEs) in SQL Server 2005, you can provide pure T-SQL solutions to the transitive-closure and shortest-path problems. Let's look at some of these now. We first present solutions for acyclic graphs, in which no path exists from a node to itself; these are fairly simple. Next, we examine the more complex solutions for cyclic graphs, in which a path can exist from a node to itself. We assume here that you're familiar with recursive CTEs. For information about recursive CTEs, see "Get in the Loop with CTEs," May 2004 (InstantDoc ID 42072), and "Cycling with CTEs," June 2004 (InstantDoc ID 42452).

Acyclic Graphs
Our examples use a Web links scenario. Before we begin, run the code that Listing 1 shows to create the Links table and populate it with sample data. The Links table contains a row for each Web link, including two attributes: source page ID (src_pid) and target page ID (tgt_pid). The sample data represents an acyclic graph, meaning that no path you can follow leads from a page to itself. Figure 1 depicts the acyclic directed graph represented by the Links table.

The first situation we'll look at is a request to generate the transitive closure of the Links table. Table 1 shows the desired output, which contains all pairs of source/target page IDs that have a path between them. Listing 2 shows the solution to the request.

At callout A, the anchor member of the CTE called TC returns all (source, target) Web-page pairs from the Links table, representing target pages that can be accessed directly from the source pages (following a single link). The recursive member at callout B joins the result of the previous step (TC aliased as P for Parent) to the Links table (aliased as C for Child) in order to return target pages that a user can access indirectly from the source pages (by following more than one link). The JOIN condition looks for rows in Links where the source page ID is equal to the previous step's target page ID. The SELECT list returns the source page ID from P and the target page ID from C.

   Prev. page   [1] 2     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.