Data Extraction
Many SQL Server developers have difficulty extracting data from a hierarchical structure. Your next assignment is to provide the appropriate SQL statements to give Internet users efficient access to the tables you created to house entity relationships. For the example data, let's model a little of Earth's geographical map; Figure 5 contains some sample data. Data that is derived from tables containing hierarchical data is often presented in a Web-based tree control. Have you ever noticed how some Web pages seem to take an inordinately long time to expand a tree control? During the delay between every click of the + sign, these pages are actually performing a round-trip to the server for every node in the tree. They are issuing database requests to the server to retrieve the children of the node you just clicked.
Performing numerous round-trips to a database is an extremely inefficient means of interacting with both the database and the user, for several reasons. Mainly, a Web page is largely ineffective as a means of interacting with a user if the user must wait several seconds between clicks. Internet users generally will wait a few extra seconds up front if their subsequent successive requests are instantaneous. Also, every round-trip to the database has associated overhead, including database connection time, SQL parsing, and SQL execution-plan generation.
A preferable means of implementing the tree control is to download the entire result hierarchy at the first request, then use Dynamic HTML (DHTML) to expand the tree. The only drawback of such a design is that you might be transferring an enormous data set. If your tables contain tens of thousands of records (e.g., if you were trying to present a list of cities by country), you might first filter the resultset by asking the user to specify a subset of the data (perhaps through a combo box selection) before issuing the SQL statement to download the cities.
But before you can code the HTML to present your data, you need to write the SQL statements that will extract data and send it to the client. Using a tree control, the client will ultimately render the resultset data on the screen. The SQL statements that generate the resultset will be part of a stored procedure, which begins at a specified starting point and returns a set of records containing all related entities.
Before diving into a subterranean exploration of the SQL composing this stored procedure, which Listing 4 shows, let's take a hundred-foot view. The goal of this stored procedure is to find all entities that are related to the starting point. Using our previous example, your starting point might be San Francisco, in which case the stored procedure would return all entities that have any relationship, direct or indirect, with that city. This example uses the Planet Earth record as a starting point for retrieving all related records, which should be all the records in the table. The records returned will contain information about what entity each is related to, how many levels separate them from the starting point, and so on. Using the resultset, HTML programmers can render the entire tree control without the need for successive round-trip database requests, because the resultset contains the relationships between entities, rather than the entities themselves.
The procedure starts by creating a temporary table called #Map, which it will use as a symbolic data stack. (The stack is symbolic because SQL Server has no inherent stack capabilities, but you can simulate a stack by using a field called Processed in the #Map table.) The procedure pushes records onto the stack, then pops them off the stack. If the Processed field's value is set to 0, the record is considered to be on the stack; if the field's value is 1, the record is considered to have been popped off the stack. After creating the #Map table, the code proceeds by pushing the first record onto the stack; this record is the starting point from which the code will begin to look for related records. The procedure then enters a looping structure, which pops the top record from the stack. The procedure uses this record to find related records, which the procedure subsequently pushes onto the stack. When the stack is empty (i.e., all records have the Processed field set to 1), the procedure returns the records it found and stored in the #Map table, and you're done.
Note that this stored procedure will return all entities related in any way to the starting point. If you want only the entities of a certain relationship type, you need to modify the stored procedure. For example, if Person A is related to Person B by a "blood relationship" and Person B is related to Person C by a "friend relationship," then with Person A as a starting point, the procedure will return all three because A is related to B and B is related to C, making C related to A. To find only the persons related by blood to Person A, you need to modify this stored procedure by specifying the relationship type in the WHERE clause. Now let's go through the SQL code in Listing 4 in more detail.
Callout A in Listing 4 defines the stored procedure name as ap_Map_Traversal. The prefix "ap" signifies that the procedure is an application procedure and not a system procedure. This nomenclature helps when you want to easily differentiate Microsoft-supplied procedures from those you need to run your application. The procedure takes one parameter, @startPoint, which is an entity identifier; the procedure will begin looking for related entities from this origin point. If the parameter value is null, Callout B initializes the starting point to a default starting point. You need to adjust SET NOCOUNT ON to reflect your default starting point.
Callout C declares an integer variable, @level; the procedure will use this variable as a counter that reflects the number of levels that separate an entity from the starting point. Callout D creates the #Map temporary table. The #Map table has six columns:
- Itemcontains the entity's unique identifier
- ItemNamecontains a descriptor for the entity defined in the Item column
- RelatedToItemcontains an entity's unique identifer, to which the Item column is related by the relationship specified in the Relationship column
- Relationshipcontains the relationship type by which RelatedToItem and Item are related
- Generationcontains an integer value that specifies the number of levels by which the item is separated from @startPoint, which has a generational value of 1
- Processedcontains a value of 0 or 1, specifying whether the point (as defined by the Item column) is awaiting processing or has completed processing
Callout E inserts the root node into the temporary table #Map. Callout F initializes the @level variable value to 1. Callout G comprises the looping structure, which will loop as long as the @level variable is greater than 0. In effect, the variable acts as an altimeterit measures how far from the starting point each entity is. As the result items' generations get further from the starting point, this variable's value increases, and as they get closer to the starting point, the variable's value decreases. When the variable's value is 1, the procedure is back at the starting point; when the value reaches 0, the procedure has processed all records related to the starting point, and the loop construct is finished.
Now we enter the loop. Callout H provides the SELECT statement to check for the existence of records that are awaiting processing. If no records are found, the @level variable is decremented by one (Callout I). If records exist, execution continues with Callout J, at which point the code extracts only the top record's Item value and sets the Processed column value to 1 (popping the record off the stack). The INSERT statement in Callout K adds into the #Map table all records the Item is related to and sets the Processed column to 0, pushing the records onto the stack. Note that the procedure sets the Generation to the current generation's value plus 1 because these records are separated from the starting point by a generation, so for the first time through the loop, the generation would be incremented to the value 2.
You can determine whether records were inserted into #Map by querying the global variable @@rowcount, which returns the number of records affected by the last statement executed. If this value is greater than 0, records were inserted, so you need to increment the @level variable so that the next iteration through the WHILE loop will process the subsequent generation. The execution follows a Depth First Search (a means of traversing data in which you exhaust depth before breadth) approach to processing. Processing will continue in this way until the procedure has exhausted all records in the #Map table, at which point the @level variable's value becomes 0. This value causes the loop to terminate and the procedure returns the contents of the #Map table through Callout M. Figure 6 shows the final output of running this procedure.
Advanced Maps
The T-SQL code in Listing 4 works well with data in which no circular relationships exist. If Person A is related to Person B by a "blood relationship" and Person B is related to Person C by a "friend relationship" and C is related directly to A, you have a circular loop in which A is related to C, who is related to A. Maps are structures that contain these circular loops; maps are nonlinear, nonhierarchical points connected by bidirectional linesthe very essence of multifaceted relationships. So although the stored procedure code in Listing 4 works when only one path exists between two points, it fails when it encounters a loop formation such as a multifaceted relationship. The code will enter into an infinite loop as it tries to process nodes that have children that are in their own ancestry. For example, suppose you add to the EntityRelationship table the two records that Figure 7 shows. The records establish a relationship between North America and Canada as parent and child and between the United States and Canada as trading countries. After you insert these records, rerun the stored procedure. The stored procedure never returns because it has entered an infinite loop; if you let the stored procedure run indefinitely, it will raise an error when the #Map table has exhausted database capacity.
The solution to this infinite loop asserts practicality over elegance. You need an indicator to tell the procedure that it has encountered a loopin other words, a point that has already been processed. In some implementations, the indicator is part of the data itself. For example, some implementations contain special columns that indicate a loop if the value is set to 1; however, such an implementation is inadvisable because it renders the entire solution data-sensitive. A data-sensitive solution requires that the person who writes the code to traverse the map have complete control over the data generation. Getting this control is unlikely, especially when you're trying to integrate with a third-party data provider. We encountered this type of situation when working on a project in which a law-enforcement agency acquired $10 million worth of proprietary data from the local electric utility company. The utility company wasn't about to change millions of recordsso we needed to write a stored procedure that would not go in circles ad infinitum.
The ideal solution is to have the procedure keep a list of all points processed, then before processing any point, verify that it hasn't already encountered the point. If the procedure has processed the point, it can simply skip that point. Because the #Map table is a list of points that are either pending processing (Processed column = 0) or already processed (Processed column = 1), the solution is simple. Inserting the following T-SQL clause into Listing 4 at Callout L will prevent the insertion of entities that already exist in the #Map table.
AND eteIDEntity2 NOT IN (SELECT relatedToItem FROM #Map)
Now if you rerun the stored procedure, you will find that it no longer encounters an infinite loop, but returns the resultset correctly. Congratulations! You are now in a position to effectively manipulate complex data involving multifaceted relationships. Your database model can now support complex organization charts, genealogies, geographical data, and any other relationships by which two entities can be related in more than one way. Now be sure to model those simple country, state, and city relationships the right wayno shortcuts!