• subscribe
January 24, 2001 12:00 AM

Beyond Hierarchies

SQL Server Pro
InstantDoc ID #16268
Downloads
16268.zip

Multifaceted entity relationships: graphs and maps

Some of the most common, and often flawed, models in relational database management systems (RDBMSs) are dependent on the structures (such as trees, graphs, and maps) necessary for storing entity-to-entity relationships. You need self-referencing tables and their respective parsing mechanisms to model generations and genealogies (trees), organization charts (graphs), and geographies (maps). Using a geographic database as an example, this article explores trees, graphs, and maps and looks at how to extract the data from these structures.

Self-Referencing Tables
Database modelers are often guilty of taking a few shortcuts when performing the onerous task of saving geographical taxonomies in a database. We take these shortcuts and fulfill the requirements, although we know that the requirements haven't taken into account the inevitable scope expansion that is typical of any growing organization. Almost every database in production must have the facilities to store hierarchical data. This article's example focuses on geographic data, but hierarchical data is prevalent in many of today's models. The example database requirements specify that the database must support state and city information. The implementation of only two generations of data appears simple enough—a one-to-many (1:M) relationship will suffice. You know that eventually you'll have to redesign this database relationship; when the company starts selling its product outside the United States, the new requirement will specify that the database needs more than a region-subregion relationship. If you're fortunate, you'll only need to add a third table for country information. If not, you'll need to totally redesign this database model, which takes time and money.

Further investigation of the data in geographic information systems (GIS) map software reveals that countries have complex and disparate organization structures. Some countries are organized by state, others by region or province. Using small areas of the United States and South Africa as examples, you'll find the hierarchies that Figure 1 shows. The disparity in structure between the two countries in Figure 1 shows that you can't model generic organization charts by using a simple 1:M relationship because you can't predict the depth of the hierarchy. In this example, you can see that countries don't abide by an organizational standard. The United States is segmented into states and districts, both of which contain cities. Other countries might be divided into provinces or regions, or perhaps not be subdivided at all. Other hierarchies such as genealogies and corporate organization charts have the same problem. When confronted with disparate organizational structures, many developers opt to use self-referencing tables. One column in every self-referencing table has a foreign-key relationship with another column in the same table. Listing 1 creates a table called Entity, in which entIDEntityParent is a self-referencing column; you can use this table to store the geographic data for our example.

Why did we prefix every column with the characters "ent"? By assigning each table a unique three-character prefix and prefixing every column with the table prefix, you eliminate the need to alias all your joins when you begin creating dynamically generated SQL statements. (You might use dynamically generated SQL to search for a certain piece of data based on user-specified input. Constructing the SQL dynamically lets you omit superfluous joins and thereby increase query performance.) Making all the column names unique could save you hundreds of hours of creating these customized SQL generators. Also, naming the table Entity and not GeographicLocation or something similar provides for maximum reuse because naming the table Entity lets you use this table to store any kind of entity. You can use this table as a template for a domain value table in which you can store any kind of lookup data. (Of course, you might need to add a few more columns as you internationalize—e.g., for language IDs.)

The first column in the Entity table is entIDEntity; this column is a unique identifier that acts as the primary key. Note that we didn't use an integer data type for the IDENTITY column. We recommend that you use the uniqueidentifier data type (new in SQL Server 7.0), provided you take into consideration the 16 bytes of space a globally unique ID (GUID) requires for storage, indexes, and joins. SQL Server developers tend to avoid GUIDs, using integers instead for portability reasons. But using GUIDs will yield benefits if you decide to configure replication. Regarding the use of unique identifiers in replication, SQL Server Books Online (BOL) says: "If you are using merge replication, or if you are using snapshot replication or transactional replication with queued updating and the table that is being replicated does not have a uniqueidentifier column, SQL Server 2000 will add one when you create a publication."

The table's second column, entName, is the field you use to store the country, city, or region name—the friendly name that you present to your users. You can use the third column, entType, to identify the type of entity—e.g., country, state, or city. Finally, the entIDEntityParent column identifies each entity's immediate parent. This self-referencing column works well for the trees that Figure 1 shows. For every entity, you can store the uniqueidentifier of its parent in this column.

Nodes with Multiple Parents (Graphs)
What happens when your database needs to support an organizational structure in which entities can have multiple parents? The table from Listing 1 won't support multiple parents because the relationship is stored with the entity definition. Because you can have only one instance of an entity definition, you're inherently limited to only one relationship for the entity. In a typical hierarchy model or tree, the table in Listing 1 would be fully functional because nodes in a tree have only one parent. (For more information about hierarchy models, see Itzik Ben-Gan, "Maintaining Hierarchies," July 2000.) However, with multiple parents, this structure is no longer a tree; it's a graph, typical of what you might find in genealogy trees or geographical structures. Why would you need to support multiple parents in a GIS environment? Let's say you want to provide an Internet city-search portal where users can search for cities by state, by county, or by area code. But some cities have multiple parents: A user searching by area code would find San Francisco as a city in two area codes, 650 and 415, as Figure 2 shows. So, San Francisco has more than one parent. This example might seem trivial; however, the concept of an entity having more than one parent is crucial in many other situations. For example, in genealogies, humans have two parents—sometimes more if you include stepparents.

Entities that have multiple parents can be problematic for the geographic database model because the model provides for only one column in which to store the entity's parent. Up to this point, you were dealing with entities that have only one parent (a tree), but now you have entities with multiple parents (a graph). Fortunately, a solution is close at hand: Just remove the self-referencing column from the entity table. Listing 2 removes that column and creates a new table called EntityRelationship. Removing the self-referencing column abstracts each entity from its parent-child relationship, thereby letting one entity be involved in more than one parent-child relationship. In other words, some hierarchies involve many-to-many (M:N) relationships, and this EntityRelationship table is an intersection table that resolves the M:N relationship.

Let's consider the alternatives to creating this intersection table. Merely duplicating the entity with a new ID would probably work when the entity has no related records. However, if you decided to duplicate the entity, you would have to manually maintain this duplication of information because the duplication violates the rules of normalization. So, when is denormalization appropriate? The only reason to denormalize data structures is to increase performance. The process of denormalizing can speed up data access but requires increased programming to maintain the denormalized data. An example of denormalizing is storing account balance information along with a customer record. Although you could compute the balance by calculating all transactions, doing so is time-consuming, so DBAs often save the account balance with the customer record. This denormalization results in increased performance—and in more programming. (See Michelle A. Poolet, Solutions by Design, "Responsible Denormalization," October 2000, for denormalization suggestions.) But in a geographic database, and for that matter most databases irrespective of industry, duplicating the entity isn't feasible because of the potentially large amount of information associated with each entity record.

Using our new model, you define an entity in the Entity table, then you define all the entity's relationships in the EntityRelationship table. Consequently, an entity can now have multiple parents. However, the column combination of eteIDEntityChild and eteIDEntityParent becomes the primary key, thus enforcing the restriction that any entity can have only one occurrence of the same parent.

Multifaceted Relationships (Maps)
Just when you think you're ready to deploy your database, you receive a new requirement—to create sibling relationships among entities. Your company wants the Internet city-search portal to track climatic similarities among cities so that when a user searches for a city, the program also shows other cities that have similar climates. In Figure 3, the climatic and scenic relationship between Cape Town and San Francisco is not a parent-child relationship like the Figure 1 country scenario. Thus, the existing database structure no longer works because the EntityRelationship table can't differentiate between a parent-child relationship and a sibling-sibling relationship. To solve this problem, you can use the code in Listing 3, page 58, to modify the EntityRelationship table.

The code renames the fields from eteIDEntityChild and eteIDEntityParent to eteIDEntity1 and eteIDEntity2. The new field called eteRelationship abstracts a parent-child relationship to a generic relationship; now the database design can support many different types of relationships. Because eteIDEntity1 and eteIDEntity2 can have numerous distinct relationships with one another, we added the eteRelationship field to the primary key, as Figure 4 shows, so that you can relate two entities to each other in more than one way. For example, you can relate Cape Town to San Francisco by climate, by landscape, and perhaps even by size. When innumerable types of relationships exist between entities, you have multifaceted relationships. Although these relationships can be difficult to maintain, it is precisely these relationships, deciphered from sophisticated data-mining processes, that let Internet sites upsell and cross-sell. Upselling a product is easy—you just define an Upsell relationship between a particular product and the products that are more expensive versions of it. Cross-selling is just as easy—you define a Cross-Sell relationship between the product and its associated products. Within the Cross-Sell relationship, you could even define Dependency or Accessory relationships (remember, batteries and accessories are hardly ever included).

Several years ago, marketers found a relationship between sales of beer and sales of diapers. The two items had similar purchase times, which at first seemed coincidental. However, further examination revealed that parents would pick up diapers after work and would take the opportunity to grab some beer (or vice versa). So the marketers moved premium beer next to the diapers—and in so doing, they increased premium beer sales.

The future of data mining lies in deriving these relationships and exposing them to smart applications, which can use the relationships to increase sales. Anyone who's visited Amazon.com has probably experienced cross-selling. Personalization, an area based on exposing related data, enables such cross-selling. (For more details about data mining, see Barry de Ville, "Data Mining in SQL Server 2000," January 2001.) Now with the data model we've established, you can capture these relationships by inserting the appropriate relationship records into the EntityRelationship table. Geographic regions, for example, can have many relationships with each other, such as bordering countries/states/regions, demographically similar, or populationally similar.

Let's refer back to the EntityRelationship table now that you see how to identify different types of relationships. You might be concerned about the data type of the column that stores the relationship indicator. In a production system, this column should be a uniqueidentifier data type to increase the performance in accessing this table. But it's OK to cheat a little, especially for readability purposes while you're getting things up and running. For performance reasons, you just need to change this data type when you go into production.



ARTICLE TOOLS

Comments
    There are no comments to display. Be the first one!
You must log on before posting a comment.

Are you a new visitor? Register Here