Does your candidate for primary key pass the four tests?

Readers frequently ask me whether they should use a natural value or a surrogate identifier for a table's primary key. A surrogate key is an artificially produced value, most often a system-managed, incrementing counter whose values can range from 1 to n, where n represents a table's maximum number of rows. In SQL Server, you create a surrogate key by assigning an identity property to a column that has a number data type. A natural key is a naturally occurring descriptor of the data and one of a table's attributes that has no duplicate values. When you use a natural key as a table's primary key, each of the table's rows is uniquely identified. I addressed this concept in SQL by Design, "How to Choose a Primary Key," April 1999, InstantDoc ID 5113, but sometimes a topic is important enough to revisit.

Data modelers (for this discussion, I include anyone who has designed tables for a database) are divided on this question: Some modelers swear by the surrogate key; others would die before they used anything but a natural key. A search of the literature about data modeling and database design supports neither side except in the data warehouse arena, in which a surrogate key is the only choice for both dimension and fact tables. That lack of consensus leaves those of us in the transactional database world to wrestle with the question of which key makes a better primary key—a natural key or a surrogate key. The ultimate answer is a resounding "It depends." Whether you choose a surrogate key or a natural key depends on your data, your database platform, and the group that wields more power in your organization—the DBA group or the applications development team. Let's look at the way each of these factors can influence the choice of a primary key.

The Data
To help you decide whether a potential primary key is appropriate, I recommend that you ask the questions that point to a primary key's four criteria—Is the primary key unique? Does it apply to all rows? Is it minimal? Is it stable over time? If the answer to any of these questions is no, your candidate fails to meet the four criteria and you need to look for an alternative. Each user table should have a designated primary key, which you create in a CREATE TABLE statement or in an ALTER TABLE statement. Although SQL Server assigns a row identifier (RID) to each record in a file, users and user programs can't use an RID as a uniqueidentifier, because you can't guarantee that an RID will remain constant over time. Therefore, an RID doesn't meet the criterion that a primary key's value must never change. An RID is composed of a file number, a page number, and a page's row number. As the position of each record shifts in the file, the record's associated RID changes. The primary key you choose must have a constant and unchanging set of values. When you assign a primary key value to a record, the value must not change for the life of that record and you can't reuse the value—even after the record is deleted from the table. In addition to these logical reasons for not using an RID as a primary key, you can't access it through any supported programming interface.

We know that a primary key needs to be a uniqueidentifier for each row in a table, but basically, the primary key has to apply to all rows. For example, designating the ISBN as the uniqueidentifier for a table of published titles wouldn't make sense if you're collecting both new and old books and the old books were published before the ISBN system was established. Making sure that the primary key applies to all rows becomes especially important when you use a concatenated primary key—a key that's composed of two or more columns.

Let's illustrate the need to make sure that the primary key applies to all rows by looking at a retail store example that uses the customer ID and date of purchase as a Purchase table's primary key. Figure 1, page 54, shows this primary key in an entity relationship diagram (ERD). As long as you remember to enter sales transactions during the day the sales were made, your database contains accurate information. But the first time you fail to enter transactions on the day sales were made (maybe you wanted to leave the store early on the weekend, thinking that you would catch up and enter the weekend sales on Monday), you have to guess about the date a sale occurred. Now you have almost-accurate data in your database. Although "close enough" is acceptable data integrity for some organizations, it clearly falls short for others. If, as a result of choosing your primary key unwisely, you discover that you're missing data for part of the primary key, you might have to choose one of the three following courses of action:

  • create dummy data
  • enter almost-accurate data
  • exclude the event from the database

Not one of these three options is a suitable alternative to storing accurate, timely data in your database.

   Prev. page   [1] 2 3     next page
 
 

ADS BY GOOGLE