Does your candidate for primary key pass the four tests?

Readers frequently ask me whether they should use a natural value or a surrogate identifier for a table's primary key. A surrogate key is an artificially produced value, most often a system-managed, incrementing counter whose values can range from 1 to n, where n represents a table's maximum number of rows. In SQL Server, you create a surrogate key by assigning an identity property to a column that has a number data type. A natural key is a naturally occurring descriptor of the data and one of a table's attributes that has no duplicate values. When you use a natural key as a table's primary key, each of the table's rows is uniquely identified. I addressed this concept in SQL by Design, "How to Choose a Primary Key," April 1999, InstantDoc ID 5113, but sometimes a topic is important enough to revisit.

Data modelers (for this discussion, I include anyone who has designed tables for a database) are divided on this question: Some modelers swear by the surrogate key; others would die before they used anything but a natural key. A search of the literature about data modeling and database design supports neither side except in the data warehouse arena, in which a surrogate key is the only choice for both dimension and fact tables. That lack of consensus leaves those of us in the transactional database world to wrestle with the question of which key makes a better primary key—a natural key or a surrogate key. The ultimate answer is a resounding "It depends." Whether you choose a surrogate key or a natural key depends on your data, your database platform, and the group that wields more power in your organization—the DBA group or the applications development team. Let's look at the way each of these factors can influence the choice of a primary key.

The Data
To help you decide whether a potential primary key is appropriate, I recommend that you ask the questions that point to a primary key's four criteria—Is the primary key unique? Does it apply to all rows? Is it minimal? Is it stable over time? If the answer to any of these questions is no, your candidate fails to meet the four criteria and you need to look for an alternative. Each user table should have a designated primary key, which you create in a CREATE TABLE statement or in an ALTER TABLE statement. Although SQL Server assigns a row identifier (RID) to each record in a file, users and user programs can't use an RID as a uniqueidentifier, because you can't guarantee that an RID will remain constant over time. Therefore, an RID doesn't meet the criterion that a primary key's value must never change. An RID is composed of a file number, a page number, and a page's row number. As the position of each record shifts in the file, the record's associated RID changes. The primary key you choose must have a constant and unchanging set of values. When you assign a primary key value to a record, the value must not change for the life of that record and you can't reuse the value—even after the record is deleted from the table. In addition to these logical reasons for not using an RID as a primary key, you can't access it through any supported programming interface.

We know that a primary key needs to be a uniqueidentifier for each row in a table, but basically, the primary key has to apply to all rows. For example, designating the ISBN as the uniqueidentifier for a table of published titles wouldn't make sense if you're collecting both new and old books and the old books were published before the ISBN system was established. Making sure that the primary key applies to all rows becomes especially important when you use a concatenated primary key—a key that's composed of two or more columns.

Let's illustrate the need to make sure that the primary key applies to all rows by looking at a retail store example that uses the customer ID and date of purchase as a Purchase table's primary key. Figure 1, page 54, shows this primary key in an entity relationship diagram (ERD). As long as you remember to enter sales transactions during the day the sales were made, your database contains accurate information. But the first time you fail to enter transactions on the day sales were made (maybe you wanted to leave the store early on the weekend, thinking that you would catch up and enter the weekend sales on Monday), you have to guess about the date a sale occurred. Now you have almost-accurate data in your database. Although "close enough" is acceptable data integrity for some organizations, it clearly falls short for others. If, as a result of choosing your primary key unwisely, you discover that you're missing data for part of the primary key, you might have to choose one of the three following courses of action:

  • create dummy data
  • enter almost-accurate data
  • exclude the event from the database

Not one of these three options is a suitable alternative to storing accurate, timely data in your database.

   Prev. page   [1] 2 3     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

Your font size is too small. My eyesight is good -- corrected to 20/20 with glasses, but I still have to squint.

Also, using IE's "View->Text size->Larger" menu item has no affect. You ignore this value.

Jon Kyle

Very well stated. easy to follow and understand the explanations.

molly arriaga

It was a great help, very descriptive, good examples. Thanks

Aslam Pervez

Almost make no sense to real developer. The whole article explained some thing every developer knows, and not touch real diffrence between Surrogate Key and Natural Key, pros and cons of each,etc.

Anonymous User

'Almost make no sense to real developer. The whole article explained some thing every developer knows, and not touch real diffrence between Surrogate Key and Natural Key, pros and cons of each,etc. '

'Anonymous User- December 03, 2004 '

Did you (Anonymous User) even read the article or did you just skim over the first paragraph or 2? There's no way you can consider yourself a 'real developer' if you read the entire article and still make the statement that you did. Michelle (the articles author) was very clear on what the differences are between the 2 types of keys as well as provide real world examples/descriptions of each. You would of course have to read the whole article to get those details and so if you did not, which I believe is the case, then you should not be criticizing the author in the way you did.

It's also apparent that not all developers 'knows' this material because of they did there would not be numerous postings on the internet groups/forumns about this issue.

Bottom line, if you are going to be critcial then make sure you've reveiwed in full what you are criticizing else you will look foolish to others reading your comments.

EdCarden_SQL

Article Rating 5 out of 5