Seek and destroy these normalization anomalies

Modeling data isn't always straightforward. Novices tend to cobble a model together without regard to data modeling's finer points—starting with a good definition of requirements. If the designer has inadequately defined the requirements, the resulting model often works poorly—or not at all. Untangling a badly designed model takes a lot of work. You have to evaluate tables that already exist in a database, first by determining whether the tables are normalized. A typical normalization anomaly that stops a design cold is a multivalued attribute—a condition in which a table's column has been assigned multiple values. And finding multivalued attributes embedded in tables as lists or collections can be tricky. As I use the terms, a list or a collection is a set of values stored like a series of simple attributes in a table. The members of the list or collection are treated as separate columns in the table, each with a different column name. For example, you can have a list or collection of phone numbers—home phone, work phone, work phone 2, and fax number. Let's look at how to detect a multivalued attribute embedded in a table and how to correct the condition.

Before I discuss multivalued attributes, I need to point out that this column's data models include acronyms that you aren't accustomed to seeing in my columns. In my new (to me) CASE package—Sybase's PowerDesigner—the data types are generic: A stands for character (or alpha) data type, VA for variable character, BL for Boolean, I for integer, and DT for datetime. The <M> indicator in the far right of each entity box means a "mandatory" attribute—NOT NULL in SQL Server terms. Don't worry: PowerDesigner data types really do convert to the correct SQL Server data types when you generate a physical data model.

Figure 1 shows a model of a simple inventory system for computer hardware and software that any company might use. An employee probably designed the model in a hurry to track installed-hardware inventory and software licenses. The model comprises the SERVER table, which contains data about each computer server; the SOFTWARE_PKG table, which contains data about software packages; an associative table (INSTALLED_SOFTWARE), which cross-references SERVER and SOFTWARE_PKG and represents the many-to-many (M:N) relationship between the two tables; and the VENDOR table, which contains a listing of the software-package vendors and is a parent to SOFTWARE_PKG. VENDOR sustains a one-to-many (1:M) relationship with SOFTWARE_PKG.

For the most part, this model is relatively well normalized and will work adequately for a while. VENDOR, SOFTWARE_PKG, and INSTALLED_SOFTWARE are all in third normal form (3NF). You shouldn't encounter any update, deletion, or modification anomalies when you manage data in those three tables. (For more information about database normalization, see Jeffrey Bane, "The First Abnormal Form," August 2001, InstantDoc ID 21369; Solutions by Design, "Responsible Denormalization," October 2000, InstantDoc ID 9785; and SQL by Design, "Why You Need Database Normalization," March 1999, InstantDoc ID 4887.) However, the SERVER table presents a problem. It looks somewhat like a checklist that contains all the attributes that you need to describe a server configuration, but the table is in a pre­first normal form (1NF). You read that correctly—the SERVER table isn't even in 1NF because it contains a list and a collection.

As I use the terms, a list or a collection is a set of values that have been forced to act like attributes. For example, in the SERVER table, Disk1Capacity doesn't truly describe a server's attribute; Disk1Capacity describes a server component's value. Also, as long as each server contains no more than four disks, you can record the disk capacities. But as soon as you upgrade one server to a fifth disk, you face the problem of how to record the new disk's capacity. Do you add the capacities of the fourth and fifth disks together and store the result as Disk4Capacity? Or do you modify the SERVER table structure, then add a new column called Disk5Capacity to the table? If you add a new column and if the database is running with ANSI NULLS turned on, a report that generates total disk capacity by summing Disk1Capacity through Disk5Capacity produces a NULL result for any server that doesn't contain a fifth disk.

Lists and collections might be wonderful for programming against, but they are miserable for data management. For example, if you arrange a server's attributes as Figure 1 shows—in a checklist—you can write a query that simply selects each attribute by name, without joining the SERVER table to another table. This form of "flat-file database" has its proponents. Data retrievals are fast (you don't have to join tables), and data organization is simple. The problem is that data management is very difficult. Extending the table architecture is awkward and might cause negative repercussions in associated objects such as reports. (You might have to rewrite every report that accesses the table that's involved in the re-architecture.)

   Prev. page   [1] 2     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

This article is great, but what happen when you don't know the name of the attibutes or they are variable, for example the attributes for people on a marketing database (do you have a car?, what model? Insurance?, credit card?, etc, to name a few), can you help me with that?

Thanks in advanced

Luis Abrego