The field Ready is also a little ambiguous. You have to know that a wine is generally not drinkable when it's first made. You have to wait a period of time, from 6 months for the Beaujolais Nouveau to many years for the finest French reds. So, the field named Ready is another four-digit year that denotes when the wine will be ready to drink.
So far, all the fields seem straightforward. I have to depend on the client's memory and verbal explanations for the definitions of the fields. Such a situation isn't ideal, but it's the most common situation. An analysis will be more precise if the client has a metadata repository and has kept it updated. The metadata repository can act as a dictionary for the entire databaseit can contain definitions and properties of each of the fields. As part of my analysis, I usually create a glossary that can later be converted or input to a metadata repository tool, if the client wants to do so.
Look for dirty data. I always visually inspect samples of a client's data to determine whether the data is "dirty"that is, whether the client has incomplete or incorrect data in the database and, if so, where it is. Dirty data occurs when tables don't have primary keys or uniquely indexed natural keys, so users can easily insert duplicate rows. Dirty data also occurs when dependent one-to-many (1:M) relationships aren't enforced, allowing rows in the child table that don't have an associated row in the parent table. Sometimes I use queries (specifically outer joins) to determine whether data is dirty; then I log the results to show to the client.
The Cellar database's tables are pretty clean. No dates exist in the TASTEGROUP table or the TASTEWINE table that don't appear in the TASTING table. Each Visitor ID (VID) in the TASTEGROUP table is already in the VISITOR table, and every Bin number in TASTEWINE is already in the CELLAR table. All the YearProduced and Ready values in the CELLAR table fall within a valid range, from 1900 to 2005. Each phone number in the VISITOR table is at least 12 characters long, which implies that the area code and user-friendly formatting (nnn-nnn-nnnn) have been included for each record in which the phone number isn't NULL. I routinely run these kinds of queries to substantiate my suspicions if I believe that a lack of entity or referential integrity controls is causing the loss of data integrity. (For more information about database integrities, see "The Four Integrities.")
Locate redundancy across tables. You have data redundancy across tables when the same non-key data is in two or more tables in the database. For example, if I found a field called Name instead of VID in the TASTEGROUP table, I'd assume that visitor names were redundant in this database. A follow-up visual inspection of the data would confirm my suspicions. If a field is non-keythat is, if a field isn't part of a primary key or part of a foreign keythen the field should appear only one time and in only one table.
If you find what you think is a redundant data condition across tables, you need to investigate more closely. In Figure 3, a datetime type of field is present in three tables: TASTING, TASTEGROUP, and TASTEWINE. All three occurrences of the date field denote when a tasting happened. You might mistake this repetition for cross-table redundancy. However, tdate in the TASTING table is a primary key. Tgdate in TASTEGROUP and twdate in TASTEWINE are part of the primary key for each of their respective tables. Tgdate and twdate are also foreign keys and are part of enforced relationships.
Find transitive dependencies within tables. A transitive dependency is a situation in which one non-key field identifies or determines the value of another non-key field in the same table. You can't find a transitive dependency unless you understand each of the fields, what they mean, and how they relate to one another. If you don't have a metadata repository available, you have to depend on the client's memory and understanding of the data to make sense of it.
In the CELLAR table, which is already on my change list, I believe that I've found a transitive dependency. Comments that describe how a wine tastes aren't dependent on where the wine is stored. When a wine will be ready to drink is also not dependent on where the wine is stored. A combination of Wine plus Producer plus YearProduced determines when a wine will be ready (Ready). Many things determine how a wine tastes: the type of wine, who produced it, and when it was produced. The wine taster records his evaluation in Comments. Thus, Wine, Producer, and YearProduced combine to define the values of Ready and Comments. This combination is the transitive dependency in the CELLAR table. When you find a transitive dependency, you need to break the dependent fields into a separate table and relate the two tables to each other. I add this suggestion to my change list.
Identify less-than-full functional dependency. Full functional dependency means that each non-key field in a table is totally dependent on the entire primary key for its meaning. If the primary key is multicolumn, then each non-key field must be dependent on the entire primary key, not just one part of the primary key. Any table that contains fields that have less-than-full functional dependency on the primary key is in 1NF. Figure 1 is an example of a 1NF table with an embedded array (the checklist). This table doesn't have a primary key. You'd decompose this table into second normal form (2NF), then 3NF, following the process outlined in "The Four Integrities."
Prev. page
1
2
[3]
4
next page