Normalization best practices that can give an existing database new life
When SQL Server Magazine asked me to write an article about how to normalize a database, I thought writing the article would be easy. However, I soon realized that the task of explaining all the things that I do when I evaluate an existing database was going to be harder than I thought. Many of the steps in my process are automatic for mehow could I put them into a logical order that would make sense to someone else?
Typically, a client doesn't ask me specifically to evaluate a database's level of normalization. Clients ask me to perform a database performance and integrity audit, which can include examining an apparent loss of data integrity. In a performance and integrity audit, I review and analyze many details, and reviewing the database architecture is only one part of the process.
When I analyze database performance, tuning, and integrity, I first look for obvious problems, resolve them, then investigate more minute details. I look across the three layers of a database environment: the physical layer, the external layer, and the logical layer. The physical layer consists of hardware and networking (i.e., the physical components that support the database server). The external layer consists of the applications that are running against the database (i.e., batch jobs, ad hoc queries, and Web accesses). The logical layer consists of the user schema (i.e., the user-table architecture) and database-configuration parameters.
I start my audit by reviewing the physical layerthe hardware and network componentsto eliminate operational bottlenecks, if possible, by implementing simple fixes such as increasing memory, hard disk, or network capacity. While I'm examining the physical layer, I also gather information about SQL Server file groups to learn how the database files are arranged on disk. This information shows me what the database's physical environment looks like. From the client, I learn the purpose of the database. I also get the client's description of the apparent loss of integrity. Are queries returning fewer records in the result sets than expected? More records than expected? Different values than expected? Or is the problem something such as corrupted data in the tables after a database operation?
Next, I review the external layer, which includes the applications running against the database, to make sure that resource contention or locking conflicts aren't causing the performance and integrity problems. I use this sequence because rearranging the job mix or rewriting just one application so that it runs more efficiently is easier than restructuring the underlying files. I gain insight into the database's primary purpose by reviewing the programs that run against the database. I learn whether the database captures transactions (as it would in e-commerce), acts as a data warehouse, performs trend analysis, or is a hybrid that both captures transactions and produces information for decision support.
Finally, I look at the logical layerthe user-table schema and the configuration parameters. Based on the size and purpose of the database, I check the configuration parameters (e.g., buffer sizes, number of worker threads, tempdb, cache hit ratio) to be sure that nothing is grossly out of alignment. Then, I need to analyze only the user-table schema to determine whether the performance or integrity problems are associated with improper levels of data normalization.
Diagnosing Normalization Problems
Either undernormalizing or overnormalizing user tables can cause database performance and integrity problems. An undernormalized database typically has one table that contains many columns and data about many topics. Figure 1 is an example of such a table in a financial-analysis database. Each record in the table contains information about a financial product that an analyst is working on, the category into which the product falls, the contact person for the product, and a checklist for managing each product. Four subjects are stored in this table; this kind of overcrowding makes the data difficult to manage. The rule of thumb to follow is one table, one topic.
Conversely, an overnormalized database contains many tables, each of which has only a few columns. So a query that generates even a simple report must join many tables. Figure 2 is an extreme example of an overnormalized lookup table for regions. In this table, the regions don't have namesthey have unique numbers. The table in Figure 2 has a region identifier, RegionID, which is also a number data type. Any query that includes a reference to region has to join to this Region table on the RegionID, even though RegionID and Region will likely have the same values.
Somewhere between these two extreme examples is a middle ground that you need to find for your production database environment. If you undernormalize, you'll encounter data-integrity problems when you try to insert, update, or delete records. An insertion anomaly happens when you don't know all the values of a multicolumn primary key, so you can't insert a record into the table. An update anomaly happens when you change data and you have to change more than one record in the table because the data is repeated many times in the same table. A deletion anomaly happens when you remove a record from a table and you lose unassociated or loosely associated facts in the process. For more information about modification anomalies, see SQL by Design, "The Four Integrities," November 1999, InstantDoc ID 6200.
Prev. page  
[1]
2
3
4
next page