In November 2001, a SQL Server developer posted a seemingly simple problem to the public newsgroup microsoft.public.sqlserver.programming. Like other SQL Server Most Valuable Professionals (MVPs), I read this newsgroup almost daily, and this post caught my attention. The developer had written a GROUP BY query, but it wasn't working. The problem was a mysterythe query was simple, and the syntax was correct. All the evidence showed that the query should work. But as I and a few other newsgroup participants looked into the problem, we discovered that the database contained redundant information because the design failed to enforce a business rule. This real-world problem turned out to be a textbook example of the importance of understanding normalization, functional dependencies, and anomalies. Theory met practice, and the developer who originally posted the question got a by-the-book solution. The result was a working query that also provided some insight into database design. Let's look at the clues we discovered and see how they helped solve the mystery. If you want to work through the original scenario or understand the newsgroup participants' thought processes, you can read the original newsgroup thread, "GROUP BY Problem," at http://groups.google.com/groups?q=hancox+kass+query.
Case Notes
Suppose you're the developer who posted the question and you work for Everything Made Easy, Inc., an organization that runs workshops about a variety of topics, including pie making, wallpaper hanging, and plumbing. Listing 1's code creates a simplified version of the tables you use to track workshop participants and payments and that inserts some sample data into those tables.
You didn't design the database, but you're responsible for using and maintaining it. When the company receives a workshop registration form, you generate an invoice and insert the invoice details as a new row in the Invoices1 table. Each row represents one invoice and contains a unique invoice ID, the total fee, and the balance due on that invoice after you deduct any prepayment the customer sent with the registration. The Registrants1 table tracks customers, and each row in that table holds a customer's name, the workshop he or she signed up for, and the associated invoice number.
Your mission is to produce a summary report for all workshops. For each workshop, the summary must show the number of invoices, the total fees, and the total unpaid balance. The assignment seems easy, but unfortunately, something isn't working. By using the query that Listing 2 shows, you can easily generate a summary for one workshop at a time. Listing 2's query produces the correct summary for the Pies workshop, as Figure 1 shows. But when you try to use the query that Listing 3 shows to generate all the summaries at once, the query fails.
Following the Clues
Figure 2 shows Listing 3's results. The output accounts for five invoices, but the Invoices1 table contains only four, and the total fees and outstanding balances are too high. Closer inspection reveals that the result is wrong only for the Pies workshop totals. Two questions come to mind. First, why did the query fail to give the correct result? And second, how can you fix the query so that it works? Chances are good that if you can answer the first question, you can write a query that works.
The Invoices1 table contains the data for the counts and sums, and the Registrants1 table relates the invoices to the workshops. The query results are wrong because Registrants1 relates Invoice 2 to the Pies workshop twice. That's because Boban and Margo registered for Pies together on Invoice 2. The database needs to keep track of both Boban and Margo, but does it need to store both the InvoiceID value (2) and the workshop name (Pies) twice? In other words, is any of the information in Registrants1 redundant?
Just the Facts
What is redundant information? Repeated data isn't necessarily redundant information. If I have two friends named Max, my address book repeats the data "Max", but the data isn't redundant. Information is redundant when a fact is recorded twice, directly or indirectly. The name Max is duplicated in my address book, but the duplicates represent separate facts. However, the single entry for my friend Paxton, in New York, NY 10003, records the fact that Paxton lives in New York, NY, twice: once directly and a second time indirectly. The entry contains the city and state where Paxton lives, and it contains his ZIP code; ZIP code 10003 is always in New York, NY. Because a US ZIP code always determines the city and state, the city and state information in my address book is redundant, assuming no missing ZIP codes.
What does this have to do with Everything Made Easy's database? To determine whether the Registrants1 table contains redundant information, you need to know whether the table records any facts more than once. Each row in Registrants1 contains a participant's name, a workshop name, and an invoice number. Invoices 2 and 4 are both listed with the Pies workshop, so you can't know an invoice number just by knowing the workshop namethe invoice number isn't redundant. But is the workshop name redundant? Can you determine the workshop name if you know the invoice number?