Going back to the Northwind Traders e-commerce Web site, the query that Listing 1 shows is a likely choice for a Web page that shows a customer all orders he or she has placed. As a rule, any time you can remove a table from a multitable SELECT, you improve the performance of the query. In this case, the ProductID column in the Order Details table lets the query retrieve the product name from the Products table. If you assume that the product name won't change and that the ProductID will never be associated with another product, you could improve the database design by adding a ProductName column to the Order Details table. This addition saves joining the Products table to the query, and as both the Products and Order Details tables grow, that improvement to the design will become even more significant.
The drawback to this improvement is that the Order Details table will become significantly larger than it would under the original design because it now contains a 40-character product name for each item ordered. Given the size of today's hard disks, storage of the additional data isn't a problem; however, the new column will require more data to pass through the disk's I/O buffers, which might lead to a decrease in the total throughput. In addition, each row of the Order Details table will require more RAM in the data cache, which means something else might get squeezed out. This situation is a good example of the balancing act you have to do when you try to optimize performance.
As a general rule, strive for full normalization in your initial design because normalization eliminates opportunities to add redundant or inconsistent data to the database. When you have your design completely worked out, look for ways to balance normalization and performance. Good places to look are queries with multitable joins, aggregates, GROUP BY clauses, correlated subqueries, and foreign key references. Each of these queries can be time-consuming and use a large amount of data cache and create temporary tables in tempdb.
Find the Blockages
The ratio of users to pages or rows and the ratio of the amount of RAM to the size of the database, as we saw earlier, are just two factors that affect data flow. The way applications insert rows into a table can change the flow of data into and out of the table as well. For example, SQL Server will place rows that are inserted into a table that doesn't have a clustered index in the last page allocated to the table. So, locks for INSERT operations will cluster at the end of the table. SQL Server 6.5 has a special mode called Insert Row-level Locking, which essentially gives SQL Server row-level exclusive locks for INSERT operations. SQL Server 2000 and 7.0 have row-level locking for all rows, so both minimize the problems that happen as a result of clustering inserts on the last page. All versions, however, must use a page-level lock when the new row doesn't fit on an existing page. Both the existing page and the newly allocated page will have page locks while SQL Server adjusts their contents. While these exclusive locks remain in place, no operations involving those pages can proceed, including new inserts. This, of course, causes a brief disruption of the data flow.
OLTP applications typically use several tables that are the targets of high levels of INSERT operations, and these tables usually have a large percentage of users attempting to use them simultaneously. If you assume these tables have a small percentage of rows deleted at any time, then a very large percentage of inserted rows will reside on just two pages: the last one in the table and the one allocated to handle the overflow. Add in the fact that the SQL Server storage engine tends to cluster page allocations on one hard disk at a time, and INSERT operations can quickly overwhelm a system. Plus, as the number grows, the time users spend waiting for operations to complete grows as well, usually nonlinearly.
The solution is to make INSERT operations occur as quickly as possible and to spread them out evenly across the table's pages. One benefit of a clustered index is that inserts are scattered throughout the table in index order; therefore, finding the right index order can mean the difference between good and bad performance. For example, if the Orders table used the OrderDate column for the clustered index, new orders would cluster at the end of the table, causing a potential problem. If you use the CustomerID column instead, each customer's orders would cluster in one area, and overlap between customers should be minimal. Furthermore, it's unlikely that a given customer will place two orders concurrently or that two customers with adjacent CustomerIDs will be ordering at the same time.
Of course, the introduction of shared locks by analysis applications further aggravates data-flow problems because the existence of shared locks prevents the imposition of exclusive locks. Fortunately, shared locks tend to be short-lived, and analysis applications don't usually read newly inserted data. OLTP applications that delete rows in heavily used tables, however, will encounter conflicts as analysis applications lock the areas in which new rows are inserted. Where an OLTP application and an OLAP application use the same tables, you need to know which operations your application performs.
Queries that scan large sections of a table, such as those in most analysis applications, are good candidates for improvement. For multitable joins in which searching one of the tables requires a table scan, query performance decreases in proportion to table growth. For multistatement transactions that include a table scan on one of the tables, the length of time SQL Server holds locks increases directly in proportion to table growth. The typical way of eliminating table scans is to create an index, but remember that INSERT and UPDATE operations force changes to every index linked to the table. As the number of indexes grows and as the table grows, performance of change operations decreases. If you have a mixture of transaction processing and reporting applications using the same database, consider copying the data for reports to another table or database because often the queries that generate reports don't benefit from the same indexes that help transactions.
For the implied transactions of simple SELECT queries, shared locks probably won't cause a problem because SQL Server releases each one as soon as it has read the row. User-defined transactions, however, can be a problem because SQL Server doesn't release the locks held by the transaction until it commits the entire transaction. UPDATE statements that affect large sections of a table are a particular problem because SQL Server must hold an exclusive lock until the end of the transaction. SELECT queries that use the HOLDLOCK hint have the same effect. Consequently, user-defined transactions should be as short as possible. This delay is also a good reason not to start a transaction from a client application. If possible, execute all user-defined transactions through a stored procedure; if you must have a client application start a transaction, make sure the application can't delay processing after the transaction starts.
Go with the Flow
As a general rule, applying the concepts of feng shui to database design translates into looking for impediments to data flow. Blockages can arise through congestion in tables that are central to the database's design, through contention for rows, through overuse of the hard disk subsystem, through rotation of pages in the data cache because of queries on large tables, and through inefficient queries that hold locks too long or read through the data too many times—to name a few. Ways to open data flow include copying data for analysis into a separate database from the one used for transactions, denormalizing the design to eliminate excessive joins, splitting tables so that columns that are mostly read are in a different table from those that are mostly changed, increasing the resources available to the server, and placing user-defined transactions in stored procedures.
To put it simply, look for ways to spread the data requests across the widest area possible. For database design, that means being suspicious of diagrams that have a small number of tables linked to all the other tables in the database, especially in OLTP applications. For data storage, it means using a clustered index to spread the data so that you reduce the probability of multiple users wanting adjacent rows or pages. For hardware, it means making sure that available RAM is a large percentage of the database size, that the files are spread across many physical hard disks, and that the hard disk controller has enough capacity to handle the disk I/O. For applications, it means sharing resources such as connections and minimizing use of common resources, such as locks and transactions, as much as possible. For users, it means understanding what they do with the data so that you can adapt the database design to their needs and not to some impractical ideal. By being aware of the natural flows of data through your server and by using common-sense techniques for broadening the paths they take, you'll avoid many common problems long before they exist.
End of Article
Prev. page
1
2
[3]
next page -->