DOWNLOAD THE CODE:
Download the Code 15961.zip

A little redundancy can boost Web speed and your bottom line

When it comes to database optimization for e-commerce, maximum performance for the Web user takes precedence over all else. Probably in no other type of dynamic application does SELECT statement performance take such priority over inserts and updates. (I say dynamic because an OLAP system is a totally different situation, in which data is usually updated monthly, quarterly, or even annually.) Contrast this type of dynamic Web application with a typical online transaction processing (OLTP) application you might run inhouse—for example, an order-processing system. You wouldn't want your order-entry personnel waiting around for extended periods for record inserts and updates, so in most OLTP applications, you need to take all types of Data Manipulation Language (DML) performance into account. But in online Web-application processing, at least 90 percent of Web customers' activity consists of browsing, not buying, which means your application needs to handle a high volume of SELECT statements (versus other data-manipulation operations). Consider your own online shopping habits. How much browsing and comparing do you perform before you purchase something (if you purchase anything at all)? And many customers browse online for information before finally going to a brick-and-mortar store to make the purchase.

In this read-intensive atmosphere, data-retrieval optimization is essential. Several obvious methods of improving SELECT performance become important: proper indexing, investigating query plans to ensure efficient index utilization, and using stored procedures for plan caching. Correctly indexing tables will probably yield the biggest performance gain; I also advise you to thoroughly evaluate for efficiency the queries that your application uses. Your database server would probably benefit from an additional 4GB of RAM, but many of us don't have that kind of budget. In addition to indexing and as an alternative to expensive hardware upgrades, one excellent and widely overlooked method that you can use to improve query performance is selective denormalization. In this process, you introduce a small amount of controlled data redundancy into your database. In certain situations, this method can yield dramatic performance increases.

The average e-commerce site has certain database-linked Web pages that most of its Web users hit—for example, a search results page that displays items, descriptions, and pricing information that match a user's search criteria. Query performance on such a heavily hit page needs to be as fast as possible because customers rarely stick around to buy products from a slow site. And because broadband Internet is becoming more widely available, performance will become even more crucial. You can hide a poorly performing Web page from a user who has a 28.8Kbps modem, but that same user might have a Digital Subscriber Line (DSL) next month.

Let's say you have an overtaxed page that receives 10,000 hits every hour and uses a query that executes a five-table join. You can significantly increase performance by turning that query into a simple single-table SELECT statement through some denormalization techniques, which I demonstrate in the following example. The example, which uses a typical e-commerce database schema, adds a new column and runs a bulk update. Any time you introduce redundant data into a database, you need to ensure that changes you make to the original field will propagate to the duplicate field; therefore, the example creates four triggers.

Two characteristics of e-commerce applications let you implement this selective redundancy. First, the Web user issues only SELECT queries against the product information tables. Any updates or inserts that users make will be in other customer information- or shopping cart-related tables. Second, inserts and updates to the product-related tables are performed infrequently and during calculated, low-volume periods.

If you tried this solution with a traditional inhouse OLTP application, table writes would be much more frequent and performance would suffer. So, before you fire up your design tools and start adding redundant columns to your database, make sure the gain in SELECT performance will outweigh the hit in INSERT and UPDATE performance. Web environments almost always benefit from this performance-tuning technique, so let's proceed.

Setting Up
This example involves a fictitious company that sells computers and equipment through mail order and over the phone but that now wants to sell its products online. To create and populate the tables for this example, download and run the script CreateTablesAndPopulate.sql from http://www.sqlmag.com. Figure 1 contains the database schema I use in this example.

This design might seem odd—for example, what's the difference between a Product and an Item? Often, when companies venture into e-commerce, they don't sell everything online that they sell through other methods, or they sell products in different combinations. This company sells individual products such as hard drives and mounting kits, but on the Web, the company sells those items only as a set. The Products table in the example schema represents legacy data. The Items table contains the most atomic unit of merchandise the company will sell online. The Products_Items table lets you associate Items for sale over the Internet with the existing units in the Products table. The schema also includes a table for Packages, which are groupings of Items that the company sells at a 10 percent discount. For example, a computer is a Package made up of Items such as a hard drive, RAM, a processor, and so on. This table structure lets you break down a Package in a customer's cart into its component Products for assembly.

In this scenario, the two most heavily hit pages contain lists of Items and Packages for sale. The major problem with this table design is the location of the price field. Getting information about an Item and its price requires a three-table join, such as the one in Listing 1's stored procedure. Obtaining Package information and pricing requires the five-table join in Listing 2.

These stored procedures would be simpler and run faster if the pricing information weren't five tables away in the Products table. To improve performance, you could add a price column to either the Items table or the Packages table. The company's research shows that most shoppers concentrate their browsing on Packages, not Items, so the best idea is to include the price column in the Packages table. Look at the GetPackageInfo stored procedure in Listing 2 again and imagine the difference if the Packages table included a PackagePrice column. GetPackageInfo would change to:

SELECT * FROM Packages

That's quite a savings in processor time when you consider how many users hit the Package page every hour: 10,000 single-table SELECTs incur much less overhead than 10,000 mega joins.

To begin modifications, create a nullable column of the money data type in the Packages table:

ALTER TABLE Packages ADD PackagePrice money NULL

With that column in place, you need a way to enter each row's pricing to reflect the sum of its Items, minus a 10 percent package discount. To accomplish this task, you can create a script such as the one in Listing 3 to update every row in the Packages table with the current prices.

Note that this script looks similar to the original GetPackageInfo stored procedure, with the addition of a temporary table. The temporary table #PriceHolder is populated with the PackageId of each record in the Packages table and the summed price of each record (minus 10 percent), which you obtain by joining the remaining tables. The PackagePrice column is then updated through a join on the #PriceHolder table. Because this job will affect every row in the Packages table, I recommend scheduling it to run during off-peak hours (typically in the early morning).

After you run the script, you have an accurate pricing column for each Package. Now you can rewrite the stored procedure to get Package information as

CREATE PROC GetPackageInfo2 @PackageID INT
AS
SELECT * FROM Packages WHERE PackageID = @PackageID

Running either 'EXEC GetPackageInfo 2' or 'EXEC GetPackageInfo2 2' returns from the Packages table the same information about the 500MHz computer with 64MB of RAM, but the second command puts much less load on the SQL Server machine. (Description fields tend to be long, so I suggest viewing the results in a Query Analyzer grid.)

   Prev. page   [1] 2 3     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

Excellent article! I recently architected a very large ecommerce site and used Jeff's techniques to dramatically improve performance.

Hal Helms