The key to great database performance
lies in finding ways to help SQL Server more quickly save or retrieve
data and reduce its response time. In my
previous article, "Database Design for Performance," (November 2006, InstantDoc
ID 93633), I explained how database
design affects performance. Building on
that foundation, let's look now at three
essential areas related to database design:
indexing, maintenance, and statistics. All
three are interrelated: When you have the
right indexes, database maintenance will
be effective, and effective database maintenance includes updating statistics, which
helps SQL Server better use the indexes
when responding to a query.
Indexing for Performance:
General Guidelines
Without going through all of the details
about how to best index, I can recommend some general guidelines you should
follow. First, choose the clustering key.
Next, add your primary and unique
constraints. Manually index your foreign keys. Finally, leverage tools such as
Database Engine Tuning Advisor to help
you do some workload tuning. For more
information about this tool, see the SQL
Server 2005 Books Online topic "Database Engine Tuning Advisor Overview"
at http://msdn2.microsoft.com/En-US/library/ms188639.aspx
However, this is not to say that indexing
for performance is simple—you need to
consider many factors. I'd like to focus on
two areas: clustered indexes and secondary, or nonclustered, indexes. Questions to
consider include the following: Should
every table have a clustered index? Is just
the existence of a clustered index good
enough? Are all clustered indexes equally
effective? With nonclustered indexes, does
the selectivity of the high-order element
matter? Should nonclustered indexes be
wider—that is, should they include more
columns? How much can these indexes
really help SQL Server to retrieve data
more quickly?
Clustered Indexes
When you create the key to a clustered
index, you need to be aware of several
factors. Because of how the storage engine
has changed from the way that SQL
Server 6.5 handled base tables, in SQL
Server 7.0 and later releases the important
things that I look for in a clustering key
are that it's unique, narrow, and static.
Unique. A clustering key should be
unique because it's used as the lookup key
for all nonclustered indexes. For example,
imagine an index in the back of a book.
To enable you to find the data that an
index entry points to, that index entry
must be unique. Otherwise, you wouldn't
find the exact item you're looking for.
Similarly, the clustered index needs to
be unique. However, SQL Server doesn't
require that your clustering key be created
on a unique column. You can create it
on any column. If the clustering key isn't
unique, SQL Server will make it unique
by adding a four-byte integer to the data.
Just realize that creating the clustered index on a nonunique clustering key
causes additional overhead at index creation, wastes disk space, creates additional
costs during INSERTs and UPDATEs,
and in SQL Server 2000, adds the cost of
a clustered index rebuild, which is more
likely to be needed because of the poor
choice of clustering key.
Narrow. A clustering key should be
narrow (i.e., composed of as few columns
as possible), for some of the same reasons
it should be unique. Because the clustering key is used as the lookup key for all
nonclustered indexes, the clustering key is
duplicated in all nonclustered indexes. If
the clustering key is too wide (i.e., composed of more columns than it needs to
be), then all of the nonclustered indexes
will be unnecessarily wide, which wastes
disk space, creates additional costs during
INSERTs and UPDATEs, and requires
more time (because of the greater size)
to rebuild these index structures. So, what
does narrow mean? It means using as few
bytes as possible to help uniquely define
your rows—a narrow numeric when possible, as SQL Server handles numeric data
types more efficiently than it does most
other data types.
Static. A clustering key should be static
for some of the same reasons it should be
unique and narrow. If the clustering key
is used as the lookup key for all nonclustered indexes, then the clustering key is
duplicated in all nonclustered indexes. In
fact, for a given table, the clustering key
will be the most-duplicated data. If this
data changes, then the value will have to be updated in the base table as well as in
every nonclustered index. If the key changes,
it will cause the record to move. When a
record moves, it creates fragmentation. Like
unnecessarily wide, nonclustered indexes,
fragmentation wastes disk space and creates additional costs during INSERTs and
UPDATEs. Fragmentation also increases
the time required to perform data updates
(because of record relocation and the likely
need for subsequent splits) and requires
more maintenance.
Typically, I recommend using a numeric
IDENTITY column as the clustering key,
but I'm frequently asked about using a globally unique identifier (GUID). A GUID does meet the criteria fairly well: It's unique, usually
static, and relatively narrow. So, what's wrong
with using GUID? Apart from being large and
somewhat incomprehensible, in SQL Server
2000, the function that generates GUID
values—newid()—doesn't create an ever-increasing pattern as an IDENTITY column
would. The ability to create an ever-increasing
pattern isn't one of the primary criteria I look
for in a clustering key, but it does add many
benefits. Without an ever-increasing pattern,
your table is likely to become fragmented and
perform poorly. For more information about
GUID, see the sidebar "Using the newsequentialid() GUID Function."
If the clustering key is ever increasing, then new rows have a specific location
where they can be placed. If that location
is at the end of the table, then the new row
needs space allocated to it, but SQL Server
doesn't have to make space in the middle
of the table. If a row is inserted in a location that doesn't have any room, then SQL
Server needs to make room. So, if you insert
a row based on last name, then as rows come
in, space will need to be made where that
name should be placed. To make room, SQL
Server makes a split.
Splits in SQL Server are 50/50 splits.
Simply put, 50 percent of the data stays and
50 percent of the data is moved. This keeps
the index logically intact even though it's
not physically intact. When an index has a
lot of splits, we call that index fragmented.
Good examples of an index that is ever
increasing are IDENTITY columns, which
are also naturally unique, static, and narrow.
Another example is something that meets as
many of these criteria as possible, such as a
datetime column, or even better than that, a
composite key of datetime + identity.
Nonclustered Indexes
The general strategy to take is that it's better
to have fewer, but wider composite indexes
than to have many narrow indexes. Narrow
indexes, such as single-column indexes, have
fewer uses than wider indexes. I've seen a
lot of debates over the selectivity of the first
column (known as the high-order element)
of the index, but it's not really important
whether the first column is the one that's
most frequently used, and that's not my
primary focus around selecting nonclustered
indexes. When you're choosing the right
nonclustered indexes, it's more important
to "know the system" and understand how
SQL Server uses indexes rather than assume
general criteria about the first column.
The high-order element of the indexes
is important; however, the element doesn't
need to be highly selective for the index
to be useful. (Note that the selectivity of
an index depends upon the percentage of
rows in a table having the same value for
the indexed key. An index's selectivity is
optimal if few rows have the same value.)
Let's consider a hypothetical table that's very
wide and that covers personal data similar to that of the census data for the United States.
This table is tracking a lot of items, and the
table width averages 1200 bytes per row.
Using our knowledge of page densities and
record format, we can see that we can fit
only an average of six rows per page. If the
table stores 300 million rows at six rows per
page, the table will be 381GB in size. Generally, a table of this size would be difficult
(and potentially slow) to analyze. However,
if a common aggregate query for the table is
one that returns the number of households
by "HeadofHousehold Gender," then "Age,"
then "Dependents," we're looking at a query
that needs only a small subset of columns
from the table.
Additionally, based on the column order,
this index is also useful for counting
rows by HeadofHouseholdGender alone
or HeadofHousehold Gender and Age
together. You can also use this index for
any left-based subset of columns, that is,
columns that comprise the left portion of
the index. The point is, the high-order element (HeadOfHouseholdGender) is not
very selective, yet this index is useful for
a variety of aggregates. Another thing that
makes this index so effective is its size. With
three columns that are likely to be stored as
a bit (Gender), a tinyint (Dependents), and
another tinyint (Age), the total for this index
key—including an index row header (four
bytes) and the clustering key (about eight
bytes)—is 15 bytes per row. Because this row
is so narrow, SQL Server can fit 539 rows
per page. The personal data table stores 300
million records, yet the index is only 4GB
in size. A 4GB structure is significantly more
optimal to read than if you have to read the
entire table. Also, if you have the data ordered
appropriately for the Group By command
that you need to retrieve, then SQL Server
can achieve a stream aggregate rather than a hash aggregate, which helps the query perform. Additionally, if this data is read-only, as
census data would be, you can also consider
using more interesting forms of indexes
such as those created by Indexed Views.
I'm not saying you can't create an indexed
view on transaction processing systems, but
there are more possibilities of blocking on a
transaction processing system.