For example, suppose you have a table with 249 or
more columns and no indexes. (SQL Server allows up to
1,024 columns in a table.) If you execute the sp_createstats
procedure in a database, SQL Server will create statistics
on every unindexed column in the entire database, and
your table will get 249 statistics, with indid values ranging
from 2 to 250. If the table has more than 249 columns, the sp_createstats command will actually report an error, telling
you that it can’t generate some of the requested statistics. If
you then try to create any nonclustered indexes, you’ll get
another error because no more indid values are available.
In SQL Server 2005, the index ID values are stored
as ints. The sysindexes compatibility view has rows for all
indexes and all column statistics. The sys.stats catalog view
contains rows for all statistics, both index statistics and
column statistics. Each row in sys.stats has a stats_id value
that’s unique for the corresponding object. Every index
will have a corresponding statistics row in sys.stats with
the same name (that is, the name of the index is the same
as the statistics name), and the index_id column in sys
.indexes is the same as the stats_id value in sys.stats. However,
not every statistics row has a corresponding row in
sys.indexes. There’s no overlap between the stats_id values
for column statistics and the stats_id values for index
statistics. This means that in SQL Server 2005, you’re no
longer limited in the number of statistics you can have on a
table, and having many column-level statistics won’t affect
the number of nonclustered indexes you can have.
When Are Column
Statistics Useful?
In general, statistics are used to give the optimizer information
about how many rows will satisfy a given predicate
in a WHERE clause or an ON clause. However, you might
wonder what good it will do the optimizer to know how
many rows will satisfy a given condition if there’s no index
to use to access those rows. For example, you should be
aware that a nonclustered index is typically a good choice
only when just a few rows need to be accessed. The SQL
Server optimizer uses the index statistics to determine
whether there are only a few qualifying rows; if so, the
optimizer will choose to use the nonclustered index.
Column statistics might tell the optimizer that there are
only a few rows, but without an index SQL Server can’t
find those few rows efficiently.
For a single-table query with no indexes on the table
at all, the only purpose that column statistics might serve
is to tell you what columns your queries are referencing,
to give you a clue about what columns you might want to
build indexes on. However, for queries involving multiple
tables, or tables that already have indexes, column statistics
can serve a very definite purpose. I’ll show you two
examples where the existence of column statistics can help
SQL Server choose a better plan.
Choosing a Join Order
If you ran the script in Listing 1, you have a Contact table
with 19,972 rows and an Employee table with 290 rows.
Neither table has an index. Let’s build an index on the join
column in the Employee table:
CREATE INDEX Emp_ContactID
ON Employee(ContactID);
The most efficient join plans usually occur when one table
has an index on the join column and the other table is
quite small or is returning only a few rows. But you have
an index on the join column in the small table, and the
other table is quite large. If you run the script in Listing
2 and look at the plan for joining these two tables, you’ll
see that the optimizer chooses to do a hash join, which is
rarely the most efficient type of join operation.
However, if you add a very restrictive WHERE clause
and run the code in Listing 3, you should now see a nested
loops join in the optimizer’s plan. Because of the automatic
statistics on the FirstName column, the optimizer
knows that only a few rows from the Contacts table will
be joined to the Employees table, so the filtered Contacts
table will be accessed first, followed by the index seek on
the Employees table.
If auto_create_statistics wasn’t set to ON, however, the
optimizer would have no way of knowing how many rows
might satisfy the WHERE condition and wouldn’t risk
performing the nested loops join with possibly thousands
of rows. To see this situation, turn off the SHOWPLAN
option and change the auto_create_statistics setting to
OFF, by using the following code:
SET SHOWPLAN_TEXT OFF;
GO
ALTER DATABASE AdventureWorks
SET auto_create_statistics OFF;
Now rerun the code in Listing 1 to recreate the two tables,
then rerun the code in Listing 3. You should see that the
optimizer will no longer choose to use the efficient nested
loops join and will use the hash join instead. Column statistics
on the FirstName column, even without an index,
are crucial to helping the optimizer to choose a good join
order and join algorithm.
Secondary Nonclustered
Index Columns
Another situation in which column statistics can help is
when values in the secondary columns in a nonclustered
index are distributed very unevenly. You should be aware
that for any composite statistics, SQL Server keeps a histogram
for the first column only. To see how this works,
set SHOWPLAN OFF and turn on auto_create_statistics,
as follows:
SET SHOWPLAN_TEXT OFF;
GO
ALTER DATABASE AdventureWorks
SET auto_create_statistics ON;
Then rerun Listing 1. If you build the following index on
the Contacts table, SQL Server will create a histogram of
values for only the LastName column:
CREATE INDEX NameIndex
ON Contact(LastName, FirstName);
Run the code in Listing 4 to view the plans for two similar
queries. One looks for first names that start with K, and
there are 1,255 of them. The other looks for first names
that start with Y, and there are only 37 of them. Although
the NameIndex doesn’t have distribution statistics for
FirstName, SQL Server has built column statistics. The
optimizer realizes that for the names starting with Y, there
are few enough that it’s more efficient to scan the leaf level
of the nonclustered index
and perform a lookup into
the base table 37 times. For
the 1,255 rows with a first
name starting with K, it’s
more efficient for the optimizer
to do a table scan, as
there are only 568 pages in
the table. Reading 568 pages
in a scan is better than scanning
all 101 pages of the
nonclustered index, then
following 1,255 pointers
to the table, which would
require another 1,255 page
reads. But when looking
for names starting with
Y, reading 101 pages in
the nonclustered leaf, plus
doing 37 table lookups, is
better than a table scan.
To see the effect of not
having column statistics
turned on in this case, you
could again turn off auto_
create_statistics, as follows:
SET SHOWPLAN_TEXT OFF;
GO
ALTER DATABASE AdventureWorks
SET auto_create_statistics OFF;
Run Listing 1 again to recreate the tables, and rerun the
CREATE INDEX command to build the NameIndex
index. Then run Listing 4 again, and you’ll see that now
the optimizer chooses to do a table scan on both tables.
Without column statistics on the FirstName column, the
optimizer can’t know that an index scan plus table lookup
will be more efficient than just scanning the entire table.
Keeping the Optimizer Informed
As you’ve seen, by allowing SQL Server to both automatically
create column statistics and keep all statistics updated
as your data changes, you can ensure that the optimizer
has additional useful data for determining query plans.
Enabling column statistics can go a long way toward
helping the optimizer come up with the most favorable
plans for your queries.
End of Article
Prev. page
1
[2]
next page -->