If you use SQL Server 2000, you've likely run into this problem: You want to
select a random sampling of rows from a large table with lots of rows, but you're
unsure of how to do so. Having a random sampling of rows can be useful when
you want to make a smaller version of the table or if you want to troubleshoot
a problem by seeing what kinds of rows are in the table.
To get a random sampling, you might be
tempted to select the top n rows from the
table. However, this sample isn't random and
the first n rows aren't necessarily representative of the whole table. Other solutions exist
that involve adding columns to the tables,
but adding columns isn't always possible or
practical.
The standard way to grab random rows from a small table is to use a query such
as
SELECT TOP 10 PERCENT *
FROM Table1
ORDER BY NEWID()
The key here is the NEWID function,
which generates a globally unique identifier
(GUID) in memory for each row. By definition, the GUID is unique and fairly random,
so when you sort by that GUID with the
ORDER BY clause, you get a random
ordering of the rows in the table. Taking the
top 10 percent (or whatever percentage you
want) will give you a random sampling of
the rows in the table.
The NEWID query is often proposed when questions about how to select random
rows are asked in discussion groups. It is simple and works very well for small
tables. However, the NEWID query has a big drawback when you use it for large
tables. The ORDER BY clause causes all the rows in the table to be copied into
the tempdb database, where they're sorted. This causes two problems:
- The sorting operation usually has a high cost associated with it. Sorting
can use a lot of disk I/O and can run for a long time.
- In the worst-case scenario, tempdb can run out of space. In the best-case
scenario, tempdb can take up a large amount of disk space that will never
be reclaimed without a manual shrink command.
What you need is a way to randomly
select rows that won't use tempdb and won't
get much slower as the table gets larger.
Here's a new idea on how to do that:
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
The basic idea behind this query is to generate a random number between 0 and 99
for each row in the table, then choose all
those rows whose random number is less
than the value of the specified percent. In
this example, we want about 10 percent of
the rows selected randomly, so we choose all
rows whose random number is less than 10.
Let's take a closer look at how the
(ABS(CAST((BINARY_CHECKSUM(*)
* RAND()) as int)) portion of this query
works. The BINARY_CHECKSUM function generates a checksum value based
on the values of the columns you specify.
If two rows are different, they'll typically
generate different checksum numbers. The
BINARY_CHECKSUM function is generally used to verify whether any of the
columns in a row in a table have changed.
However, for our purposes, it generates a
number that looks like a random number
for each row.
The shortcoming of using the BINARY_
CHECKSUM function for our purpose is
that every time it's used on a row that hasn't
been modified, it returns the same checksum
number. Thus, when used by itself, subsequent runs of the query return the same
"random" set of rows, which obviously isn't
desirable.
To fix this shortcoming, we added the RAND function to the BINARY_ CHECKSUM
query. The RAND function scrambles the numbers returned by the BINARY_CHECKSUM
function. Thus, we get a different set of rows each time the query is run, making
it truly random. The ABS and CAST functions are used because BINARY_CHECKSUM(*)
* RAND returns a float that can be a negative number.
Prev. page  
[1]
2
next page