Every DBA knows that there are certain aspects of SQL Server and its associated
hardware that you simply must monitor if you want to measure overall system
performance. However, not everyone knows exactly what to monitor, how to monitor,
and—most important—how to interpret the results. In this longer-than-usual
edition of "PTO for the Rest of Us," I'll go into some depth to explore one
of these crucial puzzle pieces: I'll use a built-in dynamic management view
(DMV) in SQL Server 2005 related to physical file access—sys.dm_io_virtual_file_stats()—and
I'll generate and use a small amount of custom code that will let you easily
report on this DMV's output. Although you'll find this DMV only in SQL Server
2005 or later, it replaces a similar function in SQL Server 2000 called ::fn_virtualfilestats().
This column's techniques, code, and principles will be specific to SQL Server
2005, but you can easily apply them to SQL Server 2000 with minor modifications.
For more on the latter application, join the discussion about this article in
the SQL Server Magazine performance-tuning forums (http://sqlforums.windowsitpro.com).
DMV Benefits
What can this wonderful DMV do for you? In a nutshell, this DMV lets you see
exactly what type of physical I/O your SQL Server system is experiencing when
you're reading from or writing to the files on disk. The physical I/O distinction
is important, compared with the metrics you might get from other SQL Server
functions and utilities, which include or report on only logical I/O. (Logical
I/O is when a page is accessed only in memory, or cache, whereas physical I/O
accesses the file on disk.) So, remember that if the I/O accesses only the SQL
Server cache (hence a logical I/O), the output of this DMV won't reflect that
I/O. Therefore, you can determine with certainty how much you're reading and
writing to your database files on disk. This DMV also indicates how long SQL
Server has taken to complete the reads or writes. And because the DMV tracks
that data at the individual file level, you can use the results to get a terrific
handle on several aspects of your system, such as how busy one file is compared
with another, or which file waits the longest for physical operations to occur.
Let's take a look at the raw DMV output, which Table
1, shows. Most of this data is self-explanatory, and all of it is occasionally
useful, but I'll focus on just a few for the purposes of this article. The data
about number of bytes read and written provides an accurate look at how much
data you're reading or writing to each of the files. The number of reads and
writes specifies how many individual requests were made for that time period.
This information can tell you two very important statistics about your system.
First, there will be no doubt as to which
database files have the most physical disk
I/O—both in terms of the quantity of data
(bytes read or written) and the frequency
of the I/O (number of reads or writes).
Lots of data read in bytes, combined with a
fewer number of reads, can indicate poorly
optimized queries that read too much data
and have difficulty keeping all the data in the
cache. Lots of write requests indicate a busy
system in terms of changes to the database.
Second, together with the I/O stall counters, this raw data can tell you exactly
where your disk bottlenecks are, in terms of the physical disk arrays the files
reside on. The I/O stall counters give you an indication in milliseconds of
how much time elapses during the read or write operation—or the combined
total, depending on the column. So, obviously, the file with the most I/O stalls
is spending the most time waiting. If you add up the stalls for all the files
that reside on a particular physical array, you can see which part of the storage
system is waiting the most. But that information alone doesn't necessarily suggest
that a given array is the poorest-performing. You also have to consider the
amount of data processed and the frequency of the I/O requests for that time
period.
Making It Work
Table 2 shows some data I gathered for two
files on my test system. One is a transaction log file, and the other is a data
file. Note that you must be careful when comparing different types of I/O. Log-file
traffic is typically smaller in size but accessed more frequently than data,
which is larger but characterized by less frequent I/O. These differences suit
the purpose of this discussion perfectly. As you can see, the average wait per
I/O for the data file is a little over 1 millisecond more than that of the log
file. However, the average megabyte per write on the data file is almost four
times the average megabyte per write on the log files. That result merely speaks
to the way data is written to disk—usually in a deferred fashion as opposed
to the transaction log's write-ahead logging behavior. When comparing one data
file with another, or a group of files that reside on one array versus another,
it becomes much easier to spot the offenders. Remember that the number of waits
or requests is always relative, and a high number doesn't always indicate a
problem. If you're writing a ton of data, you'll most likely see high values
in these columns. The goal is to keep the averages as small as feasible; there
will always be some latency or wait times because of the way storage systems
work. Microsoft's "Predeployment I/O Best Practices" white paper (http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx)
has some great information about preparing disk subsystems for use. The paper
also includes handy guidelines for determining how much latency is acceptable
under certain conditions.
Another effective way to utilize this DMV's information is in day-to-day monitoring
of your storage subsystem. You can take some measurements under known conditions,
then use that data as a baseline. Afterward, from time to time, you can compare
current metrics against the baseline to spot trends or even potential problems
before they get out of hand. This easy solution takes only a few lines of code.
The first thing you need to do is to capture the information at periodic intervals.
The information returned from the DMV is a snapshot in time, with values that
are cumulative since the most recent SQL Server restart. (There's currently
no way to reset these counters short of the restart, but such a feature would
be useful.) The DMV materializes the data as a resultset from an in-memory structure
on demand, and the end user can't directly change that data. Because the data
is a snapshot in time, you simply need to capture the data at periodic intervals
and store it in a table. You can then perform a delta of any two snapshots and
see how much physical I/O occurred and how much time passed between those two
points in time. To that end, I've created a relatively simple stored procedure,
which Listing 1 shows, to capture
the values and store them in a table residing in the tempdb database. If you'd
rather use another database, you can modify the code accordingly.
Prev. page  
[1]
2
next page