• subscribe
January 21, 2011 01:27 PM

Are Your DMVs Lying to You?

SQL Server Pro
InstantDoc ID #129188
Downloads
129188.zip

I was recently troubleshooting a problem with a critical production server running SQL Server. The server's CPU pegged at 100 percent. At that time of the day, the server's CPU was normally around 30 percent, so I knew something was wrong. A few weeks earlier, a similar problem arose and I had a hunch that a stored procedure was probably causing the trouble. So, this time around, I was prepared to gather more information. I had, ready to go, queries to obtain the cached plan and to obtain statistics from the dynamic management view (DMV) sys.dm_exec_query_stats.

Just as I suspected, the creation_time column in sys.dm_exec_query_stats showed that the stored procedure had recompiled just prior to when the high CPU problem started. I then compared the plan cached during the high CPU cycle to a "good" plan—that is, a plan that I had saved from when the server's CPU was at its normal rate. The comparison revealed that during the high CPU cycle SQL Server was doing a nested loop join when it should have been doing a hash-match join.

Based on this information, I decided to recompile the stored procedure with sp_recompile. Within a minute, CPU usage returned to normal. I again ran the query to get the cached plan. This newest plan looked the same as the good plan I had saved. Everything supported my suspicion that the stored procedure had recompiled and cached a suboptimal plan, which caused the high CPU—that is, everything except for some of the other statistics captured by sys.dm_exec_query_stats.

The high CPU problem spanned roughly 35 minutes, from around 7:35 a.m. to 8:10 a.m. I captured information from sys.dm_exec_query_stats at 8:08 a.m., just prior to when I recompiled the stored procedure. Here's what the DMV reported:

  • last_execution_time = 8:08 a.m.
  • execution_count = 25
  • total_worker_time= 5465822 microseconds

I became suspicious of these statistics for a couple of reasons. First, I knew 25 executions in 35 minutes seemed really low. Second, how could 5.46 seconds of worker time cause a 4´2.3GHz processor to be pegged at 100 percent CPU usage for nearly 35 minutes? It just wasn't adding up, which left me wondering whether the information returned by sys.dm_exec_query_stats was wrong.

Well, the information in sys.dm_exec_query_stats wasn't wrong, just misleading. Here's what happened. The stored procedure was recompiled at about 7:35 a.m. In the next few minutes, it executed 25 times. But then the performance of the stored procedure became so bad that each call started timing out. (I was able to confirm this by working with the developers and digging through the logs on a few of the application servers.) When the stored procedure call is made, the last_execution_time column is updated. But columns like execution_count and total_worker_time aren't updated until the execution completes, which makes sense. For example, SQL Server can't possibly update the total_worker_time column until the query has completed. Until that time, it wouldn't know how long the execution took to run. The same holds true for columns like total_logical_reads and total_logical_writes.



ARTICLE TOOLS

Comments
  • Moericke
    1 year ago
    Feb 28, 2011

    This particular issue was really due to a data load that was doing a full drop/reload vs. doing an incremental/differential load. What happened was essentially this;
    1) data was dropped
    2) our SP in question recompiled (due to auto-stats updating the statistics on the table)
    3) data was reloaded
    4) SP runs poorly because the plan was created when the table was empty, and one of the join operators was optimal when there was little/no data in the table, but is no longer optimal when the table is loaded with data.

    For many reasons, we updated the data loading process to only load the diffs, and not a full drop/reload. This was our "fix".

    So, this was a little different than the *typical* parameter sniffing issue.

    -Adam

  • Kelly
    1 year ago
    Jan 29, 2011

    Trunik, This is mainly caused by a condiiton known as Parameter sniffing. There are several ways to address it but I would take a look at Itzik's article ( http://www.sqlmag.com/article/sql-server/-using-the-recompile-query-hint-to-solve-parameter-sniffing-problems.aspx ) along with the following white paper. http://msdn.microsoft.com/en-us/library/ee343986(SQL.100).aspx

  • Trunik
    1 year ago
    Jan 28, 2011

    I see that you fixed the problem this time, but how did you handle this on an ongoing basis, ie. the next time the job runs?

  • stirk
    1 year ago
    Jan 28, 2011

    Hi,

    Very interesting article.

    For people that don't know much about DMVs, they can discover a lot more about improving SQL performance via DMVs in this forthcoming book www.manning.com/stirk. It contains more than 100 scripts to identify problems, and offers a wide range of solutions.

    Chapter 1 can be downloaded for free and includes scripts for:

    A simple monitor
    Finding your slowest queries
    Find your missing indexes
    Identifying what SQL is running now
    Quickly find a cached plan

    Thanks
    Ian

You must log on before posting a comment.

Are you a new visitor? Register Here