• subscribe
May 15, 2003 12:00 AM

Beware Row-by-Row Operations in UDF Clothing

SQL Server Pro
InstantDoc ID #39036

User-defined functions (UDFs) are powerful tools when used properly, but they can introduce inefficient row-by-row processing into an otherwise elegant set-based solution when used improperly. Unfortunately, many people don't realize the potential pitfalls of using UDFs.

Most experienced SQL Server professionals know that ANSI SQL cursors create slow, inefficient T-SQL code. Cursors have valid uses, but they're row-by-row operations, which are inefficient compared with set-based operations. Most of us try to avoid T-SQL cursors and instead use set-based operations. However, few people I speak with understand the subtle way that UDFs might cause a set-based operation to take on row-by-row processing characteristics—including the associated row-by-row performance problems.

For example, imagine a scenario in which you have

  • an Employee table containing 100,000 rows
  • a Department table containing 50 distinct values
  • a ranking system that assigns an employee "annual review grade" that's derived from data in other database tables

Imagine that your boss wants you to write a query that returns the average annual review grade for each department. Writing the query would be simple if AnnualReviewGrade were a column in the table. The query might look something like this:

SELECT
   DepartmentId
   ,avg(AnnualReviewGrade) AvgGrade
FROM
   employee
GROUP BY
   DepartmentId

But in our example, the annual review grade information isn't stored as a column. The AnnualReviewGrade calculation is a task that developers might need to perform in multiple pieces of code. Writing a join to get the information would be complicated, so the lead developer decides to write a UDF called GetAnnualReviewGrade that accepts an EmployeeId and returns the grade. You can now write the query for the average annual review grade as

SELECT
   DepartmentId
   ,avg(dbo.AnnualReviewGrade(EmployeeId) AvgGrade
FROM
   employee
GROUP BY
   DepartmentId

Now, let's think through the row-by-row processing implications of the UDF GetAnnualReviewGrade. Imagine that the UDF requires 15 logical I/Os to process, which might not seem bad. But remember that the UDF will be executed once for each row that needs to be evaluated. In this case, we'll be running the UDF once for each employee—100,000 times. That means the UDF alone adds 1.5 million logical I/Os to the processing cost of the query. In contrast, deriving the AnnualReviewGrade value for each employee by using a join or subquery might add just 5000 logical I/Os to the query. The UDF suddenly seems expensive. I've seen similar cases in which a query's processing time dropped from 15 or 20 seconds to less than 500ms when a developer replaced a complex UDF with join processing. Yes, the queries became more complex, and developers might have to code the business logic in more than one place. But dropping 15 to 20 seconds from a query's execution time might be worth the cost.

The problem with this UDF seems obvious. However, real-world problems are typically more difficult to spot, and you can usually see them only after you move code from development to production. The UDF that worked for a 1000-row result set in development might become a performance pig on a 1 million-row production result set. Replacing UDF logic with joins (and other set-based techniques) after the code is in production can be difficult and costly if the development team has used UDFs extensively.

I'm not saying that UDFs are necessarily bad. They're powerful T-SQL tools that I use regularly. However, I encourage you to think through how your code will use the UDF, paying close attention to the number of rows that might run through it in a query. UDFs might seem like a simple way to write set-based T-SQL code. However, you could open a row-by-row can of worms if you're not careful.



ARTICLE TOOLS

Comments
  • Paul McBride
    8 years ago
    Mar 27, 2004

    While I agree with the basic premise of this article, I also find it irresponsible. I use UDFs with great frequency. In fact, when used instead of a view, a UDF can substantially increase the performanceof a query (since views can't be parameterized, and UDFs can).

    In the example you give, the particular use of a UDF (in the select clause) would damage performance. But what if a UDF were used to return a table containing all of these averages, and the UDF were joined to in the FROM clause? We have isolated the logic for calculating the averages, we have made it available to multiple stored procedures, we have optimized the performance of this caclulation through parameterizing the UDF (to ensure we only perform the calculation for those records we will need in our final result set)--in other words, our UDF, performing the same function but as a set-based operation, has greatly improved the maintainability and potentially the performance of our code.

  • Eugen
    8 years ago
    Mar 26, 2004

    Even 'experienced SQL Server professionals' make mistakes, and even publish them: you talked about GetAnnualReviewGrade and then used AnnualReviewGrade in your example ('Get' is missing).
    Note: there is absolutelly NO NEED to publish this comment.
    Thanks

  • karakurum
    8 years ago
    Feb 25, 2004

    please code sample

You must log on before posting a comment.

Are you a new visitor? Register Here