October 30, 2008 12:00 AM

Calculate Percentiles

Use T-SQL to improve your statistical analysis
SQL Server Pro
InstantDoc ID #100129
Downloads
100129.zip

A CASE expression in the SELECT list determines whether interpolation is required. If the input mark (@mark) is equal to one of the existing marks in the range, the CASE expression returns the corresponding percentile rank; otherwise, the CASE expression uses the aforementioned formula to apply interpolation. The code in Listing 4 returns the value 0.766667 (76.6667 percent).

With minor revisions to the code in Listing 4, you can calculate the percentile ranks of multiple values stored in a table. Listing 5 shows how to do so. The code populates a table variable called @MyMarks with a few random marks. The outer query in Listing 5 joins the table variable with PctRankRanges to apply the calculation to each mark from the table variable, as opposed to a single mark in a scalar variable. When I ran the code in Listing 5 on my system, I got the output shown in Table 3. Note that the calculation returns NULLs for marks that are smaller than the minimum or greater than the maximum in the Marks table.

Percentiles
As I explained, a percentile is a value below which a certain percent of values fall. I’ll describe the way Excel calculates the PERCENTILE function, then I’ll explain how to implement similar logic with T-SQL.

Given an ordered set of values $v_1, v_2, \ldots, v_{cnt}$ and a requested percentile pct (expressed as a fraction between 0 and 1), Excel’s PERCENTILE function calculates the requested percentile as follows:

Calculate n as pct * (cnt - 1) + 1

Using the marks from our Marks table as an example, the ordered set of marks is: $v_1$ = 40, $v_2$ = 55, $v_3$ = 70, $v_4$ = 70, $v_5$ = 70, $v_6$ = 75, $v_7$ = 75, $v_8$ = 80, $v_9$ = 80, $v_{10}$ = 95, $v_{11}$ = 100. In our case, cnt = 11. For pct = 0.95 (95th percentile), n = 0.95 * (11 - 1) + 1 = 10.5.

Now, let k be the integer part of n and d be the fractional part of n. In our case, k = 10 and d = 0.5. If d = 0, return $v_k$. That is, when n is whole (n = k), return the kth value in the ordered set. If d <> 0, interpolate to produce the percentile as follows:

When k = cnt, return $v_{k-1}$ + d * ($v_k$ - $v_{k-1}$)
When k < cnt, return $v_k$ + d * ($v_{k+1}$ - $v_k$)

In our case, k < cnt (10 < 11), so the calculation is $v_k$ + d * ($v_{k+1}$ - $v_k$) = $v_{10}$ + 0.5 * ($v_{11}$ - $v_{10}$) = 95 + 0.5 * (100 - 95) = 97.5.
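To make the arithmetic concrete, here is a minimal sketch of the same steps for a single percent, using scalar variables. It is not one of the article's listings. The table variable @Marks and its column mark are stand-ins I'm assuming for the Marks table, and the syntax assumes SQL Server 2008 or later (inline variable initialization and multirow VALUES).

```sql
-- Hypothetical stand-in for the Marks table (table and column names are assumed)
DECLARE @Marks TABLE (mark INT NOT NULL);
INSERT INTO @Marks (mark)
VALUES (40), (55), (70), (70), (70), (75), (75), (80), (80), (95), (100);

DECLARE @pct DECIMAL(5, 4) = 0.95;                 -- requested percent as a fraction

DECLARE @cnt INT = (SELECT COUNT(*) FROM @Marks);  -- number of values in the set
DECLARE @n DECIMAL(12, 4) = @pct * (@cnt - 1) + 1; -- position in the ordered set
DECLARE @k INT = FLOOR(@n);                        -- integer part of n
DECLARE @d DECIMAL(12, 4) = @n - @k;               -- fractional part of n

WITH MarksRn AS
(
  -- Row numbers play the role of positions 1, 2, ..., cnt in the ordered set
  SELECT mark, ROW_NUMBER() OVER (ORDER BY mark) AS rn
  FROM @Marks
)
-- Pick positions k and k + 1 and interpolate between them.
-- When d = 0 the second term is zero, so the result is the kth mark.
-- When k = cnt there is no row k + 1, so MIN and MAX are both the last mark.
SELECT @n AS n, @k AS k, @d AS d,
       MIN(mark) + @d * (MAX(mark) - MIN(mark)) AS percentile
FROM MarksRn
WHERE rn IN (@k, @k + 1);
```

With the sample marks and @pct = 0.95, this works out to n = 10.5, k = 10, d = 0.5, and a percentile of 97.5, matching the hand calculation. I used DECIMAL rather than FLOAT for @pct and @n so that the integer and fractional parts of n are not disturbed by binary rounding.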

The code in Listing 6 implements this logic and calculates the percentiles for a whole set of percents represented by the CTE called Pcts. Play with your own percents to see how the output changes. The CTE called CntMarks has a single row with the value cnt representing the count of rows in the Marks table. The query defining the CTE called PctCnt_n calculates the value n described earlier for each percent from Pcts. The query in the CTE called PctCnt_ndk breaks n into its integer and fractional components, k and d, respectively. The query defining the CTE called MarksRn simply calculates row numbers (column rn) for the marks from the Marks table. Think of the row numbers as the positions in the ordered set (1, 2, …, cnt), and the corresponding marks as $v_1, v_2, \ldots, v_{cnt}$.

The outer query joins PctCnt_ndk and MarksRn, matching each percent from PctCnt_ndk with the relevant marks from MarksRn. As a reminder, when d = 0, the relevant mark is $v_k$; else, when k = cnt, the two relevant marks are $v_{k-1}$ and $v_k$; else, when k < cnt, the relevant marks are $v_k$ and $v_{k+1}$. In all three cases, for each percent, the formulas for calculating the percentiles can be generalized to:

MIN(mark) + d * (MAX(mark) - MIN(mark))

The output of the code in Listing 6 is shown in Table 4. As you can see in the output, the median is the mark 75, the lower quartile is the mark 70, and so on.
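If you don't have the download handy, the following sketch follows the description above. It is my reconstruction and not necessarily the code in Listing 6. The CTE names come from the text, but the CTE bodies, the @Marks table variable, the mark column name, and the sample percents are assumptions.

```sql
-- Hypothetical stand-in for the Marks table (table and column names are assumed)
DECLARE @Marks TABLE (mark INT NOT NULL);
INSERT INTO @Marks (mark)
VALUES (40), (55), (70), (70), (70), (75), (75), (80), (80), (95), (100);

WITH Pcts AS
(
  -- Sample percents; replace with your own (each value must be distinct)
  SELECT CAST(0.25 AS DECIMAL(5, 4)) AS pct
  UNION ALL SELECT 0.50
  UNION ALL SELECT 0.75
  UNION ALL SELECT 0.95
),
CntMarks AS
(
  SELECT COUNT(*) AS cnt FROM @Marks
),
PctCnt_n AS
(
  -- n is the (possibly fractional) position of the percentile in the ordered set
  SELECT pct, cnt, pct * (cnt - 1) + 1 AS n
  FROM Pcts CROSS JOIN CntMarks
),
PctCnt_ndk AS
(
  -- Split n into its integer part k and fractional part d
  SELECT pct, cnt, n, FLOOR(n) AS k, n - FLOOR(n) AS d
  FROM PctCnt_n
),
MarksRn AS
(
  SELECT mark, ROW_NUMBER() OVER (ORDER BY mark) AS rn
  FROM @Marks
)
SELECT P.pct,
       MIN(M.mark) + P.d * (MAX(M.mark) - MIN(M.mark)) AS percentile
FROM PctCnt_ndk AS P
  JOIN MarksRn AS M
    -- Match one mark when n is whole, otherwise the two marks to interpolate between
    ON (P.d = 0 AND M.rn = P.k)
    OR (P.d <> 0 AND P.k = P.cnt AND M.rn IN (P.k - 1, P.k))
    OR (P.d <> 0 AND P.k < P.cnt AND M.rn IN (P.k, P.k + 1))
GROUP BY P.pct, P.d
ORDER BY P.pct;
```

For the sample marks, this sketch gives 70 for 0.25, 75 for 0.50, 80 for 0.75, and 97.5 for 0.95. Because the marks are matched in sorted order, MIN(mark) is always the lower of the two matched values and MAX(mark) the higher, which is why the single generalized expression covers all three cases.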

Note that Excel interpolates to calculate the percentile when n is not whole. Another way to calculate percentiles is to round n and return the value in the rounded position in the ordered set, but I’ll leave this method for you as an exercise. [Editor’s Note: If you try this method, please email us with your experiences. Send your feedback to lpeters@penton.com.]
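As one possible starting point for that exercise, here is a minimal sketch of the rounding approach. It reflects my own assumptions and is not the article's solution: it uses T-SQL's ROUND function, which for DECIMAL input rounds halves away from zero, and the same hypothetical @Marks table variable and mark column as a stand-in for the Marks table.

```sql
-- Hypothetical stand-in for the Marks table (table and column names are assumed)
DECLARE @Marks TABLE (mark INT NOT NULL);
INSERT INTO @Marks (mark)
VALUES (40), (55), (70), (70), (70), (75), (75), (80), (80), (95), (100);

DECLARE @pct DECIMAL(5, 4) = 0.95;  -- requested percent as a fraction

WITH MarksRn AS
(
  SELECT mark,
         ROW_NUMBER() OVER (ORDER BY mark) AS rn,  -- position in the ordered set
         COUNT(*) OVER () AS cnt                   -- total number of marks
  FROM @Marks
)
-- Round n = pct * (cnt - 1) + 1 to the nearest position and return that mark
SELECT mark AS percentile
FROM MarksRn
WHERE rn = ROUND(@pct * (cnt - 1) + 1, 0);
```

With the sample marks and @pct = 0.95, n = 10.5 rounds to position 11, so the query returns the mark 100 instead of the interpolated 97.5. A different rounding rule (for example, always rounding down with FLOOR) would pick position 10 and return 95, so the choice of rule matters for small sets.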

Statistically Speaking
You can easily use T-SQL to implement statistical calculations such as percentile and percentile rank. In addition, you’ll likely encounter many other Excel-supported statistical calculations and other types of calculations that are handy for use in a database. Besides the usefulness of such functions in statistical analysis, trying to implement them with T-SQL is a great exercise.


